Paperid:1
Authors:Aaditya Ramdas
Abstract:
Sequential anytime-valid inference (SAVI) provides measures of statistical evidence and uncertainty --- e-values and e-processes for testing and confidence sequences for estimation --- that remain valid at all stopping times. These allow for continuous monitoring and analysis of accumulating data and optional stopping for any reason. These methods crucially rely on nonnegative martingales, which are wealth processes of a player in a betting game, thus yielding the area of "game-theoretic statistics". This tutorial will present the game-theoretic philosophy, intuition, language and mathematics behind SAVI, summarized in a recent book https://arxiv.org/pdf/2410.23614, to be published before ICML as the first edition of the new book series, Foundations and Trends in Statistics.
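As a toy illustration of the betting view described above (not taken from the tutorial itself), the sketch below simulates a player's wealth when betting against a fair coin; the betting fraction `lam` and null probability `p0` are illustrative assumptions. Under the null, the wealth is a nonnegative martingale, so it stays valid as evidence at any stopping time.

```python
import random

def wealth_process(flips, lam=0.25, p0=0.5):
    """Betting wealth process for testing H0: P(heads) = p0.

    Each round the player stakes a fraction lam of wealth on heads.
    Under H0 the expected multiplier is exactly 1, so the wealth is a
    nonnegative martingale (an e-process); large wealth at any stopping
    time is quantifiable evidence against H0.
    """
    wealth = [1.0]
    for x in flips:
        # Multiplier is 1 + lam when heads (x=1), 1 - lam when tails (x=0).
        wealth.append(wealth[-1] * (1.0 + lam * (x - p0) / p0))
    return wealth

random.seed(0)
null_flips = [int(random.random() < 0.5) for _ in range(1000)]
w = wealth_process(null_flips)
```

Because the multipliers are bounded below by `1 - lam > 0`, the wealth can never go negative, which is the structural property SAVI methods exploit.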
Paperid:2
Authors:Mark Tygert
Abstract:
There are many different notions of bias and fairness. When comparing subpopulations, an especially important dichotomy is between (1) equal or equitable average outcomes and (2) equal or equitable treatment. In the particular context considered here, "equal treatment" and "equal opportunity" are not too different. However, comparing the average outcome of one subpopulation to another is different and sometimes less desirable than comparing the outcomes of pairs of individuals (one individual from each subpopulation) for which the individuals in each pair are similar. The latter requires comparing outcomes via "conditioning on" or "controlling for" confounding covariates.
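A minimal sketch of the paired-comparison idea above, using hypothetical data and a made-up `matched_pair_gap` helper: raw subpopulation averages can differ even when similar individuals (matched on a confounding covariate) receive identical outcomes.

```python
def matched_pair_gap(group_a, group_b, max_dist=0.5):
    """Average outcome gap over pairs of similar individuals.

    Each individual is a (covariate, outcome) tuple. For every member of
    group_a, find the closest member of group_b by covariate; if it is
    within max_dist, compare their outcomes. This approximates comparing
    outcomes while 'conditioning on' the confounding covariate.
    """
    gaps = []
    for xa, ya in group_a:
        xb, yb = min(group_b, key=lambda p: abs(p[0] - xa))
        if abs(xb - xa) <= max_dist:
            gaps.append(ya - yb)
    return sum(gaps) / len(gaps)

# Hypothetical data: outcome equals covariate in both groups, but the
# groups have different covariate distributions.
group_a = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
group_b = [(1.0, 1.0), (2.0, 2.0), (4.0, 4.0)]
raw_gap = (sum(y for _, y in group_a) / 3) - (sum(y for _, y in group_b) / 3)
```

Here the raw average gap is nonzero purely because of the covariate mismatch, while the matched-pair gap is zero: treatment is equal even though average outcomes are not.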
Paperid:3
Authors:Parikshit Ram · Benjamin Hoover · Dmitry Krotov
Abstract:
Associative Memories like the famous Hopfield Networks are elegant models for describing fully recurrent neural networks whose fundamental job is to store and retrieve information. In the past few years they experienced a surge of interest due to novel theoretical results pertaining to their information storage capabilities, and their relationship with SOTA AI architectures, such as Transformers and Diffusion Models. These connections open up possibilities for interpreting the computation of traditional AI networks through the theoretical lens of Associative Memories. Additionally, novel Lagrangian formulations of these networks make it possible to design powerful distributed models that learn useful representations and inform the design of novel architectures. This tutorial provides an approachable introduction to Associative Memories, emphasizing the modern language and methods used in this area of research, with practical handson mathematical derivations and coding notebooks.
Paperid:4
Authors:Mingzhi Wang · Chengdong Ma · Yaodong Yang
Abstract:
Large Language Model (LLM) alignment has become an increasingly critical topic in contemporary AI research, especially as LLMs continue to scale and integrate into real-world applications. Ensuring that LLMs generate outputs aligned with human values, preferences, and ethical considerations is essential for their safe and effective deployment. This tutorial aims to provide a comprehensive introduction to LLM alignment methods, offering a structured and accessible entry point for researchers and practitioners interested in the field. It will present key concepts and challenges, introduce fundamental approaches such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), and, building on these foundations, review a spectrum of refinements and variants. In addition, it will cover recent advancements in game-theoretic approaches to alignment and theoretical frameworks that provide a deeper understanding of alignment methodologies. Beyond theoretical insights, the tutorial will emphasize the practical aspects of LLM alignment, illustrating how these techniques are applied in real-world scenarios and guiding participants in building intuition about alignment strategies. By the end of the tutorial, attendees will gain a solid foundation in LLM alignment, equipping them with the knowledge needed to critically engage with the field, understand current research trends, and explore future directions.
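To make the DPO objective mentioned above concrete, here is a sketch of the per-pair DPO loss on toy log-probabilities. The function name and the example values are illustrative; the formula follows the published DPO objective, where `beta` scales the policy-vs-reference log-ratio margin between the chosen and rejected responses.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_w / logp_l are the policy's log-probs of the chosen (w) and
    rejected (l) responses; ref_logp_* come from the frozen reference
    model. The loss is -log sigmoid(beta * margin), where the margin is
    the difference of policy-vs-reference log-ratios.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference exactly, the margin is zero and the loss equals log 2; as the policy raises the chosen response's likelihood relative to the rejected one, the loss decreases.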
Paperid:5
Authors:Amy Zhang · Benjamin Eysenbach
Abstract:
This tutorial explores the intersection of generative AI and reinforcement learning, demonstrating how generative models can be understood as RL agents and environments, and conversely, how RL can be viewed as generative modeling. It aims to bridge the gap between these fields, showing how insights from each can enhance the other. The tutorial will cover topics such as reinterpreting generative AI training through an RL lens, adapting generative AI to build new RL algorithms, and understanding how AI agents interacting with tools and humans create a new generative model. It will also discuss future directions and open problems, focusing on how RL can shape the future of foundation model training and enable generative AI systems to construct their own knowledge.
Paperid:6
Authors:Ziyu Yao · Daking Rai
Abstract:
Mechanistic interpretability (MI) is an emerging subfield of interpretability that seeks to understand a neural network model by reverse-engineering its internal computations. Recently, MI has garnered significant attention for interpreting transformer-based language models (LMs), resulting in many novel insights yet introducing new challenges. Given how fast this topic is now attracting the ML/AI community's attention, the goal of this tutorial is to provide a comprehensive overview of MI for LMs, including its historical contexts, the various techniques to implement and evaluate MI, findings and applications based on MI, and future challenges. The tutorial will particularly be presented following an innovative Beginner's Roadmap that the presenters carefully curated, aiming to enable researchers new to MI to quickly pick up this field and leverage MI techniques in their LM applications.
Paperid:7
Authors:Qing Qu · Yuxin Chen · Liyue Shen
Abstract:
Diffusion models have recently gained attention as a powerful class of deep generative models, achieving state-of-the-art results in data generation tasks. In a nutshell, they are designed to learn an unknown data distribution starting from Gaussian noise, mimicking the process of non-equilibrium thermodynamic diffusion. Despite their outstanding empirical successes, the mathematical and algorithmic foundations of diffusion models remain far from mature. For instance: (i) Generalization: it remains unclear how diffusion models, trained on finite samples, can generate new and meaningful data that differ from the training set; (ii) Efficiency: due to the enormous model capacity and the requirement of many sampling steps, they often suffer from slow training and sampling speeds; (iii) Controllability: it remains computationally challenging and unclear how to guide and control the content generated by diffusion models, raising challenges over controllability and safety, as well as solving inverse problems across many scientific imaging applications.
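The noising process at the heart of these models can be sampled in closed form. The sketch below assumes a variance-preserving forward process with an illustrative constant noise schedule (the `betas` values are not from any specific paper): as the step count grows, the signal coefficient shrinks toward zero and the sample approaches pure Gaussian noise.

```python
import math
import random

def forward_diffuse(x0, t, betas):
    """Sample x_t ~ q(x_t | x_0) for a variance-preserving diffusion.

    alpha_bar_t = prod_{s <= t} (1 - beta_s). In closed form,
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,
    so as t grows the signal term vanishes and x_t is pure noise.
    """
    alpha_bar = 1.0
    for s in range(t):
        alpha_bar *= 1.0 - betas[s]
    eps = random.gauss(0.0, 1.0)
    return math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps, alpha_bar

betas = [0.02] * 1000  # constant schedule, assumed for illustration
_, ab_small = forward_diffuse(3.0, 10, betas)
_, ab_large = forward_diffuse(3.0, 1000, betas)
```

The gap between `ab_small` (signal mostly intact) and `ab_large` (signal essentially gone) is what the learned reverse process must bridge, one denoising step at a time, which is also why sampling efficiency is a core open problem.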
Paperid:8
Authors:Natalia Ponomareva · Sergei Vassilvitskii · Peter Kairouz · Alex Bie
Abstract:
This tutorial focuses on the increasingly important area of differentially private (DP) synthetic data generation, addressing the need for robust anonymization in machine learning. Creating DP synthetic data allows for data sharing and analysis without compromising individuals' privacy, opening up possibilities for collaborative research and model training. The tutorial aims to bridge the gap between various related fields, such as DP training, DP inference, and empirical privacy testing, providing a comprehensive guide for generating DP synthetic data across different data types.
Paperid:9
Authors:Hamed Hassani · Amin Karbasi · Alexander Robey · Yaron Singer
Abstract:
Since the inception of large language models (LLMs), considerable attention has been directed toward the field of AI safety. These efforts aim to identify a range of best practices—including evaluation protocols, defense algorithms, and content filters—that facilitate the ethical, trustworthy, and reliable deployment of LLMs and related technologies. A key component of AI safety is model alignment, a broad concept referring to algorithms that optimize the outputs of LLMs to align with human values. And yet, despite these efforts, recent research has identified several failure modes—referred to as jailbreaks—that circumvent LLM alignment by eliciting unsafe content from a targeted model. And while initial jailbreaks targeted the generation of harmful information (e.g., copyrighted or illegal material), modern attacks seek to elicit domain-specific harms, such as digital agents violating user privacy and LLM-controlled robots performing harmful actions in the physical world. In the worst case, future attacks may target self-replication or power-seeking behaviors. The insidious nature of jailbreaking attacks represents a substantial obstacle to the broad adoption of LLMs. Therefore, it is critical for the machine learning community to study these failure modes and develop effective defense strategies that counteract them.
Paperid:10
Authors:qiang liu
Abstract:
Continuous-time generative models—particularly diffusion- and flow-based models—have emerged as a dominant paradigm in generative AI, with applications in image, video, molecular, and audio synthesis, as well as scientific modeling. Despite their success, the field’s rich mathematical structure, varied terminology, and subtle theoretical foundations often lead to confusion and fragmented understanding.
Paperid:11
Authors:Leena Chennuru Vankadara
Abstract:
At the heart of deep learning’s transformative impact lies the concept of scale, encompassing both data and computational resources, as well as their interaction with neural network architectures.
Paperid:12
Authors:Jiaoda Li
Abstract:
The formal basis of the theory of computation lies in the study of languages, subsets of Σ*, the set of all strings over an alphabet Σ. Models of computation can be taxonomized by the languages they can decide, i.e., the languages for which a model can determine membership. For instance, finite-state automata can decide membership in the regular languages. Language models are probabilistic generalizations of languages in which the notion of a set is relaxed into one of a probability distribution over Σ*. Recently, language models parameterized using recurrent neural networks, transformers, and state-space models have achieved enormous success in natural language processing. Similarly to how theorists have taxonomized models of deterministic computation, researchers have sought to taxonomize the expressivity of language models based on various architectures in terms of the distributions over strings they can represent. This tutorial presents a self-contained overview of the formal methods used to taxonomize the expressivity of language models, which encompass formal language and automata theory, various forms of formal logic, circuit complexity, and programming languages such as RASP.
For example, we illustrate how transformers, under varying assumptions, can be characterized by different fragments of formal logic.
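As a minimal concrete instance of the automata-theoretic side of this taxonomy, the sketch below builds a deterministic finite-state automaton that decides a regular language; the `make_dfa` helper is an illustrative construction, not from the tutorial.

```python
def make_dfa(start, accept, delta):
    """Return a membership decider for the language of a DFA.

    start: initial state; accept: set of accepting states;
    delta: dict mapping (state, symbol) -> next state.
    """
    def decides(s):
        q = start
        for ch in s:
            q = delta[(q, ch)]
        return q in accept
    return decides

# DFA deciding the regular language of strings over {a, b}
# containing an even number of a's. State 0 = "even so far".
even_as = make_dfa(
    start=0, accept={0},
    delta={(0, 'a'): 1, (1, 'a'): 0, (0, 'b'): 0, (1, 'b'): 1},
)
```

Expressivity results for neural language models play the same game one level up: instead of asking which sets of strings an architecture can decide, they ask which distributions over strings it can represent.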
Paperid:13
Authors:Sarah Meiklejohn · Hayden Blauzvern · Mihai Maruseac · Spencer Schrock · Laurent Simon · Ilia Shumailov
Abstract:
Powerful machine learning (ML) models are now readily available online, which creates exciting possibilities for users who lack the deep technical expertise or substantial computing resources needed to develop them. On the other hand, this type of open ecosystem comes with many risks. In this paper, we argue that the current ecosystem for open ML models contains significant supply-chain risks, some of which have been exploited already in real attacks. These include an attacker replacing a model with something malicious (e.g., malware), or a model being trained using a vulnerable version of a framework or on restricted or poisoned data. We then explore how Sigstore, a solution designed to bring transparency to open-source software supply chains, can be used to bring transparency to open ML models, in terms of enabling model publishers to sign their models during different stages of development and prove properties about the datasets they use.
Paperid:14
Authors:Aviv Ovadya · Kyle Redman · Oliver Smith · Luke Thorburn · Flynn Devine · Andrew Konya · Smitha Milli · Manon Revel · Quan Ze Chen · Kevin Feng · Amy Zhang · Bilva Chandra · Michiel Bakker · Atoosa Kasirzadeh
Abstract:
This position paper argues that effectively “democratizing AI” requires democratic governance and alignment of AI, and that this is particularly valuable for decisions with systemic societal impacts. Initial steps — such as Meta’s Community Forums and Anthropic’s Collective Constitutional AI — have illustrated a promising direction, where democratic processes might be used to meaningfully improve public involvement and trust in critical decisions. To more concretely explore what increasingly democratic AI might look like, we provide a “Democracy Levels” framework and associated tools that: (i) define milestones toward meaningfully democratic, pluralistic, human-centered, and public-interest AI, (ii) can help guide organizations seeking to increase the legitimacy of their decisions on difficult AI governance and alignment questions, and (iii) support evaluation of such efforts.
Paperid:15
Authors:Yedi Zhang · Yufan Cai · Xinyue Zuo · Xiaokun Luan · Kailong Wang · Zhe Hou · Yifan Zhang · Zhiyuan Wei · Meng Sun · Jun Sun · Jing Sun · Jin Song Dong
Abstract:
Large Language Models (LLMs) have emerged as a transformative AI paradigm, profoundly influencing daily life through their exceptional language understanding and contextual generation capabilities. Despite their remarkable performance, LLMs face a critical challenge: the propensity to produce unreliable outputs due to the inherent limitations of their learning-based nature. Formal methods (FMs), on the other hand, are a well-established computation paradigm that provides mathematically rigorous techniques for modeling, specifying, and verifying the correctness of systems. FMs have been extensively applied in mission-critical software engineering, embedded systems, and cybersecurity. However, the primary challenge impeding the deployment of FMs in real-world settings lies in their steep learning curves, the absence of user-friendly interfaces, and issues with efficiency and adaptability. This position paper outlines a roadmap for advancing the next generation of trustworthy AI systems by leveraging the mutual enhancement of LLMs and FMs. First, we illustrate how FMs, including reasoning and certification techniques, can help LLMs generate more reliable and formally certified outputs. Subsequently, we highlight how the advanced learning capabilities and adaptability of LLMs can significantly enhance the usability, efficiency, and scalability of existing FM tools. Finally, we show that unifying these two computation paradigms, integrating the flexibility and intelligence of LLMs with the rigorous reasoning abilities of FMs, has transformative potential for the development of trustworthy AI software systems. We acknowledge that this integration has the potential to enhance both the trustworthiness and efficiency of software engineering practices while fostering the development of intelligent FM tools capable of addressing complex yet real-world challenges.
Paperid:16
Authors:Kirsten Morehouse · Siddharth Swaroop · Weiwei Pan
Abstract:
We argue that the new frontier of LLM bias probing can (and should) benefit from decades of social science research. While many LLM probes are direct applications of human ones, there remains a disconnect between psychological theory and LLM bias research. This introduces significant challenges: (1) we lack principled criteria for choosing appropriate probes, (2) we lack a system for reconciling conflicting results across probes, and (3) we lack formal frameworks for reasoning about when (and why) probe results will generalize to real user behavior. Existing taxonomies do not resolve these challenges. In this position paper, we describe actionable directions for systematizing LLM bias studies by connecting probes and the constructs they are intended to measure. We introduce EcoLevels – a framework grounded in social science for classifying and comparing bias probes. EcoLevels organizes probes along two features: the (abstraction) level at which bias is assessed and ecological validity (i.e., probe-task generalization). We argue that EcoLevels can prevent end-users from drawing incorrect insights from LLM bias probing.
Paperid:17
Authors:Sarah Hartman · Cheng Soon Ong · Julia Powles · Petra Kuhnert
Abstract:
This position paper argues that achieving meaningful scientific and societal advances with artificial intelligence (AI) requires a responsible, application-driven approach (RAD) to AI research. As AI is increasingly integrated into society, AI researchers must engage with the specific contexts where AI is being applied. This includes being responsive to ethical and legal considerations, technical and societal constraints, and public discourse. We present the case for RAD-AI to drive research through a three-staged approach: (1) building transdisciplinary teams and people-centred studies; (2) addressing context-specific methods, ethical commitments, assumptions, and metrics; and (3) testing and sustaining efficacy through staged testbeds and a community of practice. We present a vision for the future of application-driven AI research to unlock new value through technically feasible methods that are adaptive to the contextual needs and values of the communities they ultimately serve.
Paperid:18
Authors:Yoonsoo Nam · Seok Hyeong Lee · Clémentine Dominé · Yeachan Park · Charles London · Wonyl Choi · Niclas Göring · Seungjai Lee
Abstract:
In physics, complex systems are often simplified into minimal, solvable models that retain only the core principles. In machine learning, layerwise linear models (e.g., linear neural networks) act as simplified representations of neural network dynamics. These models follow the dynamical feedback principle, which describes how layers mutually govern and amplify each other's evolution. This principle extends beyond the simplified models, successfully explaining a wide range of dynamical phenomena in deep neural networks, including neural collapse, emergence, lazy and rich regimes, and grokking. In this position paper, we call for the use of layerwise linear models retaining the core principles of neural dynamical phenomena to accelerate the science of deep learning.
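The dynamical feedback principle described above can be seen in the smallest possible layerwise linear model. The sketch below (an illustrative toy, not from the paper) trains a two-layer scalar linear network by gradient descent; each layer's growth rate is gated by the other, producing the slow-then-fast learning curves these models are known for.

```python
def train_deep_linear(target=2.0, lr=0.01, steps=5000, init=0.01):
    """Two-layer scalar linear net f(x) = w2 * w1 * x fit to y = target * x.

    Gradient descent on the squared error gives the coupled updates
    w1 += lr * w2 * err and w2 += lr * w1 * err: each layer's update is
    scaled by the other layer's weight (the dynamical feedback
    principle), so learning is slow while both weights are small and
    accelerates as they grow together.
    """
    w1 = w2 = init
    for _ in range(steps):
        err = target - w2 * w1
        w1, w2 = w1 + lr * w2 * err, w2 + lr * w1 * err
    return w1 * w2
```

Starting from a small symmetric init, the product `w1 * w2` follows a sigmoidal trajectory toward the target, the same qualitative shape invoked to explain emergence and grokking-like transitions in the full nonlinear setting.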
Paperid:19
Authors:Maya BechlerSpeicher · Ben Finkelshtein · Fabrizio Frasca · Luis Müller · Jan M Tönshoff · Antoine Siraudin · Viktor Zaverkin · Michael Bronstein · Mathias Niepert · Bryan Perozzi · Mikhail Galkin · Christopher Morris
Abstract:
While machine learning on graphs has demonstrated promise in drug design and molecular property prediction, significant benchmarking challenges hinder its further progress and relevance. Current benchmarking practices often lack focus on transformative, real-world applications, favoring narrow domains like two-dimensional molecular graphs over broader, impactful areas such as combinatorial optimization, databases, or chip design. Additionally, many benchmark datasets poorly represent the underlying data, leading to inadequate abstractions and misaligned use cases. Fragmented evaluations and an excessive focus on accuracy further exacerbate these issues, incentivizing overfitting rather than fostering generalizable insights. These limitations have prevented the development of truly useful graph foundation models. This position paper calls for a paradigm shift toward more meaningful benchmarks, rigorous evaluation protocols, and stronger collaboration with domain experts to drive impactful and reliable advances in graph learning research, unlocking the potential of graph learning.
Paperid:20
Authors:Giovanni Luca Marchetti · Vahid Shahverdi · Stefano Mereta · Matthew Trager · Kathlén Kohn
Abstract:
In this position paper, we promote the study of function spaces parameterized by machine learning models through the lens of algebraic geometry. To this end, we focus on algebraic models, e.g. neural networks with polynomial activations, whose associated function spaces are semialgebraic varieties. We outline a dictionary between algebro-geometric invariants of these varieties, such as dimension, degree, and singularities, and fundamental aspects of machine learning, such as sample complexity, expressivity, training dynamics, and implicit bias. Along the way, we review the literature and discuss ideas beyond the algebraic domain. This work lays the foundations of a research direction bridging algebraic geometry and deep learning, that we refer to as neuroalgebraic geometry.
Paperid:21
Authors:Jan Kulveit · Raymond Douglas · Nora Ammann · Deger Turan · David Krueger · David Duvenaud
Abstract:
This paper examines the systemic risks posed by incremental advancements in artificial intelligence, developing the concept of `gradual disempowerment', in contrast to the abrupt takeover scenarios commonly discussed in AI safety. We analyze how even incremental improvements in AI capabilities can undermine human influence over large-scale systems that society depends on, including the economy, culture, and nation-states. As AI increasingly replaces human labor and cognition in these domains, it can weaken both explicit human control mechanisms (like voting and consumer choice) and the implicit alignments with human preferences that often arise from societal systems' reliance on human participation to function. Furthermore, AI systems may amplify existing misalignments with human preferences by optimizing these systems more powerfully. These distortions across domains may be mutually reinforcing: economic power shapes cultural narratives and political decisions, while cultural shifts alter economic and political behavior. We argue that this dynamic could lead to an effectively irreversible loss of human influence over crucial societal systems, precipitating an existential catastrophe through the permanent disempowerment of humanity. This analysis suggests the need for both technical research and governance approaches that specifically address the risk of incremental erosion of human influence across interconnected societal systems.
Paperid:22
Authors:Jaeho Kim · Yunseok Lee · Seulki Lee
Abstract:
The peer review process in major artificial intelligence (AI) conferences faces unprecedented challenges with the surge of paper submissions (exceeding 10,000 submissions per venue), accompanied by growing concerns over review quality and reviewer responsibility. This position paper argues for the need to transform the traditional one-way review system into a bi-directional feedback loop where authors evaluate review quality and reviewers earn formal accreditation, creating an accountability framework that promotes a sustainable, high-quality peer review system. The current review system can be viewed as an interaction between three parties: the authors, reviewers, and system (i.e., conference), where we posit that all three parties share responsibility for the current problems. However, issues with authors can only be addressed through policy enforcement and detection tools, and ethical concerns can only be corrected through self-reflection. As such, this paper focuses on reforming reviewer accountability with systematic rewards through two key mechanisms: (1) a two-stage bi-directional review system that allows authors to evaluate reviews while minimizing retaliatory behavior, (2) a systematic reviewer reward system that incentivizes quality reviewing. We invite the community's strong interest in these problems and the reforms needed to enhance the peer review process.
Paperid:24
Authors:Marta An Kimmel · Mueed Rehman · Yonatan Bisk · Gary Fedder
Abstract:
In this paper, we examine the manufacturability gap in state-of-the-art generative models for 3D object representations. Many models for generating 3D assets focus on rendering virtual content and do not consider the constraints of real-world manufacturing, such as milling, casting, or injection molding. We demonstrate that existing generative models for computer-aided design representations do not generalize to datasets other than those used for training or to unmodified real, human-created objects. We identify limitations of current approaches, including missing manufacturing-readable semantics and ontology, the inability to decompose complex shapes into parameterized segments appropriate for computer-aided manufacturing, and a lack of appropriate scoring metrics to assess the generated output versus the true desired output from the dataset. The academic community could greatly impact real-world manufacturing by rallying around pathways to solve these challenges. We offer revised, more realistic datasets and baseline benchmarks as a step in targeting the challenge. In evaluating these datasets, we find that existing models are severely overfit to simpler data.
Paperid:25
Authors:Sayash Kapoor · Noam Kolt · Seth Lazar
Abstract:
Language model agents could reshape how users navigate and act in digital environments. If controlled by platform companies, which already dominate online search, communication, and commerce, platform agents would intensify surveillance, exacerbate user lock-in, and further entrench the incumbent digital giants. This position paper argues that to resist the undesirable effects of platform agents, we should champion agent advocates: agents that are controlled by users, serve the interests of users, and preserve user autonomy and choice. We identify key interventions to enable agent advocates: ensuring access to compute, developing interoperability protocols and safety standards, and implementing appropriate market regulations.
Paperid:26
Authors:Serena Booth
Abstract:
Consumer protection laws are designed to protect consumers from unethical business practices. In this position paper, I argue that these laws serve an emergent dual purpose: if appropriately enforced and strengthened, consumer protection laws can serve as an inalienable defense for AI safety. These laws are well established and can be enforced and strengthened to incentivize businesses to design and deploy safer AI systems. This position runs counter to two prevailing trends in AI policy. The first alternative position is that AI safety requires an entirely new set of focused laws to protect humanity's prosperity. Though I find these efforts valuable, I argue that such focused laws are both hard to write and easy to skirt. The second alternative position is that consumer protection is nothing more than red tape; I argue that existing laws dating back many decades have already reined in some nefarious business practices related to the development and deployment of AI, and that the litigious society of the United States is well-positioned to use consumer protection laws to encourage new AI safety guardrails. This paper takes a tour of some existing consumer protection laws in the United States and their effects on the development and use of AI systems.
Paperid:27
Authors:Abeer Badawi · Md Tahmid Rahman Laskar · Jimmy Huang · Shaina Raza · Elham Dolatabadi
Abstract:
This position paper argues for a fundamental shift in how Large Language Models (LLMs) are integrated into the mental health care domain. We advocate for their role as co-creators rather than mere assistive tools. While LLMs have the potential to enhance accessibility, personalization, and crisis intervention, their adoption remains limited due to concerns about bias, evaluation, over-reliance, dehumanization, and regulatory uncertainties. To address these challenges, we propose two structured pathways: SAFE-I Implementation Guidelines for ethical and responsible deployment, and HAAS-E Evaluation Framework for multidimensional, human-centered assessment. SAFE-I provides a blueprint for data governance, adaptive model engineering, and real-world integration, ensuring LLMs align with clinical and ethical standards. HAAS-E introduces evaluation metrics that go beyond technical accuracy to measure trustworthiness, empathy, cultural sensitivity, and actionability. We call for the adoption of these structured approaches to establish a responsible and scalable model for LLM-driven mental health support, ensuring that AI complements—rather than replaces—human expertise.
Paperid:28
Authors:Yanzhou Pan · Jiayi Chen · Jiamin Chen · Zhaozhuo Xu · Denghui Zhang
Abstract:
The infringement risks of LLMs have raised significant copyright concerns across different stages of the model lifecycle. While current methods often address these issues separately, this position paper argues that the LLM copyright challenges are inherently connected, and independent optimization of these solutions leads to theoretical bottlenecks. Building on this insight, we further argue that managing LLM copyright risks requires a systemic approach rather than fragmented solutions. In this paper, we analyze the limitations of existing methods in detail and introduce an iterative online-offline joint optimization framework to effectively manage complex LLM copyright risks. We demonstrate that this framework offers a scalable and practical solution to mitigate LLM infringement risks, and also outline new research directions that emerge from this perspective.
Paperid:29
Authors:Kevin Wei · Patricia Paskov · Sunishchal Dev · Michael Byun · Anka Reuel · Xavier Roberts-Gaal · Rachel Calcott · Evie Coxon · Chinmay Deshpande
Abstract:
This position paper argues that human baselines in foundation model evaluations must be more rigorous and more transparent to enable meaningful comparisons of human vs. AI performance. Human performance baselines are vital for the machine learning community, downstream users, and policymakers to interpret AI evaluations. Models are often claimed to achieve "superhuman" performance, but existing baselining methods are neither sufficiently rigorous nor sufficiently well-documented to robustly measure and assess performance differences. Based on a meta-review of the measurement theory and AI evaluation literatures, we derive a framework for assessing human baselining methods. We then use our framework to systematically review 113 human baselines in foundation model evaluations, identifying shortcomings in existing baselining methods. We hope our work can advance more rigorous AI evaluation practices that can better serve both the research community and policymakers. Data is available at: [GITHUB LINK].
Paperid:30
Authors:Jing Yang
Abstract:
The rapid growth of submissions to top-tier Artificial Intelligence (AI) and Machine Learning (ML) conferences has prompted many venues to transition from closed to open review platforms. Some have fully embraced open peer reviews, allowing public visibility throughout the process, while others adopt hybrid approaches, such as releasing reviews only after final decisions or keeping reviews private despite using open peer review systems. In this work, we analyze the strengths and limitations of these models, highlighting the growing community interest in transparent peer review. To support this discussion, we examine insights from Paper Copilot, a website launched two years ago to aggregate and analyze AI / ML conference data while engaging a global audience. The site has attracted over 200,000 early-career researchers, particularly those aged 18–34 from 177 countries, many of whom are actively engaged in the peer review process. Drawing on our findings, this position paper advocates for a more transparent, open, and well-regulated peer review process, aiming to foster greater community involvement and propel advancements in the field.
Paperid:31
Authors:Sunayana Rane · Cyrus Kirkman · Graham Todd · Amanda Royka · Ryan Law · Erica Cartmill · Jacob Foster
Abstract:
It has become increasingly challenging to understand and evaluate LLM capabilities as these models exhibit a broader range of behaviors. In this position paper, we argue that LLM researchers should draw on the lessons from another field which has developed a rich set of experimental paradigms and design practices for probing the behavior of complex intelligent systems: animal cognition. We present five core principles of evaluation drawn from animal cognition research, and explain how they provide invaluable guidance for understanding LLM capabilities and behavior. We ground these principles in an empirical case study, and show how they can already provide a richer picture of one particular reasoning capability: transitive inference.
Paperid:33
Authors:Ruth Elisabeth Appel
Abstract:
There is strong agreement that generative AI should be regulated, but strong disagreement on how to approach regulation. While some argue that AI regulation should mostly rely on extensions of existing laws, others argue that entirely new laws and regulations are needed to ensure that generative AI benefits society. In this position paper, we argue that the debates on generative AI regulation can be informed by the debates and evidence on social media regulation. For example, AI companies have faced allegations of political bias regarding the images and text their models produce, similar to the allegations social media companies have faced regarding content ranking on their platforms. First, we compare and contrast the affordances of generative AI and social media to highlight their similarities and differences. Then, we discuss specific policy recommendations based on the evolution of social media and their regulation. These recommendations include investments in (1) efforts to counter bias and perceptions thereof (e.g., via transparency, researcher access, oversight boards, democratic input, research studies), (2) specific areas of regulatory concern (e.g., youth wellbeing, election integrity) and trust and safety, (3) computational social science research, and (4) a more global perspective. Applying lessons learnt from social media regulation to generative AI regulation can save effort and time, and prevent avoidable mistakes.
Paperid:35
Authors:Oliver Eberle · Thomas McGee · Hamza Giaffar · Taylor Webb · Ida Momennejad
Abstract:
What algorithms do LLMs actually learn and use to solve problems? Studies addressing this question are sparse, as research priorities are focused on improving performance through scale, leaving a theoretical and empirical gap in understanding emergent algorithms in LLMs. This position paper proposes AlgEval: a framework for systematic research into the algorithms that LLMs learn and use. AlgEval aims to uncover both algorithmic primitives (implemented via latent representations, attention weights, and inference-time compute) and algorithmic composition (how these primitives are selected and combined to solve task-specific problems). We highlight potential methodological paths and a case study toward this goal, focusing on emergent search algorithms. Our case study illustrates both the formation of top-down hypotheses about candidate algorithms, and bottom-up tests of these hypotheses via circuit-level analysis of attention patterns and hidden states. The rigorous, systematic evaluation of how LLMs actually solve tasks provides an alternative to resource-intensive scaling, reorienting the field toward a principled understanding of underlying computations. This can in turn lead to more sample-efficient methods for training and improving performance, as well as novel architectures for end-to-end and multi-agent systems.
Paperid:36
Authors:Miriam Doh · Benedikt Höltgen · Piera Riccio · Nuria Oliver
Abstract:
This position paper critiques the reliance on rigid racial taxonomies in machine learning, exposing their U.S.-centric nature and lack of global applicability—particularly in Europe, where race categories are not commonly used. These classifications oversimplify racial identity, erasing the experiences of mixed-race individuals and reinforcing outdated essentialist views that contradict the social construction of race. We suggest research agendas in machine learning that move beyond categorical variables to better address discrimination and social inequality.
Paperid:37
Authors:Alan Jeffares · M van der Schaar
Abstract:
Developing a better understanding of surprising or counterintuitive phenomena has constituted a significant portion of deep learning research in recent years. These include double descent, grokking, and the lottery ticket hypothesis among many others. A natural motivation for this body of work appears to arise from the need to "resolve" or "understand" these phenomena on a narrow case-by-case basis. This position paper asserts that, in many prominent cases, there is little evidence to suggest that these phenomena appear in real-world applications and these efforts may be inefficient in driving progress in the broader field. In response, we advocate for a more intentionally pragmatic methodology, prioritizing a focus on practical utility. We argue that they should be studied in a manner similar to natural phenomena (as in e.g. physics) where they can act as valuable challenges to our broader intuitions by taking a scientific approach. This is in contrast to viewing them as individual challenges that require bespoke resolutions or explanations. This position is reinforced by analyzing the research outcomes of several prominent examples of these phenomena from the recent literature. Finally, we revisit the current norms in the research community in approaching these problems and propose practical recommendations for future research aiming to ensure that progress on deep learning phenomena is well aligned with the ultimate goal of progress in the broader field of deep learning.
Paperid:39
Authors:Jacy Anthis · Ryan Liu · Sean Richardson · Austin Kozlowski · Bernard Koch · Erik Brynjolfsson · James Evans · Michael Bernstein
Abstract:
Accurate and verifiable large language model (LLM) simulations of human research subjects promise an accessible data source for understanding human behavior and training new AI systems. However, results to date have been limited, and few social scientists have adopted these methods. In this position paper, we argue that the promise of LLM social simulations can be achieved by addressing five tractable challenges. We ground our argument in a literature survey of empirical comparisons between LLMs and human research subjects, commentaries on the topic, and related work. We identify promising directions with prompting, finetuning, and complementary methods. We believe that LLM social simulations can already be used for exploratory research, such as pilot experiments for psychology, economics, sociology, and marketing. More widespread use may soon be possible with rapidly advancing LLM capabilities, and researchers should prioritize developing conceptual models and evaluations that can be iteratively deployed and refined at pace with ongoing AI advances.
Paperid:40
Authors:Min-Hsuan Yeh · Jeffrey Wang · Xuefeng Du · Seongheon Park · Leitian Tao · Shawn Im · Sharon Li
Abstract:
As AI systems become increasingly capable and influential, ensuring their alignment with human values, preferences, and goals has become a critical research focus. Current alignment methods primarily focus on designing algorithms and loss functions but often underestimate the crucial role of data. This paper advocates for a shift towards data-centric AI alignment, emphasizing the need to enhance the quality and representativeness of data used in aligning AI systems. In this position paper, we highlight key challenges associated with both human-based and AI-based feedback within the data-centric alignment framework. Through qualitative analysis, we identify multiple sources of unreliability in human feedback, as well as problems related to temporal drift, context dependence, and AI-based feedback failing to capture human values due to inherent model limitations. We propose future research directions, including improved feedback collection practices, robust data-cleaning methodologies, and rigorous feedback verification processes. We call for future research in these critical directions, addressing gaps that persist in understanding and improving data-centric alignment practices.
Paperid:41
Authors:Sebastian Bordt · Eric Raidl · Ulrike Luxburg
Abstract:
In the rapidly growing literature on explanation algorithms, it often remains unclear what precisely these algorithms are for and how they should be used. In this position paper, we argue for a novel and pragmatic perspective: Explainable machine learning needs to recognize its parallels with applied statistics. Concretely, explanations are statistics of high-dimensional functions, and we should think about them analogously to traditional statistical quantities. Among others, this implies that we must think carefully about the matter of interpretation, or how the explanations relate to intuitive questions that humans have about the world. The fact that this is scarcely being discussed in research papers is one of the main drawbacks of the current literature. Luckily, the analogy between explainable machine learning and applied statistics provides a fruitful path for improving research practices.
Paperid:42
Authors:John Hewitt · Robert Geirhos · Been Kim
Abstract:
This position paper argues that, in order to understand AI, we cannot rely on our existing vocabulary of human words. Instead, we should strive to develop neologisms: new words that represent precise human concepts that we want to teach machines, or machine concepts that we need to learn. We start from the premise that humans and machines have differing concepts. This means interpretability can be framed as a communication problem: humans must be able to reference and control machine concepts, and communicate human concepts to machines. Creating a shared human-machine language through developing neologisms, we believe, could solve this communication problem. Successful neologisms achieve a useful amount of abstraction: not too detailed, so they’re reusable in many contexts, and not too high-level, so they convey precise information. As a proof of concept, we demonstrate how a “length neologism” enables controlling LLM response length, while a “diversity neologism” allows sampling more variable responses. Taken together, we argue that we cannot understand AI using our existing vocabulary, and expanding it through neologisms creates opportunities for both controlling and understanding machines better.
Paperid:43
Authors:Ahmed Alaa · Thomas Hartvigsen · Niloufar Golchini · Shiladitya Dutta · Frances Dean · Inioluwa Raji · Travis Zack
Abstract:
Medical large language model (LLM) research often makes bold claims, from encoding clinical knowledge to reasoning like a physician. These claims are usually backed by evaluation on competitive benchmarks, a tradition inherited from mainstream machine learning. But how do we separate real progress from a leaderboard flex? Medical LLM benchmarks, much like those in other fields, are arbitrarily constructed using medical licensing exam questions. For these benchmarks to truly measure progress, they must accurately capture the real-world tasks they aim to represent. In this position paper, we argue that medical LLM benchmarks should—and indeed can—be empirically evaluated for their construct validity. In psychology, "construct validity" refers to how well a test measures the concept it is meant to assess. By drawing an analogy between LLM benchmarks and psychological tests, we explain how frameworks from this field can provide empirical foundations for validating benchmarks. To put this into practice, we use real-world clinical data in proof-of-concept experiments to evaluate popular medical LLM benchmarks and report gaps in their construct validity. Finally, we outline a vision for a new ecosystem of medical LLM evaluation centered around the creation of valid benchmarks.
Paperid:45
Authors:Yan Shvartzshnaider · Vasisht Duddu
Abstract:
The machine learning community is discovering Contextual Integrity (CI) as a useful framework to assess the privacy implications of large language models (LLMs). This is an encouraging development. CI theory emphasizes sharing information in accordance with privacy norms and can bridge the social, legal, political, and technical aspects essential for evaluating privacy in LLMs. However, this is also a good point to reflect on the use of CI for LLMs. This position paper argues that existing literature adopts CI for LLMs without embracing the theory’s fundamental tenets, essentially amounting to a form of "CI-washing." CI-washing could lead to incorrect conclusions and flawed privacy-preserving designs. We clarify the four fundamental tenets of CI theory, systematize prior work according to whether it deviates from these tenets, and highlight overlooked issues in experimental hygiene for LLMs (e.g., prompt sensitivity, positional bias).
Paperid:46
Authors:Sam Bowyer · Laurence Aitchison · Desi Ivanova
Abstract:
Rigorous statistical evaluations of large language models (LLMs), including valid error bars and significance testing, are essential for meaningful and reliable performance assessment. Currently, when such statistical measures are reported, they typically rely on the Central Limit Theorem (CLT). In this position paper, we argue that while CLT-based methods for uncertainty quantification are appropriate when benchmarks consist of thousands of examples, they fail to provide adequate uncertainty estimates for LLM evaluations that rely on smaller, highly specialized benchmarks. In these small-data settings, we demonstrate that CLT-based methods perform very poorly, usually dramatically underestimating uncertainty (i.e. producing error bars that are too small). We give recommendations for alternative frequentist and Bayesian methods that are both easy to implement and more appropriate in these increasingly common scenarios.
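The small-benchmark failure mode this abstract describes is easy to reproduce. The sketch below (our own illustration, not taken from the paper; the benchmark numbers are hypothetical) contrasts a CLT normal-approximation error bar with the Wilson score interval, one standard small-sample alternative, for an accuracy of 19/20:

```python
import math

def clt_interval(k, n, z=1.96):
    """Normal-approximation (CLT) confidence interval for accuracy k/n."""
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_interval(k, n, z=1.96):
    """Wilson score interval: better calibrated for small n, stays in [0, 1]."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical small benchmark: 19 of 20 questions answered correctly.
clt_lo, clt_hi = clt_interval(19, 20)
wil_lo, wil_hi = wilson_interval(19, 20)
print(f"CLT:    [{clt_lo:.3f}, {clt_hi:.3f}]")   # upper bound exceeds 1.0
print(f"Wilson: [{wil_lo:.3f}, {wil_hi:.3f}]")
```

With n = 20 the CLT bar is both too narrow and extends above 1.0, while the Wilson interval remains a valid probability range, consistent with the underestimation the authors report.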
Paperid:47
Authors:Ming Jin · Hyunin Lee
Abstract:
This position paper contends that modern AI research must adopt an antifragile perspective on safety—one in which the system's capacity to handle rare or out-of-distribution (OOD) events adapts and expands over repeated exposures. Conventional static benchmarks and single-shot robustness tests overlook the reality that environments evolve and that models, if left unchallenged, can drift into maladaptation (e.g., reward hacking, over-optimization, or atrophy of broader capabilities). We argue that an antifragile approach, one that, rather than striving to rapidly reduce current uncertainties, leverages them to better prepare for potentially greater, more unpredictable uncertainties in the future, is pivotal for the long-term reliability of open-ended ML systems. In this position paper, we first identify key limitations of static testing, including scenario diversity, reward hacking, and over-alignment. We then explore the potential of dynamic, antifragile solutions to manage rare events. Crucially, we advocate for a fundamental recalibration of the methods used to measure, benchmark, and continually improve AI safety over the long term by providing ethical and practical guidelines to the AI safety community.
Paperid:48
Authors:Bruno Mlodozeniec · David Krueger · Richard E Turner
Abstract:
Causal inference is a key research area in machine learning, yet confusion reigns over the tools needed to tackle it. There are prevalent claims in the machine learning literature that you need a bespoke causal framework or notation to answer causal questions. In this paper, we make it clear that you can answer any causal inference question within the realm of probabilistic modelling and inference, without causal-specific tools or notation. Through concrete examples, we demonstrate how causal questions can be tackled by writing down the probability of everything. We argue for the advantages of the generality of the probabilistic modelling lens, when compared to bespoke causal frameworks. Lastly, we reinterpret causal tools as emerging from standard probabilistic modelling and inference, elucidating their necessity and utility.
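The "probability of everything" thesis can be illustrated with a toy discrete model (our own sketch, not drawn from the paper; all probability tables are made up). An intervention is modelled as just another probabilistic model, one in which the treatment no longer depends on the confounder, and both quantities fall out of ordinary marginalization and conditioning:

```python
# Confounded binary model: Z -> X, Z -> Y, X -> Y.
# All probability tables are hypothetical numbers chosen for illustration.
p_z = {1: 0.5, 0: 0.5}                 # p(z)
p_x1_given_z = {1: 0.8, 0: 0.2}        # p(x=1 | z)
def p_y1(x, z):                        # p(y=1 | x, z)
    return {(1, 1): 0.9, (1, 0): 0.6, (0, 1): 0.7, (0, 0): 0.3}[(x, z)]

# Observational p(y=1 | x=1): condition in the full joint.
num = sum(p_z[z] * p_x1_given_z[z] * p_y1(1, z) for z in (0, 1))
den = sum(p_z[z] * p_x1_given_z[z] for z in (0, 1))
obs = num / den

# Interventional p(y=1 | x set to 1): the same machinery applied to a
# second model in which x is fixed exogenously, so p(x=1 | z) becomes 1.
intv = sum(p_z[z] * 1.0 * p_y1(1, z) for z in (0, 1))

print(f"observational  p(y=1 | x=1) = {obs:.2f}")   # 0.84
print(f"interventional              = {intv:.2f}")  # 0.75
```

The two numbers differ because Z confounds X and Y; no causal-specific notation was needed, only two explicit probabilistic models.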
Paperid:50
Authors:Yucen Li · Daohan Lu · Polina Kirichenko · Shikai Qiu · Tim G. J. Rudner · C. Bayan Bruss · Andrew Wilson
Abstract:
To detect distribution shifts and improve model safety, many out-of-distribution (OOD) detection methods rely on the predictive uncertainty or features of supervised models trained on in-distribution data. In this position paper, we critically re-examine this popular family of OOD detection procedures, and we argue that these methods are fundamentally answering the wrong questions for OOD detection. There is no simple fix to this misalignment, since a classifier trained only on in-distribution classes cannot be expected to identify OOD points; for instance, a cat-dog classifier may confidently misclassify an airplane if it contains features that distinguish cats from dogs, despite looking nothing like either class. We find that uncertainty-based methods incorrectly conflate high uncertainty with being OOD, while feature-based methods incorrectly conflate far feature-space distance with being OOD. We show how these pathologies manifest as irreducible errors in OOD detection and identify common settings where these methods are ineffective. Additionally, interventions to improve OOD detection such as feature-logit hybrid methods, scaling of model and data size, epistemic uncertainty representation, and outlier exposure also fail to address this fundamental misalignment in objectives.
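The cat-dog pathology generalizes: a discriminative model scores only the direction that separates its training classes, so a far-away point that lands deep on one side of the boundary receives near-certain confidence. A minimal sketch with a fixed, hypothetical logistic classifier (the weights and feature values are invented for illustration):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical 2-feature logistic "cat vs. dog" classifier whose decision
# rule happens to depend only on feature 0 (say, ear shape).
w = [4.0, 0.0]

def p_dog(x):
    """Predicted probability of 'dog' for feature vector x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

cat = [-1.0, 0.0]        # typical in-distribution cat
airplane = [5.0, 40.0]   # far from both classes in feature space

# Max-softmax confidence is *higher* for the OOD input than for the cat:
print(p_dog(cat))        # low  -> confidently "cat"
print(p_dog(airplane))   # ~1.0 -> confidently "dog", despite being OOD
```

Because confidence measures distance from the decision boundary, not distance from the training distribution, thresholding uncertainty cannot flag the airplane, matching the conflation the authors describe.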
Paperid:51
Authors:Yunke Wang · Yanxi Li · Chang Xu
Abstract:
AI Scaling has traditionally been synonymous with Scaling Up, which builds larger and more powerful models. However, the growing demand for efficiency, adaptability, and collaboration across diverse applications necessitates a broader perspective. This position paper presents a holistic framework for AI scaling, encompassing Scaling Up, Scaling Down, and Scaling Out. It argues that while Scaling Up of models faces inherent bottlenecks, the future trajectory of AI scaling lies in Scaling Down and Scaling Out. These paradigms address critical technical and societal challenges, such as reducing carbon footprint, ensuring equitable access, and enhancing cross-domain collaboration. We explore transformative applications in healthcare, smart manufacturing, and content creation, demonstrating how AI Scaling can enable breakthroughs in efficiency, personalization, and global connectivity. Additionally, we highlight key challenges, including balancing model complexity with interpretability, managing resource constraints, and fostering ethical development. By synthesizing these approaches, we propose a unified roadmap that redefines the future of AI research and application, paving the way for advancements toward Artificial General Intelligence (AGI).
Paperid:52
Authors:Sunayana Rane
Abstract:
AI systems are now ubiquitous in real-world decision-making. However, their use is often invisible and almost always difficult to understand for the ordinary people who now come into contact with AI regularly. As these AI-driven decision-making systems increasingly replace human counterparts, our ability to understand the reasons behind a decision, and to contest that decision fairly, is quickly being eroded. In the United States legal system, due process includes the right to understand the reasons for certain major decisions and the right to openly contest those decisions. Everyone is entitled to due process under the law, and human decision-makers have been required to adhere to due process when making many important decisions that are now slowly being relegated to AI systems. Using two recent court decisions as a foundation, this paper takes the position that AI in its current form cannot guarantee due process, and therefore cannot (and should not) be used to make decisions that should be subject to due process. The supporting legal analysis investigates how the current lack of technical answers about the interpretability and causality of AI decisions, coupled with extreme trade secret protections severely limiting any exercise of the small amount of technical knowledge we do have, serve as a fatal anti-due-process combination. Finally, we discuss why technical researchers are vital to informing the legal process and to protecting due process.
Paperid:53
Authors:D. Sculley · William Cukierski · Phil Culliton · Sohier Dane · Maggie Demkin · Ryan Holbrook · Addison Howard · Paul Mooney · Walter Reade · Meg Risdal · Nate Keating
Abstract:
In this position paper, we observe that empirical evaluation in Generative AI is at a crisis point since traditional ML evaluation and benchmarking strategies are insufficient to meet the needs of evaluating modern GenAI models and systems. There are many reasons for this, including the fact that these models typically have nearly unbounded input and output spaces, typically do not have a well defined ground truth target, and typically exhibit strong feedback loops and prediction dependence based on context of previous model outputs. On top of these critical issues, we argue that the problems of leakage and contamination are in fact the most important and difficult issues to address for GenAI evaluations. Interestingly, the field of AI Competitions has developed effective measures and practices to combat leakage for the purpose of counteracting cheating by bad actors within a competition setting. This makes AI Competitions an especially valuable (but underutilized) resource. Now is the time for the field to view AI Competitions as the gold standard for empirical rigor in GenAI evaluation, and to harness and harvest their results with the value they deserve.
Paperid:55
Authors:Alex Gu · Naman Jain · WenDing Li · Manish Shetty Molahalli · Kevin Ellis · Koushik Sen · Armando Solar-Lezama
Abstract:
AI for software engineering has made remarkable progress, becoming a notable success within generative AI. Despite this, achieving fully automated software engineering is still a significant challenge, requiring research efforts across both academia and industry. In this position paper, our goal is threefold. First, we provide a taxonomy of measures and tasks to categorize work towards AI software engineering. Second, we outline the key bottlenecks permeating today's approaches. Finally, we highlight promising paths towards making progress on these bottlenecks to guide future research in this rapidly maturing field.
Paperid:56
Authors:Hanqi Yan · Linhai Zhang · Jiazheng Li · Zhenyi Shen · Yulan He
Abstract:
Large language models (LLMs) excel in many reasoning tasks but continue to face significant challenges, such as a lack of robustness in reasoning, struggles with cross-task generalization, and inefficiencies in scaling up reasoning capabilities. Current training paradigms, including next-token prediction and reinforcement learning from human feedback, often fall short in adaptability to diverse reasoning tasks. Existing approaches, such as prompt optimization and iterative output refinement, offer performance improvement, but can be inefficient and lack effective generalization. To overcome these limitations, this position paper argues for a transformative shift in how LLMs approach reasoning. Drawing inspiration from cognitive science, particularly meta-reasoning theories such as Dual-Process Theory and Metacognitive Reasoning, we propose a Bayesian meta-reasoning framework for LLMs. Our approach integrates self-awareness, monitoring, evaluation, regulation, and meta-reflection, to enhance LLMs' ability to refine reasoning strategies and generalize across tasks. We revisit existing LLM reasoning methods, identify key challenges, and suggest directions for future research.
Paperid:57
Authors:Seungwook Han · Jyothish Pari · Samuel Gershman · Pulkit Agrawal
Abstract:
Large Language Models (LLMs) have demonstrated impressive real-world utility, exemplifying artificial useful intelligence (AUI). However, their ability to reason adaptively and robustly -- hallmarks of artificial general intelligence (AGI) -- remains limited. While LLMs seemingly succeed in commonsense reasoning, programming, and mathematics, they struggle to generalize algorithmic understanding across novel syntaxes and contexts. Our experiments with algorithmic tasks in esoteric programming languages reveal that LLMs' reasoning is overfitted to the training data, highlighting limitations in adaptability and transferability. In this position paper, we argue that the core issue in current LLMs lies in the coupling of reasoning and knowledge. To transition from AUI to AGI, we propose disentangling knowledge and reasoning through three key directions: (1) integrating RL during pretraining, (2) using artificial environments to learn reasoning priors and to make search in language more efficient, and (3) developing hybrid architectures that decouple memory from reasoning while maintaining dynamic interaction.
Paperid:58
Authors:Paul Youssef · Zhixue Zhao · Daniel Braun · Jörg Schlötterer · Christin Seifert
Abstract:
Large Language Models (LLMs) contain large amounts of facts about the world. These facts can become outdated over time, which has led to the development of knowledge editing methods (KEs) that can change specific facts in LLMs with limited side effects. This position paper argues that editing LLMs poses serious safety risks that have been largely overlooked. First, we note that KEs' wide availability, low computational cost, high performance, and stealth make them an attractive tool for malicious actors. Second, we discuss malicious use cases of KEs, showing how KEs can be easily adapted for a variety of malicious purposes. Third, we highlight vulnerabilities in the AI ecosystem that allow unrestricted uploading and downloading of updated models without verification. Fourth, we argue that a lack of social and institutional awareness exacerbates this risk, and discuss the implications for different stakeholders. We call on the community to (i) research tamper-resistant models and countermeasures against malicious model editing, and (ii) actively engage in securing the AI ecosystem.
Paperid:59
Authors:Audrey Poinsot · Panayiotis Panayiotou · Alessandro Leite · Nicolas CHESNEAU · Özgür Şimşek · Marc Schoenauer
Abstract:
Causal machine learning has the potential to revolutionize decision-making by combining the predictive power of machine learning algorithms with the theory of causal inference. However, these methods remain underutilized by the broader machine learning community, as current empirical evaluations of these methods do not permit assessment of their reliability and robustness, undermining their practical utility. Specifically, one of the principal criticisms made by the community is the extensive use of synthetic experiments. On the contrary, we argue that synthetic experiments are essential and necessary to precisely assess and understand the capabilities of causal machine learning methods. To substantiate our position, we critically review the current evaluation practices, spotlight their shortcomings, and propose a set of principles for conducting rigorous empirical analyses with synthetic data. Adopting our proposed principles will enable comprehensive evaluations that build trust in causal machine learning methods, driving their broader adoption and impactful real-world use.
Paperid:60
Authors:Yan Zhuang · Qi Liu · Zachary Pardos · Patrick Kyllonen · Jiyun Zu · Zhenya Huang · Shijin Wang · Enhong Chen
Abstract:
As AI systems continue to evolve, their rigorous evaluation becomes crucial for their development and deployment. Researchers have constructed various large-scale benchmarks to determine their capabilities, typically testing against a gold-standard test set and reporting metrics averaged across all items. However, this static evaluation paradigm increasingly shows its limitations, including high evaluation costs, data contamination, and the impact of low-quality or erroneous items on evaluation reliability and efficiency. In this position paper, drawing from human psychometrics, we discuss a paradigm shift from static evaluation methods to adaptive testing. This involves estimating the characteristics or value of each test item in the benchmark and tailoring each model's evaluation instead of relying on a fixed test set. This paradigm provides robust ability estimation, uncovering the latent traits underlying a model's observed scores. We analyze the current possibilities, prospects, and reasons for adopting psychometrics in AI evaluation, and argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
Paperid:61
Authors:Michael Kirchhof · Gjergji Kasneci · Enkelejda Kasneci
Abstract:
Large language models (LLMs) and chatbot agents are known to provide wrong outputs at times, and it was recently found that this can never be fully prevented. Hence, uncertainty quantification plays a crucial role, aiming to quantify the level of ambiguity in either one overall number or two numbers for aleatoric and epistemic uncertainty. This position paper argues that this traditional dichotomy of uncertainties is too limited for the open and interactive setup that LLM agents operate in when communicating with a user, and that we need to research avenues that enrich uncertainties in this novel scenario. We review the literature and find that popular definitions of aleatoric and epistemic uncertainties directly contradict each other and lose their meaning in interactive LLM agent settings. Hence, we propose three novel research directions that focus on uncertainties in such human-computer interactions: underspecification uncertainties, for when users do not provide all information or define the exact task at the first go; interactive learning, to ask follow-up questions and reduce the uncertainty about the current context; and output uncertainties, to utilize the rich language and speech space to express uncertainties as more than mere numbers. We expect that these new ways of dealing with and communicating uncertainties will lead to LLM agent interactions that are more transparent, trustworthy, and intuitive.
Paperid:62
Authors:Andreas Haupt · Erik Brynjolfsson
Abstract:
Benchmarks and evaluations are central to machine learning methodology and direct research in the field. Current evaluations test systems in the absence of humans. This position paper argues that the machine learning community should use centaur evaluations, in which humans and AI jointly solve tasks. Centaur evaluations refocus machine learning development toward human augmentation instead of human replacement. They allow for direct evaluation of human-centered desiderata, such as interpretability and helpfulness, and can be more challenging and realistic than existing evaluations. By shifting the focus from automation toward collaboration between humans and AI, centaur evaluations can drive progress toward more effective and human-augmenting AI systems.
Paperid:63
Authors:Lionel Wong · Ayman Ali · Raymond M Xiong · Shannon Shen · Yoon Kim · Monica Agrawal
Abstract:
Patients have long sought health information online, and increasingly, they are turning to generative AI to answer their health-related queries. Given the high stakes of the medical domain, techniques like retrieval-augmented generation and citation grounding have been widely promoted as methods to reduce hallucinations and improve the accuracy of AI-generated responses, and have been widely adopted into search engines. However, we argue that even when these methods produce literally accurate content drawn from source documents sans hallucinations, they can still be highly misleading. Patients may derive significantly different interpretations from AI-generated outputs than they would from reading the original source material, let alone consulting a knowledgeable clinician. Through a large-scale query analysis on topics including disputed diagnoses and procedure safety, we support our argument with quantitative and qualitative evidence of the suboptimal answers resulting from current systems. In particular, we highlight how these models tend to decontextualize facts, omit critical relevant sources, and reinforce patient misconceptions or biases. We propose a series of recommendations---such as the incorporation of communication pragmatics and enhanced comprehension of source documents---that could help mitigate these issues and extend beyond the medical domain.
Paperid:64
Authors:Elliot Meyerson · Xin Qiu
Abstract:
Decomposing hard problems into subproblems often makes them easier and more efficient to solve. With the high cost of running LLMs at scale, there is an increasing effort to decompose systems into sets of LLM-based agents, each of which can be delegated sub-tasks. However, this decomposition (even when automated) is often intuitive, e.g., based on how a human might assign roles to members of a human team. How close are these role decompositions to optimal? This position paper argues that asymptotic analysis with LLM primitives is needed to reason about the efficiency of such problem decompositions, and that insights from such analysis will unlock opportunities for scaling such systems. By treating the LLM forward pass as the atomic unit of computational cost, one can separate out the (often opaque) inner workings of a particular LLM from the inherent efficiency of how a set of LLMs are orchestrated to solve hard problems. In other words, if we want to scale the deployment of LLMs to the limit, instead of anthropomorphizing LLMs, asymptotic analysis with LLM primitives should be used to reason about and develop more powerful decompositions of large problems into LLM agents.
Paperid:65
Authors:Nikhil Kandpal · Colin Raffel
Abstract:
Large Language Model (LLM) training is an increasingly expensive endeavor due to growing computational requirements, hardware demands, energy costs, and engineering labor. Apart from training costs, an oft-overlooked (and seldom paid) cost is the human labor required to write the trillions of words of text used to train state-of-the-art LLMs. In this position paper, we aim to assign a monetary value to this labor and make the case that the most expensive part of producing an LLM should be the compensation provided to training data producers for their work. To support this position we study 64 LLMs released between 2016 and 2024, breaking down both the cost of training the models as well as the hypothetical cost of creating the training data. Our analysis indicates that even with an extremely conservative estimate of how much compensation should be provided for the human labor that went into creating the training data, the costs of these models' training datasets are 1-3 orders of magnitude larger than the costs to train the models themselves. In the face of the massive gap between the value of training data and the current lack of compensation for its creation, we highlight and discuss research directions that could enable fairer practices in the future.
Paperid:66
Authors:Yuhe Guo · Huayi Tang · Jiahong Ma · Hongteng Xu · Zhewei Wei
Abstract:
Spectral graph learning builds upon two foundations: the graph Fourier basis as its theoretical cornerstone, with polynomial approximation to enable practical implementation. While this framework has led to numerous successful designs, we argue that its effectiveness might stem from mechanisms different from its theoretical foundations. In this paper, we identify two fundamental issues that challenge our current understanding: (1) The graph Fourier basis $\mathbf{U}$ (eigenvectors of the normalized graph Laplacian) faces too many questions to truly serve its intended role, particularly in preserving the semantic properties of Fourier analysis; (2) The limitations preventing expressive filters are not merely practical constraints, but fundamental barriers that naturally protect stability and generalization. Importantly, the two issues entangle with each other. The second has obscured the first: the natural avoidance of complex filters has prevented us from fully confronting the questions about $\mathbf{U}$'s role as a Fourier basis. This observation leads to our position: the effectiveness of spectral GNNs relies less on the graph Fourier basis than originally conceived, or, in other words, **spectral GNNs might not be so spectral**. The position leads us to at least two potential research interests: to incorporate a more semantically meaningful graph dictionary other than $\mathbf{U}$, and to re-examine the theoretical role of the introduced polynomial techniques.
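As a minimal, self-contained illustration (not from the paper) of the object under discussion, the sketch below builds the graph Fourier basis $\mathbf{U}$ for a toy 4-node path graph, as the eigenvectors of the symmetric normalized Laplacian. The graph and signal are made up for demonstration; NumPy is assumed.

```python
import numpy as np

# Adjacency matrix of a 4-node path graph (illustrative example).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}.
deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt

# eigh returns eigenvalues in ascending order; the columns of U form the
# orthonormal "graph Fourier" basis questioned in the abstract.
eigvals, U = np.linalg.eigh(L)

x = np.array([1.0, 2.0, 3.0, 4.0])  # a graph signal
x_hat = U.T @ x                      # graph Fourier transform
x_rec = U @ x_hat                    # inverse transform recovers x
```

A polynomial spectral filter then acts as `U @ np.diag(p(eigvals)) @ U.T` applied to a signal; practical spectral GNNs approximate this with polynomials of `L` precisely so that `U` never has to be formed explicitly.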
Paperid:67
Authors:Golnaz Mesbahi · Parham Mohammad Panahi · Olya Mastikhina · Steven Tang · Martha White · Adam White
Abstract:
In continual RL we want agents capable of never-ending learning, and yet our evaluation methodologies do not reflect this. The standard practice in RL is to assume unfettered access to the deployment environment for the full lifetime of the agent. For example, agent designers select the best performing hyperparameters in Atari by testing each for 200 million frames and then reporting results on 200 million frames. In this position paper, we argue and demonstrate the pitfalls of this inappropriate empirical methodology: lifetime tuning. We provide empirical evidence to support our position by testing DQN and SAC across several continuing and non-stationary environments with two main findings: (1) lifetime tuning does not allow us to identify algorithms that work well for continual learning---all algorithms equally succeed; (2) recently developed continual RL algorithms outperform standard non-continual algorithms when tuning is limited to a fraction of the agent's lifetime. The goal of this paper is to provide an explanation for why recent progress in continual RL has been mixed and motivate the development of empirical practices that better match the goals of continual RL.
Paperid:68
Authors:Andy Zhang · Kevin Klyman · Yifan Mai · Yoav Levine · Yian Zhang · Rishi Bommasani · Percy Liang
Abstract:
Language models are extensively evaluated, but correctly interpreting evaluation results requires knowledge of train-test overlap, which refers to the extent to which the language model is trained on the very data it is being tested on. The public currently lacks adequate information about train-test overlap: most models have no public train-test overlap statistics, and third parties cannot directly measure train-test overlap since they do not have access to the training data. To make this clear, we document the practices of 30 models, finding that just 9 models report train-test overlap: 4 models release training data under open-source licenses, enabling the community to directly measure train-test overlap, and 5 models publish their train-test overlap methodology and statistics. By engaging with language model developers, we provide novel information about train-test overlap for three additional models. Overall, this position paper argues that language model developers should publish train-test overlap statistics and/or training data whenever they report evaluation results on public test sets. We hope our work increases transparency into train-test overlap to increase the community-wide trust in model evaluations.
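For concreteness, one common family of train-test overlap checks is word n-gram matching between test examples and the training corpus. The sketch below is a simplified, hypothetical version of such a check; the function names, the choice of n=3, and the any-match flagging rule are illustrative assumptions, not any developer's published methodology.

```python
def ngrams(text: str, n: int = 3) -> set:
    """Set of word n-grams in a whitespace-tokenized, lowercased text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_fraction(test_examples, train_corpus, n: int = 3) -> float:
    """Fraction of test examples sharing at least one n-gram with training text."""
    train_grams = ngrams(train_corpus, n)
    flagged = sum(1 for ex in test_examples if ngrams(ex, n) & train_grams)
    return flagged / len(test_examples)

train = "the quick brown fox jumps over the lazy dog"
tests = ["the quick brown fox ran", "a totally unrelated sentence here"]
print(overlap_fraction(tests, train))  # 0.5: only the first example overlaps
```

Real reporting pipelines typically use larger n, normalization, and sharded indices over trillions of tokens, but the quantity reported is of this general shape.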
Paperid:69
Authors:Rashid Mushkani · Hugo Berard · Allison Cohen · Shin Koseki
Abstract:
This position paper proposes a “Right to AI,” which asserts that individuals and communities should meaningfully participate in the development and governance of the AI systems that shape their lives. Motivated by the increasing deployment of AI in critical domains and inspired by Henri Lefebvre's concept of the “Right to the City,” we reconceptualize AI as a societal infrastructure, rather than merely a product of expert design. In this paper, we critically evaluate how generative agents, large-scale data extraction, and diverse cultural values bring new complexities to AI oversight. The paper proposes that grassroots participatory methodologies can mitigate biased outcomes and enhance social responsiveness. It asserts that data is socially produced and should be managed and owned collectively. Drawing on Sherry Arnstein’s Ladder of Citizen Participation and analyzing nine case studies, the paper develops a four-tier model for the Right to AI that situates the current paradigm and envisions an aspirational future. It proposes recommendations for inclusive data ownership, transparent design processes, and stakeholder-driven oversight. We also discuss market-led and state-centric alternatives and argue that participatory approaches offer a better balance between technical efficiency and democratic legitimacy.
Paperid:70
Authors:Kaiyu Yang · Gabriel Poesia · Jingxuan He · Wenda Li · Kristin Lauter · Swarat Chaudhuri · Dawn Song
Abstract:
AI for Mathematics (AI4Math) is intellectually intriguing and is crucial for AI-driven system design and verification. Extensive efforts on AI4Math have mirrored techniques in NLP, in particular, training large language models on carefully curated math datasets in text form. As a complementary yet less explored avenue, formal mathematical reasoning is grounded in formal systems such as proof assistants, which can verify the correctness of reasoning and provide automatic feedback. This position paper advocates formal mathematical reasoning as an indispensable component in future AI for math, formal verification, and verifiable generation. We summarize existing progress, discuss open challenges, and envision critical milestones to measure future success.
Paperid:71
Authors:Jillian Fisher · Ruth Elisabeth Appel · Chan Young Park · Yujin Potter · Liwei Jiang · Taylor Sorensen · Shangbin Feng · Yulia Tsvetkov · Margaret Roberts · Jennifer Pan · Dawn Song · Yejin Choi
Abstract:
AI systems often exhibit political bias, influencing users' opinions and decision-making. While political neutrality—defined as the absence of bias—is seen as an ideal solution for fairness and safety, this position paper argues that true political neutrality is neither feasible nor universally desirable due to its subjective nature and the biases inherent in AI training data, algorithms, and user interactions. However, inspired by Joseph Raz's philosophical insight that "neutrality [...] can be a matter of degree" (Raz, 1986), we argue that striving for some neutrality remains essential for promoting balanced AI interactions and mitigating user manipulation. Therefore, we use the term "approximation" of political neutrality to shift the focus from unattainable absolutes to achievable, practical proxies. We propose eight techniques for approximating neutrality across three levels of conceptualizing AI, examining their trade-offs and implementation strategies. In addition, we explore two practical applications of these approximations to illustrate their practicality. Finally, we assess our framework on current large language models (LLMs) at the output level, providing a demonstration of how it can be evaluated. This work seeks to advance nuanced discussions of political neutrality in AI and promote the development of responsible, aligned language models.
Paperid:73
Authors:Andrew C. Cullen · Paul MONTAGUE · Sarah Erfani · Benjamin Rubinstein
Abstract:
While certified robustness is widely promoted as a solution to adversarial examples in machine learning systems, significant challenges remain before these techniques can be meaningfully deployed in real-world applications. We identify critical gaps in current research, including the paradox of detection without distinction, the lack of clear criteria for practitioners to evaluate certification schemes, and the potential security risks arising from users' expectations surrounding "guaranteed" robustness claims. This position paper is a call to arms for the certification research community, proposing concrete steps to address these fundamental challenges and advance the field toward practical applicability.
Paperid:75
Authors:Indradyumna Roy · Saswat Meher · Eeshaan Jain · Soumen Chakrabarti · Abir De
Abstract:
Data sets used in recent work on graph similarity scoring and matching tasks suffer from significant limitations. Using Graph Edit Distance (GED) as a showcase, we highlight pervasive issues such as train-test leakage and poor generalization, which have misguided the community's understanding and assessment of the capabilities of a method or model. These limitations arise, in part, because preparing labeled data is computationally expensive for combinatorial graph problems. We establish some key properties of GED that enable scalable data augmentation for training, and adversarial test set generation. Together, our analysis, experiments and insights establish new, sound guidelines for designing and evaluating future neural networks, and suggest open challenges for future research.
Paperid:76
Authors:Judy Hanwen Shen
Abstract:
Foundation models that are capable of automating cognitive tasks represent a pivotal technological shift, yet their societal implications remain unclear. These systems promise exciting advances, yet they also risk flooding our information ecosystem with formulaic, homogeneous, and potentially misleading synthetic content. Developing benchmarks grounded in real use cases where these risks are most significant is therefore critical. Through a thematic analysis using 2 million language model user prompts, we identify creative composition tasks as a prevalent usage category where users seek help with personal tasks that require everyday creativity. Our fine-grained analysis identifies mismatches between current benchmarks and usage patterns among these tasks. Crucially, we argue that the same use cases that currently lack thorough evaluations can lead to negative downstream impacts. This position paper argues that benchmarks focused on creative composition tasks are a necessary step towards understanding the societal harms of AI-generated content. We call for greater transparency in usage patterns to inform the development of new benchmarks that can effectively measure both the progress and the impacts of models with creative capabilities.
Paperid:77
Authors:Patrik Reizinger · Randall Balestriero · David Klindt · Wieland Brendel
Abstract:
Self-Supervised Learning (SSL) powers many current AI systems. As research interest and investment grow, the SSL design space continues to expand. The Platonic view of SSL, following the Platonic Representation Hypothesis (PRH), suggests that despite different methods and engineering approaches, all representations converge to the same Platonic ideal. However, this phenomenon lacks precise theoretical explanation. By synthesizing evidence from Identifiability Theory (IT), we show that the PRH can emerge in SSL. However, current IT cannot explain SSL's empirical success. To bridge the gap between theory and practice, we propose expanding IT into what we term Singular Identifiability Theory (SITh), a broader theoretical framework encompassing the entire SSL pipeline. SITh would allow deeper insights into the implicit data assumptions in SSL and advance the field towards learning more interpretable and generalizable representations. We highlight three critical directions for future research: 1) training dynamics and convergence properties of SSL; 2) the impact of finite samples, batch size, and data diversity; and 3) the role of inductive biases in architecture, augmentations, initialization schemes, and optimizers.
Paperid:78
Authors:Ossi Räisä · Boris van Breugel · M van der Schaar
Abstract:
Any method's development and practical application are limited by our ability to measure its reliability. The popularity of generative modeling emphasizes the importance of good synthetic data metrics. Unfortunately, previous works have found many failure cases in current metrics, for example lack of outlier robustness and unclear lower and upper bounds. We propose a list of desiderata for synthetic data metrics, and a suite of sanity checks: carefully chosen simple experiments that aim to detect specific and known generative modeling failure modes. Based on these desiderata and the results of our checks, we arrive at our position: all current generative fidelity and diversity metrics are flawed. This significantly hinders practical use of synthetic data. Our aim is to convince the research community to spend more effort in developing metrics, instead of models. Additionally, through analyzing how current metrics fail, we provide practitioners with guidelines on how these metrics should (not) be used.
Paperid:79
Authors:Yash Goel · Ayan Sengupta · Tanmoy Chakraborty
Abstract:
We challenge the dominant focus on neural scaling laws and advocate for a paradigm shift toward downscaling in the development of large language models (LLMs). While scaling laws have provided critical insights into performance improvements through increasing model and dataset size, we emphasize the significant limitations of this approach, particularly in terms of computational inefficiency, environmental impact, and deployment constraints. To address these challenges, we propose a holistic framework for downscaling LLMs that seeks to maintain performance while drastically reducing resource demands. This paper outlines practical strategies for transitioning away from traditional scaling paradigms, advocating for a more sustainable, efficient, and accessible approach to LLM development.
Paperid:80
Authors:Sanchaita Hazra · Bodhisattwa Prasad Majumder · Tuhin Chakrabarty
Abstract:
Current efforts in AI safety prioritize filtering harmful content, preventing manipulation of human behavior, and eliminating existential risks in cybersecurity or biosecurity. While pressing, this narrow focus overlooks critical human-centric considerations that shape the long-term trajectory of a society. In this position paper, we identify the risks of overlooking the impact of AI on the future of work and recommend comprehensive transition support towards the evolution of meaningful labor with human agency. Through the lens of economic theories, we highlight the intertemporal impacts of AI on human livelihood and the structural changes in labor markets that exacerbate income inequality. Additionally, the closed-source approach of major stakeholders in AI development resembles rent-seeking behavior through exploiting resources, breeding mediocrity in creative labor, and monopolizing innovation. To address this, we argue in favor of a robust international copyright anatomy supported by implementing collective licensing that ensures fair compensation mechanisms for using data to train AI models. We strongly recommend a pro-worker framework of global AI governance to enhance shared prosperity and economic justice while reducing technical debt.
Paperid:82
Authors:Matthew Riemer · Zahra Ashktorab · Djallel Bouneffouf · Payel Das · Miao Liu · Justin Weisz · Murray Campbell
Abstract:
This position paper argues that the majority of theory of mind benchmarks are broken because of their inability to directly test how large language models (LLMs) adapt to new partners. This problem stems from the fact that theory of mind benchmarks for LLMs are overwhelmingly inspired by the methods used to test theory of mind in humans and fall victim to a fallacy of attributing human-like qualities to AI agents. We expect that humans will engage in a consistent reasoning process across various questions about a situation, but this is known to not be the case for current LLMs. Most theory of mind benchmarks only measure what we call literal theory of mind: the ability to predict the behavior of others. Measuring this kind of reasoning is very informative in testing the ability of agents with self-consistent reasoning. However, it is important to note the distinction between this and what we actually care about when this self-consistency cannot be taken for granted. We call this functional theory of mind: the ability to adapt to agents in-context following a rational response to predictions about their behavior. We find that top performing open source LLMs may display strong capabilities in literal theory of mind, depending on how they are prompted, but seem to struggle with functional theory of mind -- even when partner policies are exceedingly simple. Simply put, strong literal theory of mind performance does not necessarily imply strong functional theory of mind performance. Achieving functional theory of mind, particularly over long interaction horizons with a partner, is a significant challenge deserving a prominent role in any meaningful LLM theory of mind evaluation.
Paperid:83
Authors:Matej Pičulin · Bernarda Petek · Irena Ograjenšek · Erik Štrumbelj
Abstract:
In this position paper, we argue that user studies are key to evaluating explainable AI and that the current state of user studies is detrimental to the advancement of the field. We support this argument with a review of general and explainable-AI-specific challenges, as well as an analysis of 607 explainable AI papers featuring user studies. To address this issue, we call for better-designed user studies and greater appreciation of high-quality user studies in the AI community.
Paperid:84
Authors:Shayne Longpre · Kevin Klyman · Ruth Elisabeth Appel · Sayash Kapoor · Rishi Bommasani · Michelle Sahar · Sean McGregor · Avijit Ghosh · Borhane Blili-Hamelin · Nathan Butters · Alondra Nelson · Amit Elazari · Andrew Sellars · Casey Ellis · Dane Sherrets · Dawn Song · Harley Geiger · Ilona Cohen · Lauren McIlvenny · Madhulika Srikumar · Mark Jaycox · Markus Anderljung · Nadine Johnson · Nicholas Carlini · Nicolas Miailhe · Nik Marda · Peter Henderson · Rebecca Portnoff · Rebecca Weiss · Victoria Westerhoff · Yacine Jernite · Rumman Chowdhury · Percy Liang · Arvind Narayanan
Abstract:
The widespread deployment of general-purpose AI (GPAI) systems introduces significant new risks. Yet the infrastructure, practices, and norms for reporting flaws in GPAI systems lag far behind more mature industries, like software security. Based on a collaboration between experts from software security, machine learning, law, and policy, we identify key gaps in the evaluation and reporting of GPAI system flaws. Our position paper argues for three interventions to ultimately improve system safety: (1) the use of standardized AI system flaw reporting, (2) the adoption of flaw disclosure programs with researcher protections, and (3) infrastructure to coordinate flaw disclosure across stakeholders. First, we propose standardized AI flaw reports, to ease the process of submitting, reproducing, and triaging identified system flaws. Second, we propose GPAI system providers should adopt broadly-scoped flaw disclosure programs, borrowing from bug bounty programs, with legal safe harbors to protect both researchers and themselves. Third, we advocate for new infrastructure to coordinate flaw reports across the many stakeholders who may be impacted by flaws that commonly affect many independent GPAI systems (e.g., certain jailbreaks). By promoting robust safety reporting and coordination in the GPAI ecosystem, these changes, if adopted, could ultimately improve AI system safety, security, and accountability.
Paperid:85
Authors:David Danhofer · Davide D'Ascenzo · Rafael Dubach · Tomaso A Poggio
Abstract:
Overparametrized Deep Neural Networks (DNNs) have demonstrated remarkable success in a wide variety of domains too high-dimensional for classical learners and shallow networks subject to the curse of dimensionality. However, open questions remain about the fundamental principles that govern the learning dynamics of DNNs. In this position paper, we argue that it is the ability of DNNs to exploit the compositionally sparse structure of the target function that drives their success. As such, DNNs can leverage the property that most functions can be composed from a small set of constituent functions, each of which relies only on a low-dimensional subset of all inputs. We show that this property is shared by all efficiently Turing-computable functions and is therefore highly likely to be present in all current learning problems. While some highly promising theoretical insights on questions concerned with approximation and generalization exist in the setting of compositionally sparse functions, several important questions on the learnability and optimization of DNNs remain open. Completing the picture on the role of compositional sparsity in deep learning is essential to a comprehensive theory of artificial - and even general - intelligence.
Paperid:86
Authors:Tobin South · Samuele Marro · Thomas Hardjono · Robert Mahari · Cedric Whitney · Alan Chan · Alex Pentland
Abstract:
The rapid deployment of autonomous AI agents creates urgent challenges in the areas of authorization, accountability, and access control in task delegation. This position paper argues that authenticated and auditable delegation of authority to AI agents is a critical component of mitigating practical risks and unlocking the value of agents. To support this argument, we examine how existing web authentication and authorization protocols, as well as natural language interfaces to common access control mechanisms, can be extended to enable secure authenticated delegation of authority to AI agents. By extending OAuth 2.0 and OpenID Connect with agent-specific credentials and using transparent translation of natural language permissions into robust scoping rules across diverse interaction modalities, we outline how authenticated delegation can be achieved to enable clear chains of accountability while maintaining compatibility with established authentication and web infrastructure, enabling immediate adoption. This work contributes to ensuring that agentic AI systems perform only appropriate actions. It argues for prioritizing delegation infrastructure as a key component of AI agent governance and provides a roadmap for achieving this.
Paperid:88
Authors:Frithjof Gressmann · Ashley Chen · Lily Xie · Nancy Amato · Lawrence Rauchwerger
Abstract:
Recent advances in bioengineering have enabled the creation of biological neural networks in vitro, significantly reducing the cost, ethical hurdles, and complexity of experimentation with genuine biological neural computation. In this position paper, we argue that this trend offers a unique and timely opportunity to put our understanding of neural computation to the test. By designing artificial neural networks that can interact with and control living neural systems, it is becoming possible to validate computational models beyond simulation and gain empirical insights to help unlock more robust and energy-efficient next-generation AI systems. We provide an overview of key technologies, challenges, and principles behind this development and describe strategies and opportunities for novel machine learning research in this emerging field. We also discuss implications and fundamental questions that could be answered as this technology advances, exemplifying the longer-term impact of increasingly sophisticated in vitro neural networks.
Paperid:89
Authors:Samuel Gabriel Müller · Arik Reuter · Noah Hollmann · David Rügamer · Frank Hutter
Abstract:
Training neural networks on randomly generated artificial datasets leads to Bayesian models that capture the prior defined by the dataset-generating distribution. Prior-data Fitted Networks (PFNs) are a class of methods designed to leverage this insight. In an era of rapidly increasing pre-training computational resources and a near stagnation in the generation of new real-world data in many applications, PFNs are poised to play a more important role across a wide range of applications. They enable the efficient allocation of pre-training compute to low-data scenarios. Originally applied to small Bayesian modeling tasks, the field of PFNs has significantly expanded to address more complex domains and larger datasets. This position paper argues that PFNs and other amortized inference approaches represent the future of Bayesian inference, leveraging amortized learning to tackle data-scarce problems. We thus believe they are a fruitful area of research. In this position paper, we explore their potential and directions to address their current limitations.
Paperid:90
Authors:Sebastin Santy · Prasanta Bhattacharya · Manoel Ribeiro · Kelsey Allen · Sewoong Oh
Abstract:
Progress in AI has relied on human-generated data, from annotator marketplaces to the wider Internet. However, the widespread use of large language models now threatens the quality and integrity of human-generated data on these very platforms. We argue that this issue goes beyond the immediate challenge of filtering AI-generated content -- it reveals deeper flaws in how data collection systems are designed. Existing systems often prioritize speed, scale, and efficiency at the cost of intrinsic human motivation, leading to declining engagement and data quality. We propose that rethinking data collection systems to align with contributors' intrinsic motivations--rather than relying solely on external incentives--can help sustain high-quality data sourcing at scale while maintaining contributor trust and long-term participation.
Paperid:91
Authors:Tennison Liu · M van der Schaar
Abstract:
Self-improving agents aim to continuously acquire new capabilities with minimal supervision. However, current approaches face two key limitations: rigid self-improvement processes fail to generalize across domain/task distributions, and they fail to scale with increasing agent capabilities. We argue that self-improvement requires metacognitive learning---an agent's intrinsic ability to actively evaluate, reflect on, and adapt its learning processes dynamically. Drawing parallels to human metacognition, we introduce a formal framework with three components: metacognitive knowledge (self-assessment of capabilities, tasks, and strategies), metacognitive planning (deciding what and how to learn), and metacognitive evaluation (reflecting on past experiences to refine learning). Analyzing current self-improving agents, we find they rely heavily on extrinsic metacognitive mechanisms, where fixed, human-designed self-improvement loops limit scalability and adaptability. Through analyzing each component, we contend that many of the ingredients needed to realize intrinsic metacognition are already present. Finally, we explore methods to optimally balance metacognitive responsibilities between humans and agents, and identify the key technical hurdles that will enable self-improving agents to sustain efficient, generalizable, and aligned learning over time.
Paperid:92
Authors:Andrew Wilson
Abstract:
Deep learning is often seen as different from other model classes by defying conventional notions of generalization. Popular examples of anomalous generalization behavior include benign overfitting, double descent, and the success of overparametrization. This position paper argues that these phenomena are not distinct to neural networks, or particularly mysterious. Moreover, this generalization behavior can be intuitively understood, and rigorously characterized using longstanding generalization frameworks such as PAC-Bayes and finite hypothesis bounds. We present soft inductive biases as a key unifying principle in explaining these phenomena: rather than restricting the hypothesis space to avoid overfitting, embrace a flexible hypothesis space, with a soft preference for simpler solutions that are consistent with the data. This principle can be encoded in many model classes, and thus deep learning is not as mysterious or different from other model classes as it might seem.
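As a concrete instance of the longstanding frameworks this abstract invokes (a standard result for illustration, not this paper's derivation), the finite hypothesis bound states that with probability at least $1-\delta$ over $n$ i.i.d. samples, simultaneously for every $h$ in a finite class $\mathcal{H}$,

```latex
R(h) \;\le\; \hat{R}(h) \;+\; \sqrt{\frac{\ln|\mathcal{H}| + \ln(1/\delta)}{2n}},
```

where $R$ is the true risk and $\hat{R}$ the empirical risk. On the soft-inductive-bias reading, a large (flexible) $\mathcal{H}$ inflates the $\ln|\mathcal{H}|$ term, but replacing the uniform count with a prior that concentrates on simple hypotheses (as in PAC-Bayes) recovers tight bounds without restricting the hypothesis space.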
Paperid:93
Authors:Simone Drago · Marco Mussi · Alberto Maria Metelli
Abstract:
Mainstream research in theoretical RL is currently focused on designing online learning algorithms with regret bounds that match the corresponding regret lower bound up to multiplicative constants (and, sometimes, logarithmic terms). In this position paper, we constructively question this trend, arguing that algorithms should be designed to at least minimize the amount of unnecessary exploration, and we highlight the significant role constants play in algorithms' actual performance. This trend also exacerbates the misalignment between theoretical researchers and practitioners. As an emblematic example, we consider the case of regret minimization in finite-horizon tabular MDPs. Starting from the well-known UCBVI algorithm, we improve the bonus terms and the corresponding regret analysis. Additionally, we compare our version of UCBVI with both its original version and the state-of-the-art MVP algorithm. Our empirical validation successfully demonstrates how improving the multiplicative constants has significant positive effects on the actual empirical performance of the algorithm under analysis. This raises the question of whether ignoring constants when assessing whether algorithms match the lower bound is the proper approach.
Paperid:94
Authors:Moming Duan · Mingzhe Du · Rui Zhao · Mengying Wang · Yinghui Wu · Nigel Shadbolt · Bingsheng He
Abstract:
The Machine Learning (ML) community has witnessed explosive growth, with millions of ML models being published on the Web. Reusing ML model components is now prevalent. Developers are often required to choose a license to publish and govern the use of their models. Popular options include Apache-2.0, OpenRAIL (Responsible AI Licenses), Creative Commons Licenses (CCs), Llama 2, and GPL-3.0. Currently, no standard or widely accepted best practices exist for model licensing. But does this lack of standardization lead to undesired consequences? Our answer is Yes. After reviewing the clauses of the most widely adopted licenses, we take the position that current model licensing practices are dragging us into a quagmire of legal noncompliance. To support this view, we explore the current practices in model licensing and highlight the differences between various model licenses. We then identify potential legal risks associated with these licenses and demonstrate these risks using examples from real-world repositories on Hugging Face. To foster a more standardized future for model licensing, we also propose a new draft of model licenses, MGLs, to address these challenges and promote better compliance.
Paperid:96
Authors:Hanna Wallach · Meera Desai · A. Feder Cooper · Angelina Wang · Chad Atalla · Solon Barocas · Su Lin Blodgett · Alexandra Chouldechova · Emily Corvi · P. Alex Dow · Jean Garcia-Gathright · Alexandra Olteanu · Nicholas Pangakis · Stefanie Reed · Emily Sheng · Dan Vann · Jennifer Wortman Vaughan · Matthew Vogel · Hannah Washington · Abigail Z. Jacobs
Abstract:
The measurement tasks involved in evaluating generative AI (GenAI) systems are especially difficult, leading to what has been described as "a tangle of sloppy tests [and] apples-to-oranges comparisons" (Roose, 2024). In this position paper, we argue that the ML community would benefit from learning from and drawing on the social sciences when developing and using measurement instruments for evaluating GenAI systems. Specifically, our position is that evaluating GenAI systems is a social science measurement challenge. We present a four-level framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, behaviors, and impacts of GenAI. This framework has two important implications for designing and evaluating evaluations: First, it can broaden the expertise involved in evaluating GenAI systems by enabling stakeholders with different perspectives to participate in conceptual debates. Second, it brings rigor to both conceptual and operational debates by offering a set of lenses for interrogating the validity of measurement instruments and their resulting measurements.
Paperid:97
Authors:Borhane Blili-Hamelin · Christopher Graziul · Leif Hancox-Li · Hananel Hazan · El-Mahdi El-Mhamdi · Avijit Ghosh · Katherine Heller · Jacob Metcalf · Fabricio Murai · Eryk Salvaggio · ANDREW SMART · Todd Snider · Mariame Tighanimine · Talia Ringer · Margaret Mitchell · Shiri Dori-Hacohen
Abstract:
The AI research community plays a vital role in shaping the scientific, engineering, and societal goals of AI research. In this position paper, we argue that focusing on the highly contested topic of artificial general intelligence ('AGI') undermines our ability to choose effective goals. We identify six key traps---obstacles to productive goal setting---that are aggravated by AGI discourse: Illusion of Consensus, Supercharging Bad Science, Presuming Value-Neutrality, Goal Lottery, Generality Debt, and Normalized Exclusion. To avoid these traps, we argue that the AI research community needs to (1) prioritize specificity in engineering and societal goals, (2) center pluralism about multiple worthwhile approaches to multiple valuable goals, and (3) foster innovation through greater inclusion of disciplines and communities. Therefore, the AI research community needs to stop treating "AGI" as the north-star goal of AI research.
Paperid:98
Authors:Anish Dhir · Ruby Sedgwick · Avinash Kori · Ben Glocker · Mark van der Wilk
Abstract:
Current causal discovery approaches require restrictive model assumptions in the absence of interventional data to ensure structure identifiability. These assumptions often do not hold in real-world applications, leading to a loss of guarantees and poor performance in practice. Recent work has shown that, in the bivariate case, Bayesian model selection can greatly improve performance by exchanging restrictive modelling for more flexible assumptions, at the cost of a small probability of making an error. Our work shows that this approach is useful in the important multivariate case as well. We propose a scalable algorithm leveraging a continuous relaxation of the discrete model selection problem. Specifically, we employ the Causal Gaussian Process Conditional Density Estimator (CGP-CDE) as a Bayesian non-parametric model, using its hyperparameters to construct an adjacency matrix. This matrix is then optimised using the marginal likelihood and an acyclicity regulariser, giving the maximum a posteriori causal graph. We demonstrate the competitiveness of our approach, showing it is advantageous to perform multivariate causal discovery without infeasible assumptions using Bayesian model selection.
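The CGP-CDE specifics are not reproduced in the abstract, but acyclicity regularisers of the kind mentioned above are typically NOTEARS-style differentiable penalties that vanish exactly when the weighted adjacency matrix encodes a DAG. A minimal NumPy sketch of that standard penalty (an illustration of the generic technique, not this paper's implementation):

```python
import numpy as np

def acyclicity_penalty(A: np.ndarray) -> float:
    """Polynomial NOTEARS-style penalty: tr((I + (A∘A)/d)^d) - d.
    Equals zero iff the weighted adjacency matrix A encodes a DAG."""
    d = A.shape[0]
    M = np.eye(d) + (A * A) / d  # elementwise square removes signs
    return float(np.trace(np.linalg.matrix_power(M, d)) - d)

dag = np.array([[0.0, 1.0],
                [0.0, 0.0]])     # edge 0 -> 1 only: acyclic
cyc = np.array([[0.0, 1.0],
                [1.0, 0.0]])     # edges 0 <-> 1: a cycle
```

During optimisation such a penalty is added to the (negative) marginal likelihood and tightened towards zero, driving the learned adjacency matrix towards a DAG.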
Paperid:99
Authors:Jingfeng Zhang · Prashanth Krishnamurthy · Naman Patel · Anthony Tzes · Farshad Khorrami
Abstract:
Despite the success of current multimodal learning at scale, its susceptibility to data poisoning attacks poses security concerns in critical applications. Attackers can manipulate model behavior by injecting maliciously crafted yet minute instances into the training set, stealthily mismatching distinct concepts. Recent studies have manifested this vulnerability by poisoning multimodal tasks such as Text-Image Retrieval (TIR) and Visual Question Answering (VQA). However, current attack methods rely only on random choices of concepts for misassociation and random instance selection for injecting the poisoning noise, which often achieves suboptimal effects and even risks failure due to the dilution of poisons by the large number of benign instances. This study introduces MP-Nav (Multimodal Poison Navigator), a plug-and-play module designed to evaluate and even enhance data poisoning attacks against multimodal models. MP-Nav operates at both the concept and instance levels, identifying semantically similar concept pairs and selecting robust instances to maximize the attack efficacy. The experiments corroborate that MP-Nav can significantly improve the efficacy of state-of-the-art data poisoning attacks such as AtoB and ShadowCast in multimodal tasks, and maintain model utility across diverse datasets. Notably, this study underscores the vulnerabilities of multimodal models and calls for corresponding defenses.
Paperid:100
Authors:Wenju Sun · Qingyong Li · Yangliao Geng · Boyang Li
Abstract:
Multi-task model merging offers a promising paradigm for integrating multiple expert models into a unified system without additional training. Existing state-of-the-art techniques, such as Task Arithmetic and its variants, merge models by accumulating task vectors—defined as the parameter differences between pre-trained and fine-tuned models. However, task vector accumulation is often hindered by knowledge conflicts, where conflicting components across different task vectors can lead to performance degradation during the merging process. To address this challenge, we propose Conflict-Aware Task Merging (CAT Merging), a novel training-free framework that selectively trims conflict-prone components from the task vectors. CAT Merging introduces several parameter-specific strategies, including projection for linear weights and masking for scaling and shifting parameters in normalization layers. Extensive experiments on vision and vision-language tasks demonstrate that CAT Merging effectively suppresses knowledge conflicts, achieving average accuracy improvements of up to 4.7% (ViT-B/32) and 2.0% (ViT-L/14) over state-of-the-art methods.
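For context, the task-vector accumulation that CAT Merging refines can be sketched in a few lines. This shows the plain Task Arithmetic baseline on illustrative toy weights, not CAT Merging's projection/masking steps:

```python
import numpy as np

def task_vector(pretrained: dict, finetuned: dict) -> dict:
    """Task vector = fine-tuned minus pre-trained parameters (Task Arithmetic)."""
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def merge(pretrained: dict, task_vectors: list, lam: float = 1.0) -> dict:
    """Baseline merge: add the scaled sum of task vectors to the pre-trained weights."""
    merged = {k: v.copy() for k, v in pretrained.items()}
    for tv in task_vectors:
        for k, delta in tv.items():
            merged[k] += lam * delta
    return merged

# Toy example: two experts fine-tuned from the same pre-trained weights.
pre = {"w": np.array([1.0, 1.0])}
ft_a = {"w": np.array([2.0, 1.0])}   # expert for task A
ft_b = {"w": np.array([1.0, 3.0])}   # expert for task B
out = merge(pre, [task_vector(pre, ft_a), task_vector(pre, ft_b)])
# out["w"] == [2.0, 3.0]
```

The knowledge conflicts described in the abstract arise when the per-task deltas point in opposing directions for the same parameter; CAT Merging's contribution is deciding which components of each delta to trim before this accumulation.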
Paperid:101
Authors:Mohit Pandey · Gopeshh Subbaraj · Artem Cherkasov · Martin Ester · Emmanuel Bengio
Abstract:
Generative Flow Networks (GFlowNets) have recently emerged as a suitable framework for generating diverse and high-quality molecular structures by learning from rewards treated as unnormalized distributions. Previous works in this framework often restrict exploration by using predefined molecular fragments as building blocks, limiting the chemical space that can be accessed. In this work, we introduce Atomic GFlowNets (A-GFNs), a foundational generative model leveraging individual atoms as building blocks to explore drug-like chemical space more comprehensively. We propose an unsupervised pre-training approach using drug-like molecule datasets, which teaches A-GFNs about inexpensive yet informative molecular descriptors such as drug-likeness, topological polar surface area, and synthetic accessibility scores. These properties serve as proxy rewards, guiding A-GFNs towards regions of chemical space that exhibit desirable pharmacological properties. We further implement a goal-conditioned finetuning process, which adapts A-GFNs to optimize for specific target properties. In this work, we pretrain A-GFN on a subset of the ZINC dataset, and by employing robust evaluation metrics we show the effectiveness of our approach when compared to other relevant baseline methods for a wide range of drug design tasks.
Paperid:102
Authors:Shihao Zou · Qingfeng Li · Wei Ji · Jingjing Li · Yongkui Yang · Guoqi Li · Chao Dong
Abstract:
Spiking Neural Networks (SNNs) have shown competitive performance to Artificial Neural Networks (ANNs) in various vision tasks, while offering superior energy efficiency. However, existing SNN-based Transformers primarily focus on single-image tasks, emphasizing spatial features while not effectively leveraging SNNs' efficiency in video-based vision tasks. In this paper, we introduce SpikeVideoFormer, an efficient spike-driven video Transformer, featuring linear temporal complexity $\mathcal{O}(T)$. Specifically, we design a spike-driven Hamming attention (SDHA) which provides a theoretically guided adaptation from traditional real-valued attention to spike-driven attention. Building on SDHA, we further analyze various spike-driven space-time attention designs and identify an optimal scheme that delivers appealing performance for video tasks, while maintaining only linear temporal complexity. The generalization ability and efficiency of our model are demonstrated across diverse downstream video tasks, including classification, human pose tracking, and semantic segmentation. Empirical results show our method achieves state-of-the-art (SOTA) performance compared to existing SNN approaches, with over 15\% improvement on the latter two tasks. Additionally, it matches the performance of recent ANN-based methods while offering significant efficiency gains, achieving $\times 16$, $\times 10$ and $\times 5$ improvements on the three tasks, respectively. _Source code will be available._
Paperid:103
Authors:Wei Chen · Shigui Li · Jiacheng Li · Junmei Yang · John Paisley · Delu Zeng
Abstract:
Density ratio estimation is critical in applications involving f-divergences, yet existing methods struggle with multi-modal distributions or significant distributional discrepancies, leading to the \textit{density-chasm} problem. To address this, we propose Dequantified Diffusion-Bridge Interpolants (DDBI), which leverage diffusion processes to model smooth transitions between distributions. By integrating optimal transport theory, we extend DDBI to solve the Schrödinger-Bridge problem, resulting in Dequantified Schrödinger-Bridge Interpolants (DSBI). Combining these, we present the Dequantified Diffusion-bridge Density-Ratio Estimation ($\text{D}^3\text{RE}$) framework, which theoretically reduces estimation error in asymptotic density ratio estimation. Experiments demonstrate $\text{D}^3\text{RE}$'s effectiveness in a variety of tasks, including mutual information estimation and density estimation.
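The diffusion-bridge constructions are beyond an abstract-level sketch, but the classical density-ratio trick underlying this line of work (standard background, not the paper's method) is easy to state: with equal priors, the Bayes-optimal classifier $c(x)$ between samples of $p$ and $q$ satisfies $p(x)/q(x) = c(x)/(1-c(x))$. A toy check on two Gaussians with a large mean gap, the kind of discrepancy behind the density-chasm problem:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Two 1-D densities with well-separated modes.
p = lambda x: gauss_pdf(x, mu=0.0, sigma=1.0)
q = lambda x: gauss_pdf(x, mu=4.0, sigma=1.0)

def posterior(x):
    """Bayes-optimal classifier c(x) = P(sample came from p | x), equal priors."""
    return p(x) / (p(x) + q(x))

x = 1.0
ratio_direct = p(x) / q(x)
ratio_via_classifier = posterior(x) / (1.0 - posterior(x))  # density-ratio trick
```

In practice $c$ is a learned classifier rather than the exact posterior, and when the two distributions barely overlap the classifier saturates, which is precisely the failure mode the bridge-based interpolants are designed to smooth over.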
Paperid:104
Authors:Jinda Lu · Junkang Wu · Jinghan Li · Xiaojun Jia · Shuo Wang · Yi-Fan Zhang · Junfeng Fang · Xiang Wang · Xiangnan He
Abstract:
Direct Preference Optimization (DPO) has shown effectiveness in aligning multimodal large language models (MLLMs) with human preferences. However, existing methods exhibit an imbalanced responsiveness to data of varying hardness, tending to overfit on the easy-to-distinguish data while underfitting on the hard-to-distinguish data. In this paper, we propose Data- and Model-aware DPO (DAMO) to dynamically adjust the optimization process from two key aspects: (1) a data-aware strategy that incorporates data hardness, and (2) a model-aware strategy that integrates real-time model responses. By combining the two strategies, DAMO enables the model to effectively adapt to data with varying levels of hardness. Extensive experiments on five benchmarks demonstrate that DAMO not only significantly enhances the trustworthiness, but also improves the effectiveness over general tasks. For instance, on the Object HalBench, our DAMO-7B reduces response-level and mentioned-level hallucination by 90.0% and 95.3%, respectively, surpassing the performance of GPT-4V.
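For reference, the standard DPO objective that such data- and model-aware strategies modulate (the textbook loss of Rafailov et al., not DAMO's modified objective) is, for prompt $x$ with preferred response $y_w$ and dispreferred response $y_l$:

```latex
\mathcal{L}_{\mathrm{DPO}}
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)
\right]
```

where $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ a frozen reference model, $\sigma$ the sigmoid, and $\beta$ a temperature. The imbalance described in the abstract arises because this loss applies the same pressure to every preference pair regardless of how hard the pair is to distinguish.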
Paperid:105
Authors:Mingi Jung · Saehyung Lee · Eunji Kim · Sungroh Yoon
Abstract:
Detailed image captioning is essential for tasks like data generation and aiding visually impaired individuals. High-quality captions require a balance between precision and recall, which remains challenging for current multimodal large language models (MLLMs). In this work, we hypothesize that this limitation stems from weakening and increasingly noisy visual attention as responses lengthen. To address this issue, we propose SPARC (Selective Progressive Attention ReCalibration), a training-free method that enhances the contribution of visual tokens during decoding. SPARC is founded on three key observations: (1) increasing the influence of all visual tokens reduces recall; thus, SPARC selectively amplifies visual tokens; (2) as captions lengthen, visual attention becomes noisier, so SPARC identifies critical visual tokens by leveraging attention differences across time steps; (3) as visual attention gradually weakens, SPARC reinforces it to preserve its influence. Our experiments, incorporating both automated and human evaluations, demonstrate that existing methods improve the precision of MLLMs at the cost of recall. In contrast, our proposed method enhances both precision and recall with minimal computational overhead.
Paperid:106
Authors:L. Elisa Celis · Lingxiao Huang · Nisheeth K. Vishnoi
Abstract:
The rapid rise of Generative AI (GenAI) tools has sparked debates about their role in complementing or replacing human workers across job contexts. This paper approaches this question by presenting a mathematical framework for modeling jobs, workers, and metrics to evaluate worker-job fit. A key feature of our framework is its decomposition of skills into decision-level and action-level subskills, reflecting the distinct strengths of humans and AI. We analyze how changes in subskill abilities affect job success, identifying conditions for dramatic shifts in success probability. Furthermore, we establish sufficient conditions under which combining workers with complementary subskills significantly outperforms relying on a single worker. We demonstrate the framework's practicality using data from O*NET and Big-Bench Lite, aligning real-world data with our framework through subskill-division methods. Our results highlight the role of GenAI tools in complementing human workers' skills rather than replacing them.
Paperid:107
Authors:Yixin Liu · Lie Lu · Jihui Jin · Lichao Sun · Andrea Fanelli
Abstract:
The rapid proliferation of generative audio synthesis and editing technologies has raised significant concerns about copyright infringement, data provenance, and the spread of misinformation through deepfake audio. Watermarking offers a proactive solution by embedding imperceptible, identifiable, and traceable marks into audio content. While recent neural-network-based watermarking methods like WavMark and AudioSeal have improved robustness and quality, they struggle to achieve both robust detection and accurate attribution simultaneously. This paper introduces the Cross-Attention Robust Audio Watermark (XAttnMark), which bridges this gap by leveraging partial parameter sharing between the generator and the detector, a cross-attention mechanism for efficient message retrieval, and a temporal conditioning module for improved message distribution. Additionally, we propose a psychoacoustic-aligned temporal-frequency masking loss that captures fine-grained auditory masking effects, enhancing watermark imperceptibility. Our approach achieves state-of-the-art performance in both detection and attribution, demonstrating superior robustness against a wide range of audio transformations, including challenging generative editing with strong editing strength. This work represents a significant step forward in protecting intellectual property and ensuring the authenticity of audio content in the era of generative AI.
Paperid:108
Authors:Sifan Yang · Yuanyu Wan · Peijia Li · Yibo Wang · Xiao Zhang · Zhewei Wei · Lijun Zhang
Abstract:
In this paper, we investigate the acceleration of adaptive subgradient methods through Frequent Directions (FD), a widely-used matrix sketching technique. The state-of-the-art regret bound exhibits a linear dependence on the dimensionality $d$, leading to unsatisfactory guarantees for high-dimensional problems. Additionally, it suffers from an $O(\tau^2 d)$ time complexity per round, which scales quadratically with the sketching size $\tau$. To overcome these issues, we first introduce an algorithm named FTSL, achieving a tighter regret bound that is independent of the dimensionality. The key idea is to integrate FD with adaptive subgradient methods under the primal-dual framework, and add the cumulative discarded information of FD back. To reduce its time complexity, we further utilize fast FD to expedite FTSL, yielding a better complexity of $O(\tau d)$ while maintaining the same regret bound. Moreover, to mitigate the computational cost for optimization problems involving matrix variables (e.g., training neural networks), we adapt FD to Shampoo, a popular optimization algorithm that accounts for the structure of decision, and give a novel analysis under the primal-dual framework. Our proposed method obtains an improved dimension-free regret bound alongside enhanced computational complexity. Experimental results have verified the efficiency and effectiveness of our approaches.
Paperid:109
Authors:Chendi Ge · Xin Wang · Zeyang Zhang · Hong Chen · Jiapei Fan · Longtao Huang · Hui Xue' · Wenwu Zhu
Abstract:
Continual multimodal instruction tuning (CMIT) is crucial for adapting Multimodal Large Language Models (MLLMs) to evolving tasks in real-world scenarios. However, most existing methods adopt a fixed architecture, struggling with adapting to new tasks due to static model capacity. In this paper, we propose evolving the architecture under parameter budgets for dynamic task adaptation in CMIT, a problem which remains unexplored in the literature. The problem presents two main challenges: 1) task architecture conflict, where different tasks require varying layer-wise adaptations, and 2) modality imbalance, where different tasks rely unevenly on modalities, leading to unbalanced updates. To address these challenges, we propose a novel Dynamic Mixture of Curriculum LoRA Experts (D-MoLE) method for CMIT, which automatically evolves the MLLM's architecture with controlled parameter budgets to continually adapt to new tasks while retaining previously learned knowledge. Specifically, we propose a dynamic layer-wise expert allocator, which automatically allocates LoRA experts across layers to resolve architecture conflicts, and routes instructions layer-wise to facilitate knowledge sharing among experts. Then, we propose a gradient-based inter-modal continual curriculum, which adjusts the update ratio of each module in the MLLM based on the difficulty of each modality within the task to alleviate the modality imbalance problem. Extensive experiments show that D-MoLE significantly outperforms state-of-the-art baselines, achieving a 15% average improvement over the best baseline. To the best of our knowledge, this is the first study of continual learning for MLLMs from an architectural perspective.
Paperid:110
Authors:Baijiong Lin · Weisen Jiang · Yuancheng Xu · Hao Chen · YINGCONG CHEN
Abstract:
Multi-objective test-time alignment aims to adapt large language models (LLMs) to diverse multi-dimensional user preferences during inference without retraining. Recently, GenARM (Xu et al., 2025), a test-time alignment method, introduced the Autoregressive Reward Model (ARM) to effectively guide the generation of frozen LLMs during inference. When applied to multi-objective scenarios, GenARM needs multiple separately trained ARMs, each tailored to a specific preference dimension. GenARM faces two key limitations: the need for multiple ARMs increases the inference cost, and the separate training of ARMs leads to misalignment between the guided generation and the user preferences. To address these issues, we propose Preference-aware ARM (PARM), a unified ARM conditioned on user preferences. PARM is implemented by our proposed Preference-Aware Bilinear Low-Rank Adaptation (PBLoRA), which employs a bilinear form to condition the ARM on user preferences, enabling it to dynamically manage trade-offs across multiple preference dimensions during inference. Experiments demonstrate that PARM reduces inference costs and improves alignment performance compared to GenARM. Additionally, PARM enables weak-to-strong guidance, allowing a smaller PARM to guide a larger frozen LLM without expensive training, offering a flexible solution for test-time preference adaptation.
Paperid:111
Authors:Antonio Almudévar · Jose Miguel Hernandez-Lobato · Sameer Khurana · Ricard Marxer · Alfonso Ortega
Abstract:
Contrastive losses have been extensively used as a tool for multimodal representation learning. However, it has been empirically observed that their use is not effective for learning an aligned representation space. In this paper, we argue that this phenomenon is caused by the presence of modality-specific information in the representation space. Although some of the most widely used contrastive losses maximize the mutual information between representations of both modalities, they are not designed to remove the modality-specific information. We give a theoretical description of this problem through the lens of the Information Bottleneck Principle. We also empirically analyze how different hyperparameters affect the emergence of this phenomenon in a controlled experimental setup. Finally, we propose a regularization term in the loss function, derived by means of a variational approximation, that aims to increase the representational alignment. We analyze the advantages of including this regularization term in a set of controlled experiments and real-world applications.
Paperid:112
Authors:Hongyi Liu · Rajarshi Saha · Zhen Jia · Youngsuk Park · Jiaji Huang · Shoham Sabach · Yu-Xiang Wang · George Karypis
Abstract:
Large Language Models (LLMs) have demonstrated exceptional performance in natural language processing tasks, yet their massive size makes serving them inefficient and costly. Semi-structured pruning has emerged as an effective method for model acceleration, but existing approaches are suboptimal because they focus on local, layer-wise optimizations using heuristic rules, failing to leverage global feedback. We present ProxSparse, a learning-based framework for mask selection enabled by regularized optimization. ProxSparse transforms the rigid, non-differentiable mask selection process into a smoother optimization procedure, allowing gradual mask exploration with flexibility. ProxSparse does not involve additional weight updates once the mask is determined. Our extensive evaluations on 7 widely used models show that ProxSparse consistently outperforms previously proposed semi-structured mask selection methods with significant improvement, demonstrating the effectiveness of our learned approach towards semi-structured pruning.
Paperid:113
Authors:Vincent P Grande · Michael Schaub
Abstract:
Topological Data Analysis (TDA) allows us to extract powerful topological and higher-order information on the global shape of a data set or point cloud. Tools like Persistent Homology give a single complex description of the global structure of the point cloud. However, common machine learning applications like classification require point-level information and features. In this paper, we bridge this gap and propose a novel method to extract node-level topological features from complex point clouds using discrete variants of concepts from algebraic topology and differential geometry. We verify the effectiveness of these topological point features (TOPF) on both synthetic and real-world data and study their robustness under noise and heterogeneous sampling.
Paperid:114
Authors:Tejas Jayashankar · Jongha (Jon) Ryu · Gregory Wornell
Abstract:
We propose *ScoreMix*, a novel framework for training one-step generative models by minimizing a class of divergences called the $\alpha$-skew Jensen–Shannon divergence. At its core, ScoreMix estimates the score of mixture distributions between real and fake samples across multiple noise levels. Similar to consistency models, our approach supports both training from scratch and distillation using a pretrained diffusion model. It is simple to implement, requires minimal hyperparameter tuning, and ensures stable training. Experiments on CIFAR-10 and ImageNet 64×64 show that ScoreMix is competitive with and can even outperform existing methods.
Paperid:115
Authors:Liangchen Liu · Nannan Wang · Xi Yang · Xinbo Gao · Tongliang Liu
Abstract:
Prompt learning is a cutting-edge parameter-efficient fine-tuning technique for pre-trained vision-language models (VLMs). Instead of learning a single text prompt, recent works have revealed that learning diverse text prompts can effectively boost the performances on downstream tasks, as the diverse prompted text features can comprehensively depict the visual concepts from different perspectives. However, diverse prompt learning demands enormous computational resources. This efficiency issue still remains unexplored. To achieve efficient and diverse prompt learning, this paper proposes a novel \textbf{Surrogate Prompt Learning (SurPL)} framework. Instead of learning diverse text prompts, SurPL directly generates the desired prompted text features via a lightweight \textbf{Surrogate Feature Generator (SFG)}, thereby avoiding the complex gradient computation procedure of conventional diverse prompt learning. Concretely, based on a basic prompted text feature, SFG can directly and efficiently generate diverse prompted features according to different pre-defined conditional signals. Extensive experiments indicate the effectiveness of the surrogate prompted text features, and show compelling performances and efficiency of SurPL on various benchmarks.
Paperid:116
Authors:Hao Gu · Wei Li · Lujun Li · Qiyuan Zhu · Mark Lee · Shengjie Sun · Wei Xue · Yike Guo
Abstract:
Mixture-of-Experts (MoE) architectures in large language models (LLMs) achieve exceptional performance, but face prohibitive storage and memory requirements. To address these challenges, we present $D^2$-MoE, a new delta decompression compressor for reducing the parameters of MoE LLMs. Based on observations of expert diversity, we decompose their weights into a shared base weight and unique delta weights. Specifically, our method first merges each expert's weight into the base weight using the Fisher information matrix to capture shared components. Then, we compress delta weights through Singular Value Decomposition (SVD) by exploiting their low-rank properties. Finally, we introduce a semi-dynamical structured pruning strategy for the base weights, combining static and dynamic redundancy analysis to achieve further parameter reduction while maintaining input adaptivity. In this way, our $D^2$-MoE successfully compacts MoE LLMs to high compression ratios without additional training. Extensive experiments highlight the superiority of our approach, with over 13\% performance gains over other compressors on Mixtral|Phi-3.5|DeepSeek|Qwen2 MoE LLMs at 40$\sim$60\% compression rates. Codes are available in supplementary materials.
Paperid:117
Authors:Ziru Wang · Mengmeng Wang · Jade Dai · Teli Ma · Guo-Jun Qi · Yong Liu · Guang Dai · Jingdong Wang
Abstract:
Integrating natural language instructions and visual perception with decision-making is a critical challenge for embodied agents. Existing methods often struggle to balance the conciseness of language commands with the richness of video content. To bridge the gap between modalities, we propose extracting key spatiotemporal patterns from video that capture visual saliency and temporal evolution, referred to as dynamic representation. Building on this, we introduce DynaMind, a framework that enhances decision-making through dynamic reasoning. Specifically, we design an adaptive FrameScorer to evaluate video frames based on semantic consistency and visual saliency, assigning each frame an importance score. These scores are used to filter redundant video content and synthesize compact dynamic representations. Leveraging these representations, we predict critical future dynamics and apply a dynamic-guided policy to generate coherent and context-aware actions. Extensive results demonstrate that DynaMind significantly outperforms the baselines across several simulation benchmarks and real-world scenarios.
Paperid:118
Authors:Ofri Eisen · Ron Dorfman · Kfir Levy
Abstract:
Decentralized learning has emerged as a powerful approach for handling large datasets across multiple machines in a communication-efficient manner. However, such methods often face scalability limitations, as increasing the number of machines beyond a certain point negatively impacts convergence rates. In this work, we propose Decentralized Anytime SGD, a novel decentralized learning algorithm that significantly extends the critical parallelism threshold, enabling the effective use of more machines without compromising performance. Within the stochastic convex optimization (SCO) framework, we establish a theoretical upper bound on parallelism that surpasses the current state-of-the-art, allowing larger networks to achieve favorable statistical guarantees and closing the gap with centralized learning in highly connected topologies.
Paperid:119
Authors:Yangsibo Huang · Milad Nasr · Anastasios Angelopoulos · Nicholas Carlini · Wei-Lin Chiang · Christopher A. Choquette Choo · Daphne Ippolito · Matthew Jagielski · Katherine Lee · Ken Ziyu Liu · Ion Stoica · Florian Tramer · Chiyuan Zhang
Abstract:
It is now common to evaluate Large Language Models (LLMs) by having humans manually vote to evaluate model outputs, in contrast to typical benchmarks that evaluate knowledge or skill at some particular task. Chatbot Arena, the most popular benchmark of this type, ranks models by asking users to select the better response between two randomly selected models (without revealing which model was responsible for the generations). These platforms are widely trusted as a fair and accurate measure of LLM capabilities. In this paper, we show that if bot protection and other defenses are not implemented, these voting-based benchmarks are potentially vulnerable to adversarial manipulation. Specifically, we show that an attacker can alter the leaderboard (to promote their favorite model or demote competitors) at the cost of roughly a thousand votes (verified in a simulated, offline version of Chatbot Arena). Our attack consists of two steps: first, we show how an attacker can determine which model was used to generate a given reply with more than $95\%$ accuracy; and then, the attacker can use this information to consistently vote for (or against) a target model. Working with the Chatbot Arena developers, we identify, propose, and implement mitigations to improve the robustness of Chatbot Arena against adversarial manipulation, which, based on our analysis, substantially increases the cost of such attacks. Some of these defenses were present before our collaboration, such as bot protection with Cloudflare, malicious user detection, and rate limiting. Others, including reCAPTCHA and login, are being integrated to strengthen the security in Chatbot Arena.
Paperid:120
Authors:Etowah Adams · Liam Bai · Minji Lee · Yiyang Yu · Mohammed AlQuraishi
Abstract:
Protein language models (pLMs) are powerful predictors of protein structure and function, learning through unsupervised training on millions of protein sequences. pLMs are thought to capture common motifs in protein sequences, but the specifics of pLM features are not well understood. Identifying these features would not only shed light on how pLMs work, but potentially uncover novel protein biology: studying the model to study the biology. Motivated by this, we train sparse autoencoders (SAEs) on the residual stream of a pLM, ESM2. By characterizing SAE features, we determine that pLMs use a combination of generic features and family-specific features to represent a protein. In addition, we demonstrate how known sequence determinants of properties such as thermostability and subcellular localization can be identified by linear probing of SAE features. For predictive features without known functional associations, we hypothesize their role in unknown mechanisms and provide visualization tools to aid their interpretation. Our study gives a better understanding of the limitations of pLMs, and demonstrates how SAE features can be used to help generate hypotheses for biological mechanisms.
Paperid:121
Authors:Zhuoling Li · Xiaogang Xu · Zhenhua Xu · Ser-Nam Lim · Hengshuang Zhao
Abstract:
Recent embodied agents are primarily built based on reinforcement learning (RL) or large language models (LLMs). Among them, RL agents are efficient for deployment but only perform very few tasks. By contrast, giant LLM agents (often more than 1000B parameters) present strong generalization while demanding enormous computing resources. In this work, we combine their advantages while avoiding the drawbacks by conducting the proposed referee RL on our developed large autoregressive model (LARM). Specifically, LARM is built upon a lightweight LLM (fewer than 5B parameters) and directly outputs the next action to execute rather than text. We mathematically reveal that classic RL feedbacks vanish in long-horizon embodied exploration and introduce a giant LLM based referee to handle this reward vanishment during training LARM. In this way, LARM learns to complete diverse open-world tasks without human intervention. Especially, LARM successfully harvests enchanted diamond equipment in Minecraft, which demands significantly longer decision-making chains than the highest achievements of prior best methods.
Paperid:122
Authors:Yuxiao Wen
Abstract:
In combinatorial semi-bandits, a learner repeatedly selects from a combinatorial decision set of arms, receives the realized sum of rewards, and observes the rewards of the individual selected arms as feedback. In this paper, we extend this framework to include \emph{graph feedback}, where the learner observes the rewards of all neighboring arms of the selected arms in a feedback graph $G$. We establish that the optimal regret over a time horizon $T$ scales as $\widetilde{\Theta}(S\sqrt{T}+\sqrt{\alpha ST})$, where $S$ is the size of the combinatorial decisions and $\alpha$ is the independence number of $G$. This result interpolates between the known regrets $\widetilde\Theta(S\sqrt{T})$ under full information (i.e., $G$ is complete) and $\widetilde\Theta(\sqrt{KST})$ under the semi-bandit feedback (i.e., $G$ has only self-loops), where $K$ is the number of actions. A technical ingredient is to realize a convexified action using a random decision vector with negative correlations, a result that may be of independent interest.
Paperid:123
Authors:Fan Wang · Feiyu Jiang · Zifeng Zhao · Yi Yu
Abstract:
Dynamic pricing strategies are crucial for firms to maximize revenue by adjusting prices based on market conditions and customer characteristics. However, designing optimal pricing strategies becomes challenging when historical data are limited, as is often the case when launching new products or entering new markets. One promising approach to overcome this limitation is to leverage information from related products or markets to inform the focal pricing decisions. In this paper, we explore transfer learning for nonparametric contextual dynamic pricing under a covariate shift model, where the marginal distributions of covariates differ between source and target domains while the reward functions remain the same. We propose a novel Transfer Learning for Dynamic Pricing (TLDP) algorithm that can effectively leverage pre-collected data from a source domain to enhance pricing decisions in the target domain. The regret upper bound of TLDP is established under a simple Lipschitz condition on the reward function. To establish the optimality of TLDP, we further derive a matching minimax lower bound, which includes the target-only scenario as a special case and is presented for the first time in the literature. Extensive numerical experiments validate our approach, demonstrating its superiority over existing methods and highlighting its practical utility in real-world applications.
Paperid:124
Authors:Chongyu Fan · jinghan jia · Yihua Zhang · Anil Ramakrishna · Mingyi Hong · Sijia Liu
Abstract:
The LLM unlearning technique has recently been introduced to comply with data regulations and address the safety and ethical concerns of LLMs by removing the undesired data-model influence. However, state-of-the-art unlearning methods face a critical vulnerability: they are susceptible to "relearning" the removed information from a small number of forget data points, known as relearning attacks. In this paper, we systematically investigate how to make unlearned models robust against such attacks. For the first time, we establish a connection between robust unlearning and sharpness-aware minimization (SAM) through a unified robust optimization framework, in an analogy to adversarial training designed to defend against adversarial attacks. Our analysis for SAM reveals that smoothness optimization plays a pivotal role in mitigating relearning attacks. Thus, we further explore diverse smoothing strategies to enhance unlearning robustness. Extensive experiments on benchmark datasets, including WMDP and MUSE, demonstrate that SAM and other smoothness optimization approaches consistently improve the resistance of LLM unlearning to relearning attacks. Notably, smoothness-enhanced unlearning also helps defend against (input-level) jailbreaking attacks, broadening our proposal's impact in robustifying LLM unlearning.
Paperid:125
Authors:Tianyi Zhang · Junda Su · Aditya Desai · Oscar Wu · Zhaozhuo Xu · Anshumali Shrivastava
Abstract:
Adapting pretrained large language models (LLMs) is crucial but challenging due to their enormous size. Parameter-efficient fine-tuning (PEFT) techniques typically employ additive adapters applied to frozen model weights. To further reduce memory usage, model weights can be compressed through quantization. However, existing PEFT methods often yield suboptimal model quality due to restrictive assumptions, such as imposing low-rank constraints on adapters to reduce trainable parameters. We find that sketching, a popular data compression technique, can serve as an efficient adaptation strategy for LLMs while avoiding low-rank assumptions. We introduce SketchTune, a compressive adaptation strategy that compresses LLM weights into compact fine-tunable sketches, integrating compression and adaptation into a unified framework. This integration eliminates the need for complex two-path computation common in existing PEFT techniques, enabling faster and more memory-efficient training and inference. SketchTune is supported by mathematical insights into matrix classes that are better approximated using sketching rather than low-rank methods. Our rigorous evaluations with Llama-1/2/3 models demonstrate that SketchTune outperforms leading PEFT methods across diverse tasks including math problem-solving, common sense reasoning, and instruction following, while using substantially smaller base models and comparable trainable parameters. As a highlight, SketchTune outperforms LoRA, DoRA, and S2FT on commonsense and math benchmarks using 2.6-3.5$\times$ smaller base models and exceeds LoftQ in accuracy by 14.48\% on GSM8K with 7.3$\times$ fewer trainable parameters.
Paperid:126
Authors:Rujikorn Charakorn · Edoardo Cetin · Yujin Tang · Robert Lange
Abstract:
While Foundation Models provide a general tool for rapid content creation, they regularly require task-specific adaptation. Traditionally, this exercise involves careful curation of datasets and repeated fine-tuning of the underlying model. Fine-tuning techniques enable practitioners to adapt foundation models for many new applications but require expensive and lengthy training while being notably sensitive to hyper-parameter choices. To overcome these limitations, we introduce Text-to-LoRA (T2L), a model capable of adapting Large Language Models on the fly solely based on a natural language description of the target task. T2L is a hypernetwork trained to construct LoRAs in a single inexpensive forward pass. After training T2L on a suite of 9 pre-trained LoRA adapters (GSM8K, Arc, etc.), we show that the ad-hoc reconstructed LoRA instances match the performance of task-specific adapters across the corresponding test sets. Furthermore, T2L can compress hundreds of LoRA instances and zero-shot generalize to entirely unseen tasks. This approach provides a significant step towards democratizing the specialization of foundation models and enables language-based adaptation with minimal compute requirements. Our code and pre-trained checkpoints will be available upon publication.
Paperid:127
Authors:Kefan Dong · Tengyu Ma
Abstract:
A fundamental challenge in formal theorem proving by LLMs is the lack of high-quality training data. Although reinforcement learning or expert iteration partially mitigates this issue by alternating between LLM generating proofs and fine-tuning them on correctly generated ones, performance quickly plateaus due to the scarcity of correct proofs (sparse rewards). To keep improving the models with limited data, we draw inspiration from mathematicians, who continuously develop new results, partly by proposing novel conjectures or exercises (which are often variants of known results) and attempting to solve them. We design the Self-play Theorem Prover (STP) that simultaneously takes on two roles, conjecturer and prover, each providing training signals to the other. The conjecturer is trained iteratively on previously generated conjectures that are barely provable by the current prover, which incentivizes it to generate increasingly challenging conjectures over time. The prover attempts to prove the conjectures with standard expert iteration. We evaluate STP with both Lean and Isabelle formal verifiers. With 19.8 billion tokens generated during the training in Lean, STP proves 26.3% of the statements in the LeanWorkbook dataset, doubling the previous best result of 13.2% achieved through expert iteration. The final model achieves state-of-the-art performance among whole-proof generation methods on miniF2F-test (61.1%, pass@3200), Proofnet-test (23.1%, pass@3200) and PutnamBench (8/644, pass@64).
Paperid:128
Authors:Zhehan Zhao · Lu Bai · Lixin Cui · Ming Li · Ziyu Lyu · Lixiang Xu · Yue Wang · Edwin Hancock
Abstract:
Graph Neural Networks (GNNs) are powerful tools for graph learning, and one key operation is downsampling or pooling, which helps learn effective graph representations from node embeddings. In this paper, we propose a novel graph pooling operation, namely the Edge-Node Attention-based Hierarchical Pooling (ENAHPool), for GNNs to learn graph representations. Unlike previous cluster-based hierarchical pooling methods that rely on ambiguous node assignments and uniform aggregation of edge-node structural information, ENAHPool assigns each node entirely to a cluster. It then incorporates attention mechanisms to perform weighted aggregation, combining both node features within clusters and edge connectivity strengths between clusters at each pooling step. Additionally, since pooling operations are defined in conjunction with Message-Passing Neural Networks (MPNNs), we propose a Multi-Distance MPNN (MD-MPNN) architecture that enables nodes to directly receive feature information from neighbors at various distances, effectively mitigating the notorious over-squashing issue found in conventional MPNNs. Furthermore, it harnesses the valuable information implicit in edge connectivity strengths, further enhancing the performance of ENAHPool. Experiments demonstrate the effectiveness of the proposed ENAHPool operation associated with the MD-MPNN architecture.
Paperid:129
Authors:Ziyue Qiao · Qianyi Cai · Hao Dong · Jiawei Gu · Pengyang Wang · Meng Xiao · Xiao Luo · Hui Xiong
Abstract:
This paper addresses the challenge of graph domain adaptation on evolving, multiple out-of-distribution (OOD) graphs. Conventional graph domain adaptation methods are confined to single-step adaptation, making them ineffective in handling continuous domain shifts and prone to catastrophic forgetting. This paper introduces the Graph Continual Adaptive Learning (GCAL) method, designed to enhance model sustainability and adaptability across various graph domains. GCAL employs a bilevel optimization strategy. The "adapt" phase uses an information maximization approach to fine-tune the model with new graph domains while re-adapting past memories to mitigate forgetting. Concurrently, the "generate memory" phase, guided by a theoretical lower bound derived from information bottleneck theory, involves a variational memory graph generation module to condense original graphs into memories. Extensive experimental evaluations demonstrate that GCAL substantially outperforms existing methods in terms of adaptability and knowledge retention.
Paperid:130
Authors:Hongyao Chen · Tianyang Xu · Xiaojun Wu · Josef Kittler
Abstract:
Batch Normalisation (BN) is widely used in conventional deep neural network training to harmonise the input-output distributions for each batch of data. However, federated learning, a distributed learning paradigm, faces the challenge of dealing with non-independent and identically distributed data among the client nodes. Due to the lack of a coherent methodology for updating BN statistical parameters, standard BN degrades the federated learning performance. To this end, it is urgent to explore an alternative normalisation solution for federated learning. In this work, we resolve the dilemma of the BN layer in federated learning by developing a customised normalisation approach, Hybrid Batch Normalisation (HBN). HBN separates the update of statistical parameters (i.e., means and variances used for evaluation) from that of learnable parameters (i.e., parameters that require gradient updates), obtaining unbiased estimates of global statistical parameters in distributed scenarios. In contrast with the existing solutions, we emphasise the supportive power of global statistics for federated learning. The HBN layer introduces a learnable hybrid distribution factor, allowing each computing node to adaptively mix the statistical parameters of the current batch with the global statistics. Our HBN can serve as a powerful plugin to advance federated learning performance. It reflects promising merits across a wide range of federated learning settings, especially for small batch sizes and heterogeneous data.
Paperid:131
Authors:Jaehyun Kwak · Izaaz Inhar · Se-Young Yun · Sung-Ju Lee
Abstract:
Composed Image Retrieval (CIR) retrieves relevant images based on a reference image and accompanying text describing desired modifications. However, existing CIR methods only focus on retrieving the target image and disregard the relevance of other images. This limitation arises because most methods employ contrastive learning, which treats the target image as positive and all other images in the batch as negatives, and can inadvertently include false negatives. This may result in retrieving irrelevant images, reducing user satisfaction even when the target image is retrieved. To address this issue, we propose Query-Relevant Retrieval through Hard Negative Sampling (QuRe), which optimizes a reward model objective to reduce false negatives. Additionally, we introduce a hard negative sampling strategy that selects images positioned between two steep drops in relevance scores following the target image, to effectively filter false negatives. In order to evaluate CIR models on their alignment with human satisfaction, we create Human-Preference FashionIQ (HP-FashionIQ), a new dataset that explicitly captures user preferences beyond target retrieval. Extensive experiments demonstrate that QuRe achieves state-of-the-art performance on FashionIQ and CIRR datasets while exhibiting the strongest alignment with human preferences on the HP-FashionIQ dataset.
Paperid:132
Authors:Yi-Rui Yang · Chang-Wei Shi · Wu-Jun Li
Abstract:
Byzantine-robust distributed learning (BRDL), which refers to distributed learning with potential faulty workers (also known as Byzantine workers), has recently attracted much research attention. Robust aggregators are widely used in existing BRDL methods to obtain robustness against Byzantine workers. However, Byzantine workers do not always exist in applications. As far as we know, there is almost no existing work theoretically investigating the effect of using robust aggregators when there is no Byzantine worker. In view of this challenge, we theoretically analyze the aggregation error for robust aggregators when there is no Byzantine worker. Specifically, we show that the worst-case aggregation error without attack increases with the increase of the number of Byzantine workers that a robust aggregator can tolerate. The theoretical result reveals the tension between Byzantine robustness and no-attack accuracy. Furthermore, we provide lower bounds for the convergence rate of gradient descent with robust aggregators for non-convex objective functions and objective functions that satisfy the Polyak-Lojasiewicz (PL) condition, respectively. We also prove the tightness of the lower bounds. The lower bounds for convergence rate reveal similar tension between Byzantine robustness and no-attack accuracy. Empirical results further support our theoretical findings.
Paperid:133
Authors:TianChi Xing · Bonan Li · Congying Han · XINMIN QIU · Zicheng Zhang · Tiande Guo
Abstract:
Multi-view Diffusion has greatly advanced the development of 3D content creation by generating multiple images from distinct views, achieving remarkable photorealistic results. However, existing works are still vulnerable to inconsistent 3D geometric structures (commonly known as the Janus Problem) and severe artifacts. In this paper, we introduce MIRROR, a versatile plug-and-play method that rectifies such inconsistencies in a training-free manner, enabling the acquisition of high-fidelity, realistic structures without compromising diversity. Our key idea focuses on tracing the motion trajectory of physical points across adjacent viewpoints, enabling rectifications based on neighboring observations of the same region. Technically, MIRROR comprises two core modules: a Trajectory Tracking Module (TTM) for pixel-wise trajectory tracking that labels identical points across views, and a Feature Rectification Module (FRM) for explicit adjustment of each pixel embedding on noisy synthesized images by minimizing the distance to corresponding block features in neighboring views, thereby achieving consistent outputs. Extensive evaluations demonstrate that MIRROR can seamlessly integrate with a diverse range of off-the-shelf multi-view diffusion models, significantly enhancing both consistency and fidelity in an efficient way. Code will be released upon acceptance.
Paperid:134
Authors:Maxime Burchi · Radu Timofte
Abstract:
The Dreamer algorithm has recently obtained remarkable performance across diverse environment domains by training powerful agents with simulated trajectories. However, the compressed nature of its world model's latent space can result in the loss of crucial information, negatively affecting the agent's performance. Recent approaches, such as $\Delta$IRIS, address this limitation by training more accurate world models. However, these methods require training agents directly from pixels, which reduces training efficiency and prevents the agent from benefiting from the inner representations learned by the world model. In this work, we propose an alternative approach to world modeling that is both accurate and efficient. We introduce EMERALD (Efficient MaskEd latent tRAnsformer worLD model), a world model using a spatial latent state with MaskGIT predictions to generate accurate trajectories in latent space and improve agent performance. On the Crafter benchmark, EMERALD achieves new state-of-the-art performance, becoming the first method to surpass human expert performance within 10M environment steps. Our method also succeeds in unlocking all 22 Crafter achievements at least once during evaluation.
Paperid:135
Authors:Chenghua Liu · Minbo Gao · Zhengfeng Ji · Ying
Abstract:
Graph sparsification serves as a foundation for many algorithms, such as approximation algorithms for graph cuts and Laplacian system solvers. As its natural generalization, hypergraph sparsification has recently gained increasing attention, with broad applications in graph machine learning and other areas. In this work, we propose the first quantum algorithm for hypergraph sparsification, addressing an open problem posed by Apers and de Wolf (FOCS'20). For a weighted hypergraph with $n$ vertices, $m$ hyperedges, and rank $r$, our algorithm outputs a near-linear size $\varepsilon$-spectral sparsifier in time $\widetilde O(r\sqrt{mn}/\varepsilon)$. This matches the quantum lower bound for constant $r$ and demonstrates quantum speedup when compared with the state-of-the-art $\widetilde O(mr)$-time classical algorithm. As applications, our algorithm implies quantum speedups for computing hypergraph cut sparsifiers and for approximating hypergraph mincuts and hypergraph $s$-$t$ mincuts.
Paperid:136
Authors:Pratinav Seth · Michelle Lin · BREFO YAW · Jade Boutot · Mary Kang · David Rolnick
Abstract:
Millions of abandoned oil and gas wells are scattered across the world, leaching methane into the atmosphere and toxic compounds into the groundwater. Many of these locations are unknown, preventing the wells from being plugged and their polluting effects averted. Remote sensing is a relatively unexplored tool for pinpointing abandoned wells at scale. We introduce the first large-scale benchmark dataset for this problem, leveraging high-resolution multi-spectral satellite imagery from Planet Labs. Our curated dataset comprises over 213,000 wells (abandoned, suspended, and active) from Alberta, a region with especially high well density, sourced from the Alberta Energy Regulator and verified by domain experts. We evaluate baseline algorithms for well detection and segmentation, showing the promise of computer vision approaches but also significant room for improvement.
Paperid:137
Authors:Mengxiao Zhang · Yingfei Wang · Haipeng Luo
Abstract:
A recent work by Schlisselberg et al. (2024) studies a delay-as-payoff model for stochastic multi-armed bandits, where the payoff (either loss or reward) is delayed for a period that is proportional to the payoff. While this captures many real-world applications, the simple multi-armed bandit setting limits the practicality of their results. In this paper, we address this limitation by studying the delay-as-payoff model for contextual linear bandits. Specifically, we start from the case with a fixed action set and propose an efficient algorithm whose regret overhead compared to the standard no-delay case is only of order $D\Delta_{\max}\log T$, where $T$ is the total horizon, $D$ is the maximum delay, and $\Delta_{\max}$ is the maximum suboptimality gap. When the payoff is a loss, we show a further improvement of the bound, demonstrating a separation between reward and loss similar to Schlisselberg et al. (2024). Contrary to standard linear bandit algorithms that construct a least-squares estimator and confidence ellipsoid, the main novelty of our algorithm is to apply a phased arm elimination procedure by only picking the **volumetric spanners** of the action set, which addresses challenges arising from both payoff-dependent delays and large action sets. We further extend our results to the case with varying action sets by adopting the reduction from Hanna et al. (2023). Finally, we implement our algorithm and showcase its effectiveness and superior performance in experiments.
Paperid:138
Authors:Fabrizio Boninsegna · Jacob Imola · Anders Aamand · Abigail Gentle · Rasmus Pagh
Abstract:
Distributed data analysis is a large and growing field, driven by a massive proliferation of user devices and by privacy concerns surrounding the centralised storage of data. We consider two \emph{adaptive} algorithms for estimating one quantile (e.g.~the median) when each user holds a single data point lying in a domain $[B]$ that can be queried once through a private mechanism: one under local differential privacy (LDP) and another under shuffle differential privacy (shuffle DP). In the adaptive setting we present an $\varepsilon$-LDP algorithm which can estimate any quantile within error $\alpha$ using only $O(\frac{\log B}{\varepsilon^2\alpha^2})$ users, and an $(\varepsilon,\delta)$-shuffle DP algorithm requiring only $\widetilde{O}((\frac{1}{\varepsilon^2}+\frac{1}{\alpha^2})\log B)$ users. Prior (non-adaptive) algorithms require more users by several logarithmic factors in $B$. We further provide a matching lower bound for adaptive protocols, showing that our LDP algorithm is optimal in the low-$\varepsilon$ regime. Additionally, we establish lower bounds against non-adaptive protocols which, paired with our understanding of the adaptive case, prove a fundamental separation between these models.
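For intuition only, here is a minimal sketch (our own construction, not the paper's protocol or its user-complexity guarantees) of the adaptive idea in the local model: binary-search the domain, spending a fresh batch of users per round, with each user answering a single threshold query through randomised response:

```python
import math
import random

def rr_bit(bit, epsilon):
    """epsilon-LDP randomised response: keep the bit w.p. e^eps / (e^eps + 1)."""
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return bit if random.random() < p_keep else 1 - bit

def ldp_median(data, B, epsilon, users_per_round=500):
    """Adaptive binary search for the median over the domain [0, B).

    Each user is consumed once and answers one threshold query
    ("is your value <= mid?") through randomised response.
    """
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    users = iter(data)
    lo, hi = 0, B
    while hi - lo > 1:
        mid = (lo + hi) // 2
        noisy = [rr_bit(1 if next(users) <= mid else 0, epsilon)
                 for _ in range(users_per_round)]
        # debias the randomised responses to estimate Pr[value <= mid]
        est = (sum(noisy) / users_per_round - (1.0 - p)) / (2.0 * p - 1.0)
        if est >= 0.5:
            hi = mid
        else:
            lo = mid
    return hi
```

Adaptivity is what keeps the user cost down here: each round only needs enough users to resolve one noisy comparison, rather than estimating the whole histogram at once.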
Paperid:139
Authors:Daoyuan Chen · Haibin Wang · Yilun Huang · Ce Ge · Yaliang Li · Bolin Ding · Jingren Zhou
Abstract:
The emergence of multimodal large models has advanced artificial intelligence, introducing unprecedented levels of performance and functionality. However, optimizing these models remains challenging due to historically isolated paths of model-centric and data-centric developments, leading to suboptimal outcomes and inefficient resource utilization. In response, we present a new sandbox suite tailored for integrated data-model co-development. This sandbox provides a feedback-driven experimental platform, enabling cost-effective iteration and guided refinement of both data and models. Our proposed ``Probe-Analyze-Refine'' workflow, validated through practical use cases on multimodal tasks such as image-text pre-training with CLIP, image-to-text generation with LLaVA-like models, and text-to-video generation with DiT-based models, yields transferable and notable performance boosts, such as topping the VBench leaderboard. Extensive experiments also uncover fruitful insights into the interplay between data quality, diversity, model behavior, and computational costs. All codes, datasets, and models are open-sourced to foster future research and applications that would otherwise be infeasible due to the lack of a dedicated co-development infrastructure.
Paperid:140
Authors:Mingzhe Yang · Sihao Lin · Changlin Li · Xiaojun Chang
Abstract:
Large language models (LLMs) have revolutionized various AI applications. However, their billions of parameters pose significant challenges for practical deployment. Structured pruning is a hardware-friendly compression technique that has received widespread attention. Nonetheless, existing literature typically targets a single structure of LLMs. We observe that different structure units within the LLM contribute differently to inference cost and functionality. Therefore, pruning a single structure unit in isolation often results in an imbalance between the LLM's performance and efficiency after pruning. In addition, previous works mainly employ a prescribed pruning ratio. Since the significance of LLM modules may vary, it is ideal to distribute the pruning load to each structure unit according to its role within the LLM. To address these two issues, we propose a pruning method that targets multiple LLM modules with dynamic pruning ratios. Specifically, we find that the intrinsic structure of the LLM can guide us in determining the importance of each module and thus distributing the pruning load on demand, i.e., what to prune and how much to prune. This is achieved by quantifying the complex interactions within the LLM. Through extensive experimentation on different language benchmarks and LLMs, we demonstrate that our method effectively resolves the trade-off between efficiency and performance.
Paperid:141
Authors:Angelica Chen · Samuel Stanton · Frances Ding · Robert Alberstein · Andrew Watkins · Richard Bonneau · Vladimir Gligorijevic · Kyunghyun Cho · Nathan Frey
Abstract:
Although large language models (LLMs) have shown promise in biomolecule optimization problems, they incur heavy computational costs and struggle to satisfy precise constraints. On the other hand, specialized solvers like LaMBO-2 offer efficiency and fine-grained control but require more domain expertise. Comparing these approaches is challenging due to expensive laboratory validation and inadequate synthetic benchmarks. We address this by introducing Ehrlich functions, a synthetic test suite that captures the geometric structure of biophysical sequence optimization problems. With prompting alone, off-the-shelf LLMs struggle to optimize Ehrlich functions. In response, we propose \textsc{LLOME} (Language Model Optimization with Margin Expectation), a bilevel optimization routine for online black-box optimization. When combined with a novel preference learning loss, we find \textsc{LLOME} can not only learn to solve some Ehrlich functions, but can even outperform \textsc{LaMBO-2} on moderately difficult Ehrlich variants. However, \textsc{LLOME} is only comparable to \textsc{LaMBO-2} on very easy or difficult variants, exhibits some likelihood-reward miscalibration, and struggles without explicit rewards. Our results indicate LLMs can provide significant benefits in some cases, but specialized solvers are still competitive and incur less overhead.
Paperid:142
Authors:Xing Li · Zeyu Xing · Yiming Li · Linping Qu · Huiling Zhen · Yiwu Yao · Wulong Liu · Sinno Jialin Pan · Mingxuan Yuan
Abstract:
KV cache quantization can improve Large Language Model (LLM) inference throughput and latency in long-context and large batch-size scenarios while preserving LLM effectiveness. However, current methods have three unsolved issues: overlooking layer-wise sensitivity to KV cache quantization, high overhead of online fine-grained decision-making, and low flexibility to different LLMs and constraints. Therefore, we thoroughly analyze the inherent correlation of layer-wise transformer attention patterns to KV cache quantization errors and study why the key cache is more important than the value cache for quantization error reduction. We further propose a simple yet effective framework, KVTuner, to adaptively search for the optimal hardware-friendly layer-wise KV quantization precision pairs for coarse-grained KV cache with multi-objective optimization, and directly utilize the offline-searched configurations during online inference. To reduce the computational cost of offline calibration, we utilize intra-layer KV precision pair pruning and inter-layer clustering to reduce the search space. Experimental results show that we can achieve nearly lossless 3.25-bit mixed precision KV cache quantization for LLMs like Llama-3.1-8B-Instruct and 5.0-bit for sensitive models like Qwen2.5-7B-Instruct on mathematical reasoning tasks. The maximum inference throughput can be improved by 38.3% compared with KV8 quantization over various context lengths.
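The bit-width versus error trade-off that a layer-wise precision search exploits can be seen with plain uniform quantisation (a generic sketch; KVTuner's actual quantiser and search procedure are more involved):

```python
import numpy as np

def quantize(x, bits):
    """Uniform per-tensor quantisation to the given bit width, then dequantise."""
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    return np.round((x - lo) / scale) * scale + lo

rng = np.random.default_rng(0)
k = rng.normal(size=(8, 64))     # stand-in for one layer's key cache
# a coarse layer-wise "precision pair" might, e.g., give keys more bits than values
err4 = np.abs(quantize(k, 4) - k).mean()
err2 = np.abs(quantize(k, 2) - k).mean()
```

A precision search then amounts to choosing, per layer, the smallest bit widths whose resulting error the attention output can tolerate.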
Paperid:143
Authors:Hailong Jiang · Jianfeng Zhu · Yao Wan · Bo Fang · Hongyu Zhang · Ruoming Jin · Qiang Guan
Abstract:
Intermediate Representations (IRs) are essential in compiler design and program analysis, yet their comprehension by Large Language Models (LLMs) remains underexplored. This paper presents a pioneering empirical study investigating the capabilities of LLMs, including GPT-4, GPT-3, Gemma 2, LLaMA 3.1, and Code Llama, in understanding IRs. We analyze their performance across four tasks: Control Flow Graph (CFG) reconstruction, decompilation, code summarization, and execution reasoning. Our results indicate that while LLMs demonstrate competence in parsing IR syntax and recognizing high-level structures, they struggle with control flow reasoning, execution semantics, and loop handling. Specifically, they often misinterpret branching instructions, omit critical IR operations, and rely on heuristic-based reasoning, leading to errors in CFG reconstruction, IR decompilation, and execution reasoning. The study underscores the necessity for IR-specific enhancements in LLMs, recommending fine-tuning on structured IR datasets and integration of explicit control flow models to augment their comprehension and handling of IR-related tasks.
Paperid:144
Authors:Haoyang Luo · Linwei Tao · Minjing Dong · Chang Xu
Abstract:
Model calibration seeks to ensure that models produce confidence scores that accurately reflect the true likelihood of their predictions being correct. However, existing calibration approaches are fundamentally tied to datasets of one-hot labels, implicitly assuming full certainty in all the annotations. Such datasets are effective for classification but provide insufficient knowledge of uncertainty for model calibration, necessitating the curation of datasets with numerically rich ground-truth confidence values. However, due to the scarcity of uncertain visual examples, such samples are not easily available in real datasets. In this paper, we introduce calibration-aware data augmentation to create synthetic datasets of diverse samples and their ground-truth uncertainty. Specifically, we present Calibration-aware Semantic Mixing (CSM), a novel framework that generates training samples with mixed class characteristics and annotates them with distinct confidence scores via diffusion models. Based on this framework, we propose calibrated reannotation to tackle the misalignment between the annotated confidence score and the mixing ratio during the diffusion reverse process. Besides, we explore loss functions that better fit the new data representation paradigm. Experimental results demonstrate that CSM achieves superior calibration compared to state-of-the-art calibration approaches.
Paperid:145
Authors:Kunal Jha · Wilka Carvalho · Yancheng Liang · Simon Du · Max Kleiman-Weiner · Natasha Jaques
Abstract:
Zero-shot coordination (ZSC), the ability to adapt to a new partner in a cooperative task, is a critical component of human-compatible AI. While prior work has focused on training agents to cooperate on a single task, these specialized models do not generalize to new tasks, even if they are highly similar. Here, we study how reinforcement learning on a distribution of environments with a single partner enables learning general cooperative skills that support ZSC with many new partners on many new problems. We introduce two Jax-based, procedural generators that create billions of solvable coordination challenges. We develop a new paradigm called Cross-Environment Cooperation (CEC), and show that it outperforms competitive baselines quantitatively and qualitatively when collaborating with real people. Our findings suggest that learning to collaborate across many unique scenarios encourages agents to develop general norms, which prove effective for collaboration with different partners. Together, our results suggest a new route toward designing generalist cooperative agents capable of interacting with humans without requiring human data.
Paperid:146
Authors:Hiroki Yanagisawa · Shunta Akiyama
Abstract:
This paper introduces a novel framework for survival analysis by reinterpreting it as a form of density estimation. Our algorithm postprocesses density estimation outputs to derive survival functions, enabling the application of any density estimation model to effectively estimate survival functions. This approach broadens the toolkit for survival analysis and enhances the flexibility and applicability of existing techniques. Our framework is versatile enough to handle various survival analysis scenarios, including competing risk models for multiple event types. It can also address dependent censoring when prior knowledge of the dependency between event time and censoring time is available in the form of a copula. In the absence of such information, our framework can estimate the upper and lower bounds of survival functions, accounting for the associated uncertainty.
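The reduction is direct to state: given an estimated density $f$ of event times, the survival function is $S(t) = 1 - \int_0^t f(u)\,du$. A minimal numerical sketch of this post-processing step (our illustration; any density estimator can stand in for the exponential density used here):

```python
import numpy as np

def survival_from_density(density, t_grid):
    """Post-process a density estimate over event times into a survival function:
    S(t) = 1 - CDF(t), with the CDF obtained by numerical integration."""
    f = density(t_grid)
    dt = np.diff(t_grid, prepend=t_grid[0])
    cdf = np.cumsum(f * dt)
    return 1.0 - np.clip(cdf, 0.0, 1.0)

# exponential event times with rate 1: S(t) should be close to exp(-t)
t = np.linspace(0.0, 10.0, 2001)
S = survival_from_density(lambda u: np.exp(-u), t)
```

Because only the density estimate is consumed, the same wrapper applies to any fitted density model, which is what lets the framework reuse the full density estimation toolkit.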
Paperid:147
Authors:Benjamin Ruben · William Tong · Hamza Chaudhry · Cengiz Pehlevan
Abstract:
Given a fixed budget for total model size, one must choose between training a single large model or combining the predictions of multiple smaller models. We investigate this trade-off for ensembles of random-feature ridge regression models in both the overparameterized and underparameterized regimes. Using asymptotically sharp risk estimates, we prove that when a fixed number of parameters is distributed among $K$ independently trained models, the ridge-optimized test risk increases with $K$. Consequently, a single large model achieves optimal performance. We then ask when *near*-optimal performance can be achieved by ensembles. In the overparameterized regime, we show that, to leading order, the test error depends on ensemble size and model size only through the total feature count, so that overparameterized ensembles consistently achieve near-optimal performance. To understand underparameterized ensembles, we derive scaling laws for the test risk as a function of total parameter count when the ensemble size and parameters per ensemble member are jointly scaled according to a ``growth exponent'' $\ell$. While optimal error scaling is always achieved by increasing model size with a fixed ensemble size, our analysis identifies conditions on the kernel and task eigenstructure under which near-optimal scaling laws can be obtained by joint scaling of ensemble size and model size.
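A toy version of the budget split makes the comparison concrete (our own construction, with arbitrary choices of feature map, task, and regulariser): one model with $N$ random features versus $K$ models with $N/K$ features each, under the same total parameter count:

```python
import numpy as np

def rf_ridge_fit(X, y, n_features, lam, rng):
    """Fit one random-feature ridge regression model."""
    W = rng.normal(size=(X.shape[1], n_features))   # random feature weights
    Phi = np.tanh(X @ W)                            # random feature map
    beta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n_features), Phi.T @ y)
    return W, beta

def ensemble_predict(models, X):
    """Average the predictions of the ensemble members."""
    return np.mean([np.tanh(X @ W) @ beta for W, beta in models], axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0])
# same total feature budget: one 64-feature model vs K = 4 models of 16 features
big = [rf_ridge_fit(X, y, 64, 1e-2, rng)]
small = [rf_ridge_fit(X, y, 16, 1e-2, rng) for _ in range(4)]
mse_big = np.mean((ensemble_predict(big, X) - y) ** 2)
mse_small = np.mean((ensemble_predict(small, X) - y) ** 2)
```

This toy only compares training fits; the paper's result concerns the ridge-optimized test risk, analysed with asymptotically sharp estimates rather than a single draw.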
Paperid:148
Authors:Samir Khaki · Xiuyu Li · Junxian Guo · Ligeng Zhu · Konstantinos N (Kostas) Plataniotis · Amir Yazdanbakhsh · Kurt Keutzer · Song Han · Zhijian Liu
Abstract:
Fine-tuning LLMs is both computationally and memory-intensive. While parameter-efficient fine-tuning methods, such as DoRA, reduce the number of trainable parameters and lower memory usage, they do not decrease computational cost. In some cases, they even slow down fine-tuning. In this paper, we introduce *SparseLoRA*, a method that accelerates LLM fine-tuning through contextual sparsity. We propose a lightweight, training-free *SVD sparsity estimator* that dynamically selects a sparse subset of weights for loss and gradient computation. We also systematically analyze and address sensitivity across layers, tokens, and training steps. Our experimental results show that SparseLoRA reduces computational cost by up to **1.7$\times$** and delivers a measured speedup of up to **1.4$\times$** while maintaining accuracy across various downstream tasks, including commonsense and arithmetic reasoning. We will release our code to encourage further research into fine-tuning methods that are both parameter- and computation-efficient.
Paperid:149
Authors:Qingqing Cao · Mahyar Najibi · Sachin Mehta
Abstract:
Pretraining strong vision or multimodal foundation models like CLIP relies on large-scale datasets (e.g., image-text pairs) that may be noisy, potentially misaligned, and long-tailed. Previous work has shown promising results in augmenting datasets by generating synthetic samples. However, such methods only support domain-specific, ad hoc use cases (such as for images or text alone) and offer limited data diversity due to a lack of fine-grained control over the synthesis process. We design a controllable image-text synthesis pipeline, CtrlSynth, to enable data-efficient multimodal learning and improve vision and multimodal models in various use cases. The key idea is to decompose the visual semantics of an image into basic elements, apply user-specified control policies (e.g., remove, add, or replace operations), and recompose them to synthesize images or texts. The decompose-and-recompose feature in CtrlSynth allows users to control data synthesis in a fine-grained manner by defining customized control policies to manipulate the basic elements. CtrlSynth leverages the capabilities of pretrained foundation models, such as large language models (LLMs) and diffusion models (DMs), to reason about and recompose basic elements such that synthetic samples are natural and composed in diverse ways. The CtrlSynth pipeline is training-free and has a modular design, making it easy to support different pretrained models. It is also closed-loop, meaning it can synthesize text data based on an image or vice versa. Our evaluation shows that CtrlSynth samples substantially improve the zero-shot classification, image-text retrieval, and compositional reasoning performance of CLIP models. We will publicly release the code and pipeline for future research.
Paperid:150
Authors:Tianhao Wu · Janice Lan · Weizhe Yuan · Jiantao Jiao · JASON WESTON · Sainbayar Sukhbaatar
Abstract:
LLMs are typically trained to answer user questions or follow instructions in the way human experts respond. However, in the standard alignment framework they lack the basic ability to think explicitly before answering. Thinking is important for complex questions that require reasoning and planning, but can be applied to any task. We propose a training method for equipping existing LLMs with such thinking abilities for general instruction following without the use of additional human data. We achieve this through an iterative search and optimization procedure that explores the space of possible thought generations, allowing the model to learn how to think without direct supervision. For each instruction, the thought candidates are scored using a judge model that evaluates their responses only, and then optimized via preference optimization. We show that this procedure leads to superior performance on AlpacaEval and Arena-Hard, and yields gains from thinking on non-reasoning categories such as marketing, health, and general knowledge, in addition to more traditional reasoning and problem-solving tasks.
Paperid:151
Authors:Wa-Kin Lei · Jun-Cheng Chen · Shang-Tse Chen
Abstract:
With the rise of large foundation models, split inference (SI) has emerged as a popular computational paradigm for deploying models across lightweight edge devices and cloud servers, addressing data privacy and computational cost concerns. However, most existing data reconstruction attacks have focused on smaller CNN classification models, leaving the privacy risks of foundation models in SI settings largely unexplored. To address this gap, we propose a novel data reconstruction attack based on guided diffusion, which leverages the rich prior knowledge embedded in a latent diffusion model (LDM) pretrained on a large-scale dataset. Our method performs iterative reconstruction on the LDM's learned image prior, effectively generating high-fidelity images resembling the original data from their intermediate representations (IRs). Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods, both qualitatively and quantitatively, in reconstructing data from deep-layer IRs of the vision foundation model. The results highlight the urgent need for more robust privacy protection mechanisms for large models in SI scenarios.
Paperid:152
Authors:Mingyu Kang · Yong Suk Choi
Abstract:
Text-to-image diffusion models have achieved remarkable success in generating high-quality and diverse images. Building on these advancements, diffusion models have also demonstrated exceptional performance in text-guided image editing. A key strategy for effective image editing involves inverting the source image into editable noise maps associated with the target image. However, previous inversion methods face challenges in adhering closely to the target text prompt. The limitation arises because inverted noise maps, while enabling faithful reconstruction of the source image, restrict the flexibility needed for desired edits. To overcome this issue, we propose Editable Noise Map Inversion (ENM Inversion), a novel inversion technique that searches for optimal noise maps to ensure both content preservation and editability. We analyze the properties of noise maps for enhanced editability. Based on this analysis, our method introduces an editable noise refinement that aligns with the desired edits by minimizing the difference between the reconstructed and edited noise maps. Extensive experiments demonstrate that ENM Inversion outperforms existing approaches across a wide range of image editing tasks in both preservation and edit fidelity with target prompts. Our approach can also be easily applied to video editing, enabling temporal consistency and content manipulation across frames.
Paperid:153
Authors:Yiğit Narter · Alihan Hüyük · M van der Schaar · Cem Tekin
Abstract:
Current screening programs that focus on improving patient health while minimizing screening costs are tailored for individual diseases. Designing unified screening programs for multiple diseases requires carefully balancing competing disease risks, which is an open problem. In this work, we address this problem by casting unified screening as a referral problem, in which we choose to activate a subset of screening policies for individual diseases by accounting for competing risks that influence patient outcomes. We introduce a novel optimization framework that incorporates disease risks, budget constraints, and diagnostic error limits and characterize the structural properties of the optimal referral policy. For the unified screening of two diseases, we show that the optimal activation threshold for the screening of one disease depends on the risk of the other, resulting in decision boundaries with distinct risk-dependent profiles. We compare our unified model with independent screening programs that apply isolated activation thresholds for screening of each disease. Our approach optimizes screening decisions collectively, improving overall survival outcomes, particularly for patients with high disease risks.
Paperid:154
Authors:Julien Pourcel · Cédric Colas · Pierre-Yves Oudeyer
Abstract:
Many program synthesis tasks prove too challenging for even state-of-the-art language models to solve in a single attempt. Search-based evolutionary methods offer a promising alternative by exploring solution spaces iteratively, but their effectiveness remains limited by the fixed capabilities of the underlying generative model. We propose SOAR, a method that learns program synthesis by integrating language models into a self-improving evolutionary loop. SOAR alternates between (1) an evolutionary search that uses an LLM to generate and refine candidate solutions, and (2) a hindsight learning phase that converts search attempts into valid problem-solution pairs used to fine-tune the LLM's generation and refinement capabilities, enabling increasingly effective search in subsequent iterations. On the challenging ARC-AGI benchmark, SOAR achieves significant performance gains across model scales and iterations, leveraging positive transfer between the generation and refinement fine-tuning tasks. These improvements carry over to test-time adaptation, where SOAR achieves state-of-the-art results among program synthesis methods based on open-source LLMs.
Paperid:155
Authors:Wei Chen · Jun-Xiang Mao · Xiaozheng Wang · Min-Ling Zhang
Abstract:
The learnware paradigm aims to establish a learnware dock system that contains numerous learnwares, each consisting of a well-trained model and a specification, enabling users to reuse high-performing models for their tasks instead of training from scratch. The specification, as a unique characterization of the model's specialties, dominates the effectiveness of model reuse. Existing specification methods mainly employ distribution alignment to generate specifications. However, this approach overlooks the model's discriminative performance, hindering an adequate specialty characterization. In this paper, we claim that it is beneficial to incorporate such discriminative performance for high-quality specification generation. Accordingly, a novel specification approach named Dali, i.e., Learnware Specification via Dual ALIgnment, is proposed. In Dali, the characterization of the model's discriminative performance is modeled as discriminative alignment, which is considered along with distribution alignment in the specification generation process. Theoretical and empirical analyses clearly demonstrate that the proposed approach is capable of facilitating model reuse in the learnware paradigm with high-quality specification generation.
Paperid:156
Authors:Daniel Eftekhari · Vardan Papyan
Abstract:
The normal distribution plays a central role in information theory – it is at the same time the best-case signal and worst-case noise distribution, has the greatest representational capacity of any distribution, and offers an equivalence between uncorrelatedness and independence for joint distributions. Accounting for the mean and variance of activations throughout the layers of deep neural networks has had a significant effect on facilitating their effective training, but seldom has a prescription for precisely what distribution these activations should take, and how this might be achieved, been offered. Motivated by the information-theoretic properties of the normal distribution, we address this question and concurrently present normality normalization: a novel normalization layer which encourages normality in the feature representations of neural networks using the power transform and employs additive Gaussian noise during training. Our experiments comprehensively demonstrate the effectiveness of normality normalization, in regards to its generalization performance on an array of widely used model and dataset combinations, its strong performance across various common factors of variation such as model width, depth, and training minibatch size, its suitability for usage wherever existing normalization layers are conventionally used, and as a means to improving model robustness to random perturbations.
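The two ingredients named above (a power transform toward normality plus additive Gaussian noise) can be illustrated with a minimal NumPy sketch; the exponent grid, the Gaussianity score, and the noise scale below are illustrative choices, not the paper's exact layer.

```python
import numpy as np

def normality_normalize(x, noise_std=0.1, lambdas=np.linspace(0.1, 2.0, 20),
                        train=True, rng=None):
    """Sketch of a normality-encouraging normalization for one activation vector.

    Standardizes x, applies a signed power transform with the exponent that
    minimizes a skewness/kurtosis Gaussianity penalty (a stand-in for a proper
    normal log-likelihood fit), then adds Gaussian noise during training.
    """
    rng = rng or np.random.default_rng(0)
    x = (x - x.mean()) / (x.std() + 1e-8)
    best, best_score = x, -np.inf
    for lam in lambdas:
        y = np.sign(x) * np.abs(x) ** lam            # signed power transform
        y = (y - y.mean()) / (y.std() + 1e-8)
        # Gaussianity proxy: penalize squared skewness and excess kurtosis
        score = -(np.mean(y ** 3) ** 2) - (np.mean(y ** 4) - 3.0) ** 2
        if score > best_score:
            best, best_score = y, score
    if train:
        best = best + noise_std * rng.standard_normal(best.shape)
    return best
```

On a heavy-tailed input, the selected exponent pulls the activation distribution markedly closer to a standard normal.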
Paperid:157
Authors:Tiansheng Wen · Yifei Wang · Zequn Zeng · Zhong Peng · Yudi Su · Xinyang Liu · Bo Chen · Hongwei Liu · Stefanie Jegelka · Chenyu You
Abstract:
Many large-scale systems rely on high-quality deep representations (embeddings) to facilitate tasks like retrieval, search, and generative modeling. Matryoshka Representation Learning (MRL) recently emerged as a solution for adaptive embedding lengths, but it requires full model retraining and suffers from noticeable performance degradation at short lengths. In this paper, we show that sparse coding offers a compelling alternative for achieving adaptive representation with minimal overhead and higher fidelity. We propose Contrastive Sparse Representation (CSR), a method that sparsifies pre-trained embeddings into a high-dimensional but selectively activated feature space. By leveraging lightweight autoencoding and task-aware contrastive objectives, CSR preserves semantic quality while allowing flexible, cost-effective inference at different sparsity levels. Extensive experiments on image, text, and multimodal benchmarks demonstrate that CSR consistently outperforms MRL in terms of both accuracy and retrieval speed—often by large margins—while also cutting training time to a fraction of that required by MRL. Our results establish sparse coding as a powerful paradigm for adaptive representation learning in real-world applications where efficiency and fidelity are both paramount.
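The "selectively activated" idea amounts to a top-k sparse code. A minimal sketch follows; the encoder weights and the top-k rule are illustrative, and CSR's contrastive training objective is not shown:

```python
import numpy as np

def topk_sparse_code(embedding, W_enc, b_enc, k):
    """Encode a dense embedding into a high-dimensional, selectively activated code.

    Keeps only the k largest post-ReLU activations and zeros the rest, so
    downstream retrieval can vary k for different cost/fidelity trade-offs.
    """
    z = np.maximum(W_enc @ embedding + b_enc, 0.0)   # ReLU pre-code
    if k < z.size:
        thresh = np.partition(z, -k)[-k]             # k-th largest value
        z = np.where(z >= thresh, z, 0.0)            # ties may keep a few extra
    return z
```

Raising `k` trades sparsity (and retrieval cost) for fidelity, which is the adaptive-inference knob the abstract describes.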
Paperid:158
Authors:Peimeng Guan · Mark Davenport
Abstract:
Inverse problems aim to reconstruct unseen data from corrupted or perturbed measurements. While most work focuses on improving reconstruction quality, generalization accuracy and robustness are equally important, especially for safety-critical applications. Model-based architectures (MBAs), such as loop unrolling methods, are considered more interpretable and achieve better reconstructions. Empirical evidence suggests that MBAs are more robust to perturbations than black-box solvers, but the accuracy-robustness tradeoff in MBAs remains underexplored. In this work, we propose a simple yet effective training scheme for MBAs, called SGD jittering, which injects noise iteration-wise during reconstruction. We theoretically demonstrate that SGD jittering not only generalizes better than the standard mean squared error training but is also more robust to average-case attacks. We validate SGD jittering using denoising toy examples, seismic deconvolution, and single-coil MRI reconstruction. Both SGD jittering and its SPGD extension yield cleaner reconstructions for out-of-distribution data and demonstrate enhanced robustness against adversarial attacks.
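The jittering scheme itself is simple to state: inject Gaussian noise into the iterate at every unrolled reconstruction step. Here is a toy least-squares version; a real model-based architecture would interleave a learned network between steps, which is omitted in this sketch:

```python
import numpy as np

def jittered_unrolled_recon(y, A, steps=10, lr=0.1, sigma=0.01, rng=None):
    """Unrolled gradient-descent reconstruction with iteration-wise noise.

    At every unrolled step the current estimate is perturbed with Gaussian
    noise before the data-consistency gradient update, mimicking the SGD
    jittering training scheme.
    """
    rng = rng or np.random.default_rng(0)
    x = A.T @ y                                           # simple initialization
    for _ in range(steps):
        x_jit = x + sigma * rng.standard_normal(x.shape)  # inject noise
        grad = A.T @ (A @ x_jit - y)                      # data-consistency gradient
        x = x_jit - lr * grad
    return x
```

With small `sigma` the iterates still converge to a neighborhood of the clean solution, while the injected noise acts as the regularizer the paper analyzes.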
Paperid:159
Authors:Felipe Nuti · Tim Franzmeyer · Joao Henriques
Abstract:
Past work has studied the effects of fine-tuning on large language models' (LLMs) overall performance on certain tasks. However, a way to quantitatively and systematically analyze its effect on individual outputs is still lacking. In this work, we propose a new method for measuring the contribution that fine-tuning makes to individual LLM responses, assuming access to the original pre-trained model. Our method takes into account the model's intermediate hidden states, giving a more fine-grained insight into the effects of fine-tuning than a simple comparison of the final outputs of pre-trained and fine-tuned models. We introduce and theoretically analyze an exact decomposition of any fine-tuned LLM into a pre-training component and a fine-tuning component. Empirically, we find that one can steer model behavior and performance by up- or down-scaling the fine-tuning component during the forward pass. Motivated by this finding and our theoretical analysis, we define the Tuning Contribution ($\mathrm{TuCo}$) in terms of the ratio of the magnitudes of the fine-tuning component and the pre-training component. We find that three prominent adversarial attacks on LLMs circumvent safety measures in a way that reduces the Tuning Contribution, and that $\mathrm{TuCo}$ is consistently lower on prompts where the attacks succeed compared to ones where they do not. This suggests that attenuating the effect of fine-tuning on model outputs plays a role in the success of these attacks. In short, $\mathrm{TuCo}$ enables the quantitative study of how fine-tuning influences model behavior and safety, and vice versa.
Paperid:160
Authors:Jessica Dai · Paula Gradu · Inioluwa Raji · Benjamin Recht
Abstract:
When an individual reports a negative interaction with some system, how can their personal experience be contextualized within broader patterns of system behavior? We study the incident database problem, where individual reports of adverse events arrive sequentially and are aggregated over time. In this work, our goal is to identify whether there are subgroups—defined by any combination of relevant features—that are disproportionately likely to experience harmful interactions with the system. We formalize this problem as a sequential hypothesis test, and identify conditions on reporting behavior that are sufficient for making inferences about disparities in true rates of harm across subgroups. We show that algorithms for sequential hypothesis tests can be applied to this problem with a standard multiple-testing correction. We then demonstrate our method on real-world datasets, including mortgage decisions and vaccine side effects; on each, our method (re-)identifies subgroups known to experience disproportionate harm using only a fraction of the data that was initially used to discover them.
Paperid:161
Authors:Wei Yao · Zeliang Zhang · Huayi Tang · Yong Liu
Abstract:
Model ensemble adversarial attack has become a powerful method for generating transferable adversarial examples that can target even unknown models, but its theoretical foundation remains underexplored. To address this gap, we provide early theoretical insights that serve as a roadmap for advancing model ensemble adversarial attack. We first define transferability error to measure the error in adversarial transferability, alongside concepts of diversity and empirical model ensemble Rademacher complexity. We then decompose the transferability error into vulnerability, diversity, and a constant, which rigorously explains the origin of transferability error in model ensemble attack: the vulnerability of an adversarial example to ensemble components, and the diversity of ensemble components. Furthermore, we apply the latest mathematical tools in information theory to bound the transferability error using complexity and generalization terms, validating three practical guidelines for reducing transferability error: (1) incorporating more surrogate models, (2) increasing their diversity, and (3) reducing their complexity in cases of overfitting. Finally, extensive experiments with 54 models validate our theoretical framework, representing a significant step forward in understanding transferable model ensemble adversarial attacks.
Paperid:162
Authors:Haowen Ma · Zhiguo Long · Hua Meng
Abstract:
Density-based mode-seeking methods generate a \emph{density-ascending dependency} from low-density points towards higher-density neighbors. Current mode-seeking methods identify modes by breaking some dependency connections, but rely heavily on local data characteristics, requiring case-by-case threshold settings or human intervention to be effective for different datasets. To address this issue, we introduce a novel concept called \emph{typicality}, by exploring the \emph{locally defined} dependency from a \emph{global} perspective, to quantify how likely a point is to be a mode. We devise an algorithm that effectively and efficiently identifies modes with the help of the global-view typicality. To implement and validate our idea, we design a clustering method called TANGO, which not only leverages typicality to detect modes, but also utilizes graph-cut with an improved \emph{path-based similarity} to aggregate data into the final clusters. Moreover, this paper also provides some theoretical analysis on the proposed algorithm. Experimental results on several synthetic and extensive real-world datasets demonstrate the effectiveness and superiority of TANGO.
Paperid:163
Authors:Mengyang Sun · Yihao Wang · Tao Feng · Dan Zhang · Yifan Zhu · Jie Tang
Abstract:
In order to streamline the fine-tuning of foundation models, Low-Rank Adapters (LoRAs) have been substantially adopted across various fields, including instruction tuning and domain adaptation. The underlying concept of LoRA involves decomposing a full-rank matrix into the product of two lower-rank matrices, which reduces storage consumption and accelerates the training process. Furthermore, to address the limited expressive capacity of LoRA, the Mixture-of-Experts (MoE) approach has been introduced to incorporate multiple LoRA adapters. The integration of LoRA experts leads to a visible improvement across several downstream scenarios. However, the mixture of LoRAs (MoE-LoRA) still exhibits low robustness during tuning and inference. Inspired by Riemannian Preconditioners, which train LoRA as a sub-space projector, we propose a new training strategy for MoE-LoRA to stabilize and boost its feature learning procedure via multi-space projections. Examinations on SGD and AdamW optimizers demonstrate the effectiveness of our methodology. The code will be available upon publication.
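For reference, the object being mixed, a frozen weight plus gated low-rank LoRA updates, can be sketched as follows. The gating and shapes are illustrative; the paper's multi-space-projection contribution concerns how these factors are optimized, which is not shown:

```python
import numpy as np

def moe_lora_forward(x, W0, experts, gate_logits, alpha=1.0):
    """Forward pass of a frozen weight W0 plus a gated mixture of LoRA experts.

    Each expert is a (B, A) pair with B: (d_out, r) and A: (r, d_in); each
    low-rank update B @ A is mixed by softmax gate weights and added to the
    frozen base mapping.
    """
    g = np.exp(gate_logits - gate_logits.max())
    g = g / g.sum()                                   # softmax gate
    delta = sum(w * (B @ (A @ x)) for w, (B, A) in zip(g, experts))
    return W0 @ x + alpha * delta
```

With `B` initialized to zero (standard LoRA practice), the mixture starts out exactly equal to the frozen model.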
Paperid:164
Authors:Peijia Qin · Jianguo Zhang
Abstract:
Deep neural networks incorporating discrete latent variables have shown significant potential in sequence modeling. A notable approach is to leverage vector quantization (VQ) to generate discrete representations within a codebook. However, its discrete nature prevents the use of standard backpropagation, which has led to challenges in efficient codebook training. In this work, we introduce Meta-Quantization (MQ), a novel vector quantization training framework inspired by meta-learning. Our method separates the optimization of the codebook and the auto-encoder into two levels. Furthermore, we introduce a hyper-net to replace the embedding-parameterized codebook, enabling the codebook to be dynamically generated based on the feedback from the auto-encoder. Different from previous VQ objectives, our innovation results in a meta-objective that makes the codebook training task-aware. We validate the effectiveness of MQ with VQ-VAE and VQGAN architectures on image reconstruction and generation tasks. Experimental results showcase the superior generative performance of MQ, underscoring its potential as a robust alternative to existing VQ methods.
Paperid:165
Authors:Matthew Joseph · Alex Kulesza · Alexander Yu
Abstract:
We study the $\ell_2$ mechanism for computing a $d$-dimensional statistic with bounded $\ell_2$ sensitivity under approximate differential privacy. Across a range of privacy parameters, we find that the $\ell_2$ mechanism obtains error approaching that of the Laplace mechanism as $d \to 1$ and approaching that of the Gaussian mechanism as $d \to \infty$; however, it dominates both in between.
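For intuition, noise with density proportional to $\exp(-\lVert z\rVert_2/\sigma)$ can be sampled by drawing a Gamma-distributed radius and a uniform direction; whether this matches the paper's exact mechanism and its calibration to the privacy parameters is an assumption of this sketch.

```python
import numpy as np

def l2_mechanism_noise(d, scale, rng=None):
    """Sample noise with density proportional to exp(-||z||_2 / scale) in R^d.

    The radial density is r^(d-1) exp(-r/scale), i.e. Gamma(d, scale), and the
    direction is uniform on the sphere; this is the standard sampler for such
    spherically symmetric distributions.
    """
    rng = rng or np.random.default_rng(0)
    r = rng.gamma(shape=d, scale=scale)          # radius ~ Gamma(d, scale)
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                       # uniform direction
    return r * u
```

In one dimension this density is exactly the Laplace distribution, consistent with the $d \to 1$ behavior described above.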
Paperid:166
Authors:Yihao Xue · Jiping Li · Baharan Mirzasoleiman
Abstract:
Weak-to-Strong Generalization (W2SG), where a weak model supervises a stronger one, serves as an important analogy for understanding how humans might guide superhuman intelligence in the future. Promising empirical results revealed that a strong model can surpass its weak supervisor. While recent work has offered theoretical insights into this phenomenon, a clear understanding of the interactions between weak and strong models that drive W2SG remains elusive. We investigate W2SG through a theoretical lens and show that it can be characterized using kernels derived from the principal components of weak and strong models' internal representations. These kernels can be used to define a space that, at a high level, captures what the weak model is unable to learn but is learnable by the strong model. The projection of labels onto this space quantifies how much the strong model falls short of its full potential due to weak supervision. This characterization also provides insights into how certain errors in weak supervision can be corrected by the strong model, regardless of overfitting. Our theory has significant practical implications, providing a representation-based metric that predicts W2SG performance trends without requiring labels, as shown in experiments on molecular predictions with transformers and 5 NLP tasks involving 52 LLMs.
Paperid:167
Authors:Adel Javanmard · Vahab Mirrokni · Jean Pouget-Abadie
Abstract:
Estimating causal effects from randomized experiments is only possible if participants are willing to disclose their potentially sensitive responses. Differential privacy, a widely used framework for ensuring an algorithm’s privacy guarantees, can encourage participants to share their responses without the risk of de-anonymization. However, many mechanisms achieve differential privacy by adding noise to the original dataset, which reduces the precision of causal effect estimation. This introduces a fundamental trade-off between privacy and variance when performing causal analyses on differentially private data. In this work, we propose a new differentially private mechanism, Cluster-DP, which leverages a given cluster structure in the data to improve the privacy-variance trade-off. While our results apply to any clustering, we demonstrate that identifying higher-quality clusters—using a quality measure we define—can reduce the variance penalty while maintaining privacy guarantees. Finally, we evaluate the theoretical and empirical performance of our Cluster-DP algorithm on both real and simulated data, comparing it to common baselines, including two special cases of our algorithm: its unclustered version and a uniform-prior version.
Paperid:168
Authors:Haobo Jiang · Jin Xie · jian Yang · Liang Yu · Jianmin Zheng
Abstract:
In this paper, we propose a novel 3D registration paradigm, Generative Point Cloud Registration, which bridges advanced 2D generative models with 3D matching tasks to enhance registration performance. Our key idea is to generate crossview consistent image pairs that are well-aligned with the source and target point clouds, enabling geometric-color feature fusion to facilitate robust matching. To ensure high-quality matching, the generated image pair should feature both 2D-3D geometric consistency and cross-view texture consistency. To achieve this, we introduce Match-ControlNet, a matching-specific, controllable 2D generative model. Specifically, it leverages the depth-conditioned generation capability of ControlNet to produce images that are geometrically aligned with depth maps derived from point clouds, ensuring 2D-3D geometric consistency. Additionally, by incorporating a coupled conditional denoising scheme and coupled prompt guidance, Match-ControlNet further promotes cross-view feature interaction, guiding texture consistency generation. Our generative 3D registration paradigm is general and could be seamlessly integrated into various registration methods to enhance their performance. Extensive experiments on 3DMatch and ScanNet datasets verify the effectiveness of our approach.
Paperid:169
Authors:Wei Xiao · Johnson Tsun-Hsuan Wang · Chuang Gan · Daniela Rus
Abstract:
Safe learning is central to AI-enabled robots, where a single failure may lead to catastrophic results. Existing safe learning methods do not scale, are inefficient and hard to train, and tend to generate unstable signals under noisy inputs, making them challenging to deploy on robots. To address these challenges, we propose the Adaptive explicit-Barrier Net (ABNet), in which barriers explicitly show up in the closed-form model that guarantees safety. The ABNet has the potential to incrementally scale toward larger safe foundation models. Each head of ABNet can learn safe control policies from different features and focus on a specific part of the observation. In this way, we do not need to directly construct a large model for complex tasks, which significantly facilitates the training of the model while ensuring its stable output. Most importantly, we can still formally prove the safety guarantees of ABNet. We demonstrate the efficiency and strength of ABNet in 2D robot obstacle avoidance, safe robot manipulation, and vision-based end-to-end autonomous driving, with results showing much better robustness and guarantees over existing models.
Paperid:170
Authors:Gal Dalal · Assaf Hallak · Gugan Chandrashekhar Mallika Thoppe · Shie Mannor · Gal Chechik
Abstract:
Policy gradient methods are notorious for having a large variance and high sample complexity. To mitigate this, we introduce SoftTreeMax, a generalization of softmax that employs planning. In SoftTreeMax, we extend the traditional logits with the multi-step discounted cumulative reward, topped with the logits of future states. We analyze SoftTreeMax and explain how tree expansion helps to reduce its gradient variance. We prove that the variance depends on the chosen tree-expansion policy. Specifically, we show that the closer the induced transitions are to being state-independent, the stronger the variance decay. With approximate forward models, we prove that the resulting gradient bias diminishes with the approximation error while retaining the same variance reduction. Ours is the first result to bound the gradient bias for an approximate model. In a practical implementation of SoftTreeMax, we utilize a parallel GPU-based simulator for fast and efficient tree expansion. Using this implementation in Atari, we show that SoftTreeMax reduces the gradient variance by three orders of magnitude. This leads to better sample complexity and improved performance compared to distributed PPO.
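The extended logits can be made concrete with a tiny exhaustive tree expansion. In the sketch below, `step_fn` and `logits_fn` are stand-in interfaces for the simulator and the policy network, and the exact aggregation in the paper may differ:

```python
import numpy as np

def softtreemax_policy(state, step_fn, logits_fn, depth, gamma=0.99):
    """SoftTreeMax-style action distribution via exhaustive tree expansion.

    Each depth-length action sequence is scored by its discounted cumulative
    reward plus the logits at the leaf state; scores are log-sum-exp'ed back
    to the first action and softmaxed. step_fn(state, a) -> (next_state,
    reward); logits_fn(state) -> per-action logits.
    """
    n_actions = len(logits_fn(state))

    def leaf_scores(s, remaining, acc, disc):
        if remaining == 0:
            return [acc + l for l in logits_fn(s)]      # leaf logits
        scores = []
        for a in range(n_actions):
            s2, r = step_fn(s, a)
            scores.extend(leaf_scores(s2, remaining - 1, acc + disc * r, disc * gamma))
        return scores

    extended = []
    for a in range(n_actions):
        s2, r = step_fn(state, a)
        leaves = np.array(leaf_scores(s2, depth - 1, r, gamma))
        m = leaves.max()
        extended.append(m + np.log(np.exp(leaves - m).sum()))  # log-sum-exp
    z = np.array(extended)
    p = np.exp(z - z.max())
    return p / p.sum()
```

A practical implementation batches this expansion on a GPU simulator rather than recursing, but the scoring is the same idea.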
Paperid:171
Authors:Ignacio Peis · Batuhan Koyuncu · Isabel Valera · Jes Frellsen
Abstract:
We introduce a novel generative framework for functions by integrating Implicit Neural Representations (INRs) and Transformer-based hypernetworks into latent variable models. Unlike prior approaches that rely on MLP-based hypernetworks with scalability limitations, our method employs a Transformer-based decoder to generate INR parameters from latent variables, addressing both representation capacity and computational efficiency. Our framework extends latent diffusion models (LDMs) to INR generation by replacing standard decoders with a Transformer-based hypernetwork, which can be trained either from scratch or via hyper-transforming—a strategy that fine-tunes only the decoder while freezing the pre-trained latent space. This enables efficient adaptation of existing generative models to INR-based representations without requiring full retraining. We validate our approach across multiple modalities, demonstrating improved scalability, expressiveness, and generalization over existing INR-based generative models. Our findings establish a unified and flexible framework for learning structured function representations.
Paperid:172
Authors:Anne Ouyang · Simon Guo · Simran Arora · Alex Zhang · William Hu · Christopher Re · Azalia Mirhoseini
Abstract:
Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate kernel generation. We introduce **KernelBench**, an open-source framework for evaluating LMs' ability to write fast and correct kernels on a suite of 250 carefully selected PyTorch ML workloads. KernelBench represents a real-world engineering environment, and making progress on the introduced benchmark directly translates to faster practical kernels. We introduce a new evaluation metric, $\text{fast}_p$, which measures the percentage of generated kernels that are functionally correct and offer a speedup greater than an adjustable threshold $p$ over baseline. Our experiments across various state-of-the-art models and test-time methods show that frontier reasoning models perform the best out of the box but still fall short overall, matching the PyTorch baseline in less than 20\% of the cases. While we show that results can improve by leveraging execution and profiling feedback during iterative refinement, KernelBench remains a challenging benchmark, with its difficulty increasing as we raise the speedup threshold $p$.
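The $\text{fast}_p$ metric is straightforward to compute from per-task results; the tuple layout below is illustrative, not KernelBench's actual schema:

```python
def fast_p(results, p):
    """Compute a fast_p-style score: the fraction of tasks whose generated
    kernel is functionally correct AND achieves a speedup greater than p.

    `results` is a list of (correct: bool, speedup: float) pairs, where
    speedup is baseline_time / kernel_time.
    """
    if not results:
        return 0.0
    hits = sum(1 for correct, speedup in results if correct and speedup > p)
    return hits / len(results)
```

Note that $\text{fast}_0$ recovers plain correctness, while $\text{fast}_1$ is the fraction of kernels that are both correct and faster than the baseline.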
Paperid:173
Authors:Abdelhakim Benechehab · Vasilii Feofanov · Giuseppe Paolo · Albert Thomas · Maurizio Filippone · Balázs Kégl
Abstract:
Pre-trained foundation models (FMs) have shown exceptional performance in univariate time series forecasting tasks. However, several practical challenges persist, including managing intricate dependencies among features and quantifying uncertainty in predictions. This study aims to tackle these critical limitations by introducing adapters: feature-space transformations that facilitate the effective use of pre-trained univariate time series FMs for multivariate tasks. Adapters operate by projecting multivariate inputs into a suitable latent space and applying the FM independently to each dimension in a zero-shot manner. Inspired by the literature on representation learning and partially stochastic Bayesian neural networks, we present a range of adapters and optimization/inference strategies. Experiments conducted on both synthetic and real-world datasets confirm the efficacy of adapters, demonstrating substantial enhancements in forecasting accuracy and uncertainty quantification compared to baseline methods. Our framework, AdaPTS, positions adapters as a modular, scalable, and effective solution for leveraging time series FMs in multivariate contexts, thereby promoting their wider adoption in real-world applications.
Paperid:174
Authors:Shogo Iwazaki · Shion Takeno
Abstract:
We study the Gaussian process (GP) bandit problem, whose goal is to minimize regret under an unknown reward function lying in some reproducing kernel Hilbert space (RKHS). The maximum posterior variance analysis is vital in analyzing near-optimal GP bandit algorithms such as maximum variance reduction (MVR) and phased elimination (PE). Therefore, we first show a new upper bound on the maximum posterior variance, which improves the dependence on the noise variance parameter of the GP. By leveraging this result, we refine MVR and PE to obtain (i) a nearly optimal regret upper bound in the noiseless setting and (ii) regret upper bounds that are optimal with respect to the RKHS norm of the reward function. Furthermore, as another application of our proposed bound, we analyze the GP bandit under the time-varying noise variance setting, which is the kernelized extension of the linear bandit with heteroscedastic noise. For this problem, we show that MVR- and PE-based algorithms achieve noise variance-dependent regret upper bounds, which match our regret lower bound.
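The quantity at the center of this analysis, the GP posterior variance, is computed by the standard formula below; it is shown only to fix notation, since the paper's contribution is bounding its maximum:

```python
import numpy as np

def gp_posterior_variance(K, k_star, k_ss, noise_var):
    """Posterior variance of a GP at a test point x*.

    K: n x n kernel matrix on observed inputs; k_star: kernel vector between
    x* and the observations; k_ss: prior variance k(x*, x*). Implements
    sigma_n^2(x*) = k_ss - k_star^T (K + noise_var * I)^{-1} k_star.
    """
    n = K.shape[0]
    v = np.linalg.solve(K + noise_var * np.eye(n), k_star)  # (K + s^2 I)^{-1} k*
    return k_ss - k_star @ v
```

With a single observation and k_ss = k_star = K = 1, this reduces to noise_var / (1 + noise_var), making explicit how the noise variance parameter enters the bound being improved.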
Paperid:175
Authors:Hongwei Li · Yuheng Tang · Shiqi Wang · Wenbo Guo
Abstract:
Recent research builds various patching agents that combine large language models (LLMs) with non-ML tools and achieve promising results on the state-of-the-art (SOTA) software patching benchmark, SWE-Bench. Based on how they determine the patching workflow, existing patching agents can be categorized as agent-based planning methods, which rely on LLMs for planning, and human-based planning methods, which follow a pre-defined workflow. At a high level, agent-based planning methods achieve high patching performance but with a high cost and limited stability. Human-based planning methods, on the other hand, are more stable and efficient but have key workflow limitations that compromise their patching performance. In this paper, we propose PatchPilot, an agentic patcher that strikes a balance between patching efficacy, stability, and cost-efficiency. PatchPilot proposes a novel human-based planning workflow with five components: reproduction, localization, generation, validation, and refinement (where refinement is unique to PatchPilot). We introduce novel and customized designs for each component to optimize their effectiveness and efficiency. Through extensive experiments on the SWE-Bench benchmarks, PatchPilot shows superior performance compared to existing open-source methods while maintaining low cost (less than \$1 per instance) and ensuring higher stability. We also conduct a detailed ablation study to validate the key designs in each component.
Paperid:176
Authors:Bariscan Kurtkaya · Fatih Dinc · Mert Yuksekgonul · Marta Blanco-Pozo · Ege Cirakman · Mark Schnitzer · Yucel Yemez · Hidenori Tanaka · Yuan · Nina Miolane
Abstract:
Short-term memory is essential for cognitive processing, yet our understanding of its neural mechanisms remains unclear. A key focus in neuroscience has been the study of sequential activity patterns, where neurons fire one after another within large networks, to explain how information is maintained. While recurrent connections were shown to drive sequential dynamics, a mechanistic understanding of this process still remains unknown. In this work, we first introduce two unique mechanisms that can subserve short-term memory: slow-point manifolds generating direct sequences or limit cycles providing temporally localized approximations. Then, through analytical models, we identify fundamental properties that govern the selection of these mechanisms, \textit{i.e.}, we derive theoretical scaling laws for critical learning rates as a function of the delay period length, beyond which no learning is possible. We empirically verify these observations by training and evaluating more than 35,000 recurrent neural networks (RNNs) that we will publicly release upon publication. Overall, our work provides new insights into short-term memory mechanisms and proposes experimentally testable predictions for systems neuroscience.
Paperid:177
Authors:Lutfi Erdogan · Hiroki Furuta · Sehoon Kim · Nicholas Lee · Suhong Moon · Gopala Anumanchipalli · Kurt Keutzer · Amir Gholaminejad
Abstract:
Large language models (LLMs) have shown remarkable advancements in enabling language agents to tackle complex, multi-step, long-horizon tasks such as web navigation. However, existing approaches struggle with challenges like misalignment between high-level user queries and low-level executable actions, difficulty in maintaining consistency over long-horizon tasks, and a lack of grounding in specific environments. To address these issues, we propose Plan-and-Act, a novel framework that incorporates explicit planning into LLM-based agents. Plan-and-Act consists of a Planner model that generates high-level plans to achieve user goals and an Executor model that translates these plans into precise, environment-specific actions. By explicitly separating planning from execution, Plan-and-Act improves alignment between the high-level reasoning and low-level actions, which enables consistent decision-making and adaptability to changes in the environment. To train the Planner effectively, we introduce a synthetic data generation method that annotates ground-truth trajectories with feasible plans, augmented by diverse and extensive examples to enhance generalization. We evaluate Plan-and-Act in the WebArena environment, demonstrating a SOTA 54% success rate on the WebArena-Lite benchmark.
Paperid:178
Authors:Sam Gijsen · Kerstin Ritter
Abstract:
Multimodal language modeling has enabled breakthroughs for representation learning, yet remains unexplored in the realm of functional brain data for pathology detection. This paper pioneers EEG-language models (ELMs) trained on clinical reports and 15,000 EEGs. We propose to combine multimodal alignment in this novel domain with time-series cropping and text segmentation, enabling an extension based on multiple instance learning to alleviate misalignment between irrelevant EEG or text segments. Our multimodal models significantly improve pathology detection compared to EEG-only models across four evaluations and for the first time enable zero-shot classification as well as retrieval of both neural signals and reports. In sum, these results highlight the potential of ELMs, representing significant progress for clinical applications.
Paperid:179
Authors:Yinyan Bu · Jiajie Yu · Kai Zheng · Xinyu Zhang · Piya Pal
Abstract:
We address the challenge of achieving angular super-resolution in multi-antenna radar systems that are widely used for localization, navigation, and automotive perception. A multi-antenna radar achieves very high resolution by computationally creating a large virtual sensing system using very few physical antennas. However, practical constraints imposed by hardware, noise, and a limited number of antennas can impede its performance. Conventional supervised learning models that rely on extensive pre-training with large datasets often exhibit poor generalization in unseen environments. To overcome these limitations, we propose NEAR, an untrained implicit neural representation (INR) framework that predicts radar responses at unseen locations from sparse measurements, by leveraging latent harmonic structures inherent in radar wave propagation. We establish new theoretical results linking antenna array response to the expressive power of INR architectures, and develop a novel physics-informed and latent geometry-aware regularizer. Our approach integrates classical signal representation with modern implicit neural learning, enabling high-resolution radar sensing that is both interpretable and generalizable. Extensive simulations and real-world experiments using radar platforms demonstrate NEAR's effectiveness and its ability to adapt to unseen environments.
Paperid:180
Authors:Wei Qu · Jiawei Guan · Rui Ma · kezhai · weikun wu · haobo Wang
Abstract:
We introduce Pallatom, an innovative protein generation model capable of producing protein structures with all-atom coordinates. Pallatom directly learns and models the joint distribution $P(\textit{structure}, \textit{seq})$ by focusing on $P(\textit{all-atom})$, effectively addressing the interdependence between sequence and structure in protein generation. To achieve this, we propose a novel network architecture specifically designed for all-atom protein generation. Our model employs a dual-track framework that tokenizes proteins into token-level and atomic-level representations, integrating them through a multi-layer decoding process with "traversing" representations and a recycling mechanism. We also introduce the $\texttt{atom14}$ representation method, which unifies the description of unknown side-chain coordinates, ensuring high fidelity between the generated all-atom conformation and its physical structure. Experimental results demonstrate that Pallatom excels in key metrics of protein design, including designability, diversity, and novelty, showing significant improvements across the board. Our model not only enhances the accuracy of protein generation but also exhibits excellent training efficiency, paving the way for future applications in larger and more complex systems.
Paperid:181
Authors:Yen-Jung Hsu · Ming-Yi Hong · Miao-Chen Chiang · Che Lin
Abstract:
Sequential recommendation in e-commerce utilizes users' anonymous browsing histories to personalize product suggestions without relying on private information. Existing item ID-based methods and multimodal models often overlook the temporal alignment of modalities like textual descriptions, visual content, and prices in user browsing sequences. To address this limitation, this paper proposes the Multimodal Time-aligned Shared Token Recommender (MTSTRec), a transformer-based framework with a single time-aligned shared token per product for efficient cross-modality fusion. MTSTRec preserves the distinct contributions of each modality while aligning them temporally to better capture user preferences. Extensive experiments demonstrate that MTSTRec achieves state-of-the-art performance across multiple sequential recommendation benchmarks, significantly improving upon existing multimodal fusion approaches.
Paperid:182
Authors:Kai Liu · Kaicheng Yang · Zheng Chen · Zhiteng Li · Yong Guo · Wenbo Li · Linghe Kong · Yulun Zhang
Abstract:
While super-resolution (SR) methods based on diffusion models (DM) have demonstrated inspiring performance, their deployment is impeded by heavy memory and computation requirements. Recent researchers apply two kinds of methods to compress or accelerate the DM. One is to compress the DM into 1-bit, aka binarization, alleviating the storage and computation pressure. The other distills the multi-step DM into only one step, significantly speeding up the inference process. Nonetheless, deploying DMs on resource-limited edge devices remains impractical. To address this problem, we propose BiMaCoSR, which combines binarization and one-step distillation to obtain extreme compression and acceleration. To prevent the catastrophic collapse of the model caused by binarization, we propose a sparse matrix branch (SMB) and a low-rank matrix branch (LRMB). Both auxiliary branches pass the full-precision (FP) information but in different ways. SMB absorbs the extreme values and its output is high rank, carrying abundant FP information. In contrast, the design of LRMB is inspired by LoRA and is initialized with the top-r SVD components, outputting a low-rank representation. The computation and storage overhead of our proposed branches can be safely ignored. Comprehensive comparison experiments show that BiMaCoSR outperforms current state-of-the-art binarization methods and achieves competitive performance compared with the FP one-step model. BiMaCoSR achieves a 23.8x compression ratio and a 27.4x speedup ratio compared to its FP counterpart. Our code will be released soon.
Paperid:183
Authors:Jingyang Ke · Feiyang Wu · Jiyi Wang · Jeffrey Markowitz · Anqi Wu
Abstract:
Traditional approaches to studying decision-making in neuroscience focus on simplified behavioral tasks where animals perform repetitive, stereotyped actions to receive explicit rewards. While informative, these methods constrain our understanding of decision-making to short timescale behaviors driven by explicit goals. In natural environments, animals exhibit more complex, long-term behaviors driven by intrinsic motivations that are often unobservable. Recent works in time-varying inverse reinforcement learning (IRL) aim to capture shifting motivations in long-term, freely moving behaviors. However, a crucial challenge remains: animals make decisions based on their history, not just their current state. To address this, we introduce SWIRL (SWitching IRL), a novel framework that extends traditional IRL by incorporating time-varying, history-dependent reward functions. SWIRL models long behavioral sequences as transitions between short-term decision-making processes, each governed by a unique reward function. SWIRL incorporates biologically plausible history dependency to capture how past decisions and environmental contexts shape behavior, offering a more accurate description of animal decision-making. We apply SWIRL to simulated and real-world animal behavior datasets and show that it outperforms models lacking history dependency, both quantitatively and qualitatively. This work presents the first IRL model to incorporate history-dependent policies and rewards to advance our understanding of complex, naturalistic decision-making in animals.
Paperid:184
Authors:Ethan Rathbun · Christopher Amato · Alina Oprea
Abstract:
Recent works have demonstrated the vulnerability of Deep Reinforcement Learning (DRL) algorithms against training-time, backdoor poisoning attacks. The objectives of these attacks are twofold: induce pre-determined, adversarial behavior in the agent upon observing a fixed trigger during deployment while allowing the agent to solve its intended task during training. Prior attacks assume arbitrary control over the agent's rewards, inducing values far outside the environment's natural constraints. This results in brittle attacks that fail once the proper reward constraints are enforced. Thus, in this work we propose a new class of backdoor attacks against DRL which are the first to achieve state-of-the-art performance under strict reward constraints. These ``inception'' attacks manipulate the agent's training data -- inserting the trigger into prior observations and replacing high-return actions with those of the targeted adversarial behavior. We formally define these attacks and prove they achieve both adversarial objectives against arbitrary Markov Decision Processes (MDPs). Using this framework, we devise an online inception attack which achieves a 100% attack success rate on multiple environments under constrained rewards while minimally impacting the agent's task performance.
Paperid:185
Authors:Zhaohe Liao · Jiangtong Li · Siyu Sun · Qingyang Liu · Fengshun Xiao · Tianjiao Li · Qiang Zhang · Guang Chen · Li Niu · Changjun Jiang · Liqing Zhang
Abstract:
Video Question-Answering (VideoQA) remains challenging in achieving advanced cognitive reasoning due to the uncontrollable and opaque reasoning processes in existing Multimodal Large Language Models (MLLMs). To address this issue, we propose a novel Language-centric Tree Reasoning (LTR) framework that aims to enhance the reasoning ability of models. In detail, it recursively divides the original question into logically manageable parts and conquers them piece by piece, enhancing the reasoning capabilities and interpretability of existing MLLMs. Specifically, in the first stage, LTR focuses on language to recursively generate a language-centric logical tree, which gradually breaks down the complex cognitive question into simple perceptual ones and plans the reasoning path through a RAG-based few-shot approach. In the second stage, with the aid of video content, LTR performs bottom-up logical reasoning within the tree to derive the final answer along with the traceable reasoning path. Experiments across 11 VideoQA benchmarks demonstrate that our LTR framework significantly improves both accuracy and interpretability compared to state-of-the-art MLLMs. To our knowledge, this is the first work to implement a language-centric logical tree to guide MLLM reasoning in VideoQA, paving the way for language-centric video understanding from perception to cognition.
Paperid:186
Authors:Steve Hanneke · Qinglin Meng · Amirreza Shaeiri
Abstract:
We study the problem of multiclass classification when the number of labels can be unbounded within the PAC learning framework. Our main contribution is a theory that demonstrates a simple and elegant agnostic-to-realizable reduction for this framework. This resolves an open problem raised by the recent work of (Hopkins et al., 2022). Notably, our result is the first representation-preserving multiclass agnostic-to-realizable reduction, in contrast with the compression-based approach of the work of (David et al., 2017). Furthermore, our main theorem is stated in an abstract framework, called ``Unified PAC Learning'', which encompasses a range of frameworks, including multiclass PAC learning, list PAC learning, and multilabel PAC learning. In addition, we explore representation-preserving reductions to the realizable setting for two noise models, namely Massart noise and Tsybakov noise, in the multiclass PAC learning framework. We believe our technique may find other applications in ensuing studies of theoretical machine learning.
Paperid:187
Authors:Tuan Truong · Ngọc Quân Phạm · Quyen Tran · Tan Nguyen · Dinh Phung · Trung Le
Abstract:
We introduce Interactive Bayesian Distributional Robustness (IBDR), a novel Bayesian inference framework that allows modeling the interactions between particles, thereby enhancing ensemble quality through increased particle diversity. IBDR is grounded in a generalized theoretical framework that connects the distributional population loss with the approximate posterior, motivating a practical dual optimization procedure that enforces distributional robustness while fostering particle diversity. We evaluate IBDR's performance against various baseline methods using the VTAB-1K benchmark and the common reasoning language task. The results consistently show that IBDR outperforms these baselines, underscoring its effectiveness in real-world applications. Our code is publicly available at https://anonymous.4open.science/r/DRO-1C12/
Paperid:188
Authors:Ruiquan Huang · Yingbin LIANG · Jing Yang
Abstract:
Language recognition tasks are fundamental in natural language processing (NLP) and have been widely used to benchmark the performance of large language models (LLMs). These tasks also play a crucial role in explaining the working mechanisms of transformers. In this work, we focus on two representative tasks in the category of regular language recognition, known as 'even pairs' and 'parity check', the aim of which is to determine whether the occurrences of certain subsequences in a given sequence are even. Our goal is to explore how a one-layer transformer, consisting of an attention layer followed by a linear layer, learns to solve these tasks by theoretically analyzing its training dynamics under gradient descent. While even pairs can be solved directly by a one-layer transformer, parity check needs to be solved by integrating Chain-of-Thought (CoT), either into the inference stage of a transformer well-trained for the even pairs task, or into the training of a one-layer transformer. For both problems, our analysis shows that the joint training of attention and linear layers exhibits two distinct phases. In the first phase, the attention layer grows rapidly, mapping data sequences into separable vectors. In the second phase, the attention layer becomes stable, while the linear layer grows logarithmically and converges in direction to a max-margin hyperplane that correctly separates the attention layer outputs into positive and negative samples, and the loss decreases at a rate of $O(1/t)$. Our experiments validate these theoretical results.
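The two recognition tasks themselves are easy to state outside the transformer setting. A minimal sketch, assuming a binary alphabet (the abstract does not fix one): 'parity check' asks whether the count of 1s is even, and 'even pairs' asks whether the number of adjacent unequal pairs is even, which reduces to comparing the first and last tokens.

```python
def parity_check(seq):
    # True iff the number of 1s in the sequence is even.
    return sum(seq) % 2 == 0

def even_pairs(seq):
    # True iff the number of adjacent unequal pairs (01 or 10) is even.
    flips = sum(a != b for a, b in zip(seq, seq[1:]))
    return flips % 2 == 0

# An even number of "flips" means the sequence ends where it started,
# so even_pairs(seq) is equivalent to seq[0] == seq[-1].
```

The first-equals-last shortcut is consistent with the abstract's finding that even pairs is directly solvable by a one-layer transformer, while parity check requires counting and hence Chain-of-Thought.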
Paperid:189
Authors:Xu Wan · Wenyue Xu · Chao Yang · Mingyang Sun
Abstract:
Recent advancements in Large Language Models (LLMs) and Reinforcement Learning (RL) have shown significant promise in decision-making tasks. Nevertheless, for large-scale industrial decision problems, both approaches face distinct challenges: LLMs lack real-time long-sequence decision-making capabilities, while RL struggles with sample efficiency in vast action spaces. To bridge this gap, we propose Agents Co-Evolution (ACE), a synergistic framework between LLMs and RL agents for large-scale decision-making scenarios. ACE introduces a dual-role trajectory refinement mechanism where LLMs act as both Policy Actor and Value Critic during RL training: the Actor refines suboptimal actions via multi-step reasoning and environment validation, while the Critic performs temporal credit assignment through trajectory-level reward shaping. Concurrently, the RL agent enhances LLMs' task-specific decision-making via prioritized experience replay. Through extensive experiments across multiple power grid operation challenges with action spaces exceeding 60K discrete actions, ACE demonstrates superior performance over existing RL methods and LLM-based methods.
Paperid:190
Authors:Johannes Hertrich · Robert Gruhlke
Abstract:
In order to sample from an unnormalized probability density function, we propose to combine continuous normalizing flows (CNFs) with rejection-resampling steps based on importance weights. We relate the iterative training of CNFs with regularized velocity fields to a JKO scheme and prove convergence of the involved velocity fields to the velocity field of the Wasserstein gradient flow (WGF). The alternation of local flow steps and non-local rejection-resampling steps makes it possible to overcome local minima or slow convergence of the WGF for multimodal distributions. Since the proposal of the rejection step is generated by the model itself, it does not suffer from the common drawbacks of classical rejection schemes. The arising model can be trained iteratively, reduces the reverse Kullback-Leibler (KL) loss function in each step, allows the generation of iid samples, and moreover allows for evaluations of the generated underlying density. Numerical examples show that our method yields accurate results on various test distributions, including high-dimensional multimodal targets, and significantly outperforms the state of the art in almost all cases.
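The non-local step pairs naturally with self-normalized importance weights. A minimal sketch of such a resampling step, assuming access to the unnormalized target log-density and the model's proposal log-density; this is a generic illustration of importance-weighted resampling, not the paper's exact scheme:

```python
import numpy as np

def importance_resample(samples, log_p_tilde, log_q, rng):
    """Resample proposal draws toward an unnormalized target.

    samples: (n, d) draws from the proposal q
    log_p_tilde: unnormalized target log-density at the samples, shape (n,)
    log_q: proposal log-density at the samples, shape (n,)
    """
    log_w = log_p_tilde - log_q
    log_w -= log_w.max()                 # stabilize before exponentiating
    w = np.exp(log_w)
    probs = w / w.sum()                  # self-normalized weights
    idx = rng.choice(len(samples), size=len(samples), p=probs)
    return samples[idx]

# Toy check: proposal N(0, 2^2), target N(3, 1); constants in the
# log-densities cancel in the self-normalized weights.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=(20000, 1))
log_p = -0.5 * (x[:, 0] - 3.0) ** 2      # unnormalized N(3, 1)
log_q = -0.5 * (x[:, 0] / 2.0) ** 2      # unnormalized N(0, 4)
y = importance_resample(x, log_p, log_q, rng)
```

After resampling, the empirical mean of `y` moves from the proposal mean toward the target mean.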
Paperid:191
Authors:Guangyi Wang · Wei Peng · lijiang Li · Wenyu Chen · Yuren Cai · Song-Zhi Su
Abstract:
While powerful for generation, Diffusion Probabilistic Models (DPMs) face slow sampling challenges, for which various distillation-based methods have been proposed. However, they typically require significant additional training costs and model parameter storage, limiting their practicality. In this work, we propose PCA-based Adaptive Search (PAS), which optimizes existing solvers for DPMs with minimal additional costs. Specifically, we first employ PCA to obtain a few basis vectors to span the high-dimensional sampling space, which enables us to learn just a set of coordinates to correct the sampling direction; furthermore, based on the observation that the cumulative truncation error exhibits an ``S''-shape, we design an adaptive search strategy that further enhances the sampling efficiency and reduces the number of stored parameters to approximately 10. Extensive experiments demonstrate that PAS can significantly enhance existing fast solvers in a plug-and-play manner with negligible costs. For example, on CIFAR-10, PAS improves DDIM's FID from 15.69 to 4.37 (NFE=10) using only 12 parameters and sub-minute training on a single A100 GPU. Code will be available.
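The first ingredient, a PCA basis spanning the sampling space, is a standard construction. A minimal sketch, assuming the basis is fit on a matrix of flattened intermediate sampling states; the variable names and the additive correction form are illustrative, not taken from the paper:

```python
import numpy as np

def pca_basis(states, k):
    """Top-k principal directions of a set of high-dimensional vectors.

    states: (n, d) matrix, e.g. flattened intermediate sampling states
    returns: (k, d) matrix with orthonormal rows
    """
    centered = states - states.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]

def corrected_direction(direction, basis, coords):
    # Correct a d-dimensional sampling direction using only k learned
    # coordinates in the PCA subspace.
    return direction + coords @ basis

rng = np.random.default_rng(1)
states = rng.normal(size=(64, 256))          # 64 states of dimension 256
basis = pca_basis(states, k=4)
d = rng.normal(size=256)
d_new = corrected_direction(d, basis, np.array([0.1, -0.2, 0.0, 0.3]))
```

Learning k coordinates instead of a d-dimensional correction is what keeps the stored parameter count tiny, as the abstract emphasizes.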
Paperid:192
Authors:Marco Cusumano-Towner · David Hafner · Alexander Hertzberg · Brody Huval · Aleksei Petrenko · Eugene Vinitsky · Erik Wijmans · Taylor Killian · Stuart Bowers · Ozan Sener · Philipp Kraehenbuehl · Vladlen Koltun
Abstract:
Self-play has powered breakthroughs in two-player and multi-player games. Here we show that self-play is a surprisingly effective strategy in another domain. We show that robust and naturalistic driving emerges entirely from self-play in simulation at unprecedented scale -- 1.6 billion km of driving. This is enabled by Gigaflow, a batched simulator that can synthesize and train on 42 years of subjective driving experience per hour on a single 8-GPU node. The resulting policy achieves state-of-the-art performance on three independent autonomous driving benchmarks. The policy outperforms the prior state of the art when tested on recorded real-world scenarios, amidst human drivers, without ever seeing human data during training. The policy is realistic when assessed against human references and achieves unprecedented robustness, averaging 17.5 years of continuous driving between incidents in simulation.
Paperid:193
Authors:Aojun Lu · Tao Feng · Hangjie Yuan · Yanan Sun
Abstract:
The quest for Continual Learning (CL) seeks to empower neural networks with the ability to learn and adapt incrementally. Central to this pursuit is addressing the stability-plasticity dilemma, which involves striking a balance between two conflicting objectives: preserving previously learned knowledge and acquiring new knowledge. Existing studies have proposed numerous CL methods to achieve this trade-off. However, these methods often overlook the impact of basic architecture on stability and plasticity, thus the trade-off is limited to the parameter level. In this paper, we delve into the conflict between stability and plasticity at the architectural level. We reveal that under an equal parameter constraint, deeper networks exhibit better plasticity, while wider networks are characterized by superior stability. To address this architectural-level dilemma, we introduce a novel framework termed Dual-Architecture (Dual-Arch), which serves as a plug-in component for CL. This framework leverages the complementary strengths of two distinct and independent networks: one dedicated to plasticity and the other to stability. Each network is designed with a specialized and lightweight architecture, tailored to its respective objective. Extensive experiments across datasets and CL methods demonstrate that Dual-Arch can enhance the performance of existing CL methods while being up to 87\% more compact in terms of parameters than the baselines.
Paperid:194
Authors:Nearchos Potamitis · Lars Klein · Roland Aydin · Robert West · Caglar Gulcehre · Akhil Arora
Abstract:
While numerous frameworks have been developed to enhance the reasoning abilities of large language models (LLMs), there is a scarcity of methods that effectively balance the trade-off between cost and quality. In this paper, we introduce Fleet of Agents (FoA), a novel and intuitive yet principled framework utilizing LLMs as agents to navigate through dynamic tree searches, employing a genetic-type particle filtering approach. FoA spawns a multitude of agents, each exploring the search space autonomously, followed by a selection phase where resampling based on a heuristic value function optimizes the balance between exploration and exploitation. This mechanism enables dynamic branching, adapting the exploration strategy based on discovered solutions. We conduct extensive experiments on three benchmark tasks, “Game of 24”, “Mini-Crosswords”, and “WebShop”, utilizing four different LLMs, “GPT-3.5”, “GPT-4”, “LLaMA3.2-11B”, and “LLaMA3.2-90B”. On average across all tasks and LLMs, FoA obtains a quality improvement of $\simeq5$% while requiring only $\simeq40$% of the cost of previous SOTA methods. Notably, our analyses reveal that (1) FoA achieves the best cost-quality trade-off among all benchmarked methods and (2) FoA + LLaMA3.2-11B surpasses the LLaMA3.2-90B model. FoA is publicly available at https://anonymous.4open.science/r/FoA-4D83
Paperid:195
Authors:Woohyun Lee · Hogun Park
Abstract:
Defending Graph Neural Networks (GNNs) against adversarial attacks requires balancing accuracy and robustness, a tradeoff often mishandled by traditional methods like adversarial training that intertwine these conflicting objectives within a single classifier. To overcome this limitation, we propose a self-supervised adversarial purification framework. We separate robustness from the classifier by introducing a dedicated purifier, which cleanses the input data before classification. In contrast to prior adversarial purification methods, we propose GPR-GAE, a novel graph auto-encoder, as a specialized purifier trained with a self-supervised strategy, adapting to diverse graph structures in a data-driven manner. Utilizing multiple Generalized PageRank (GPR) filters, GPR-GAE captures diverse structural representations for robust and effective purification. Our multi-step purification process further facilitates GPR-GAE to achieve precise graph recovery and robust defense against structural perturbations. Experiments across diverse datasets and attack scenarios demonstrate state-of-the-art robustness of GPR-GAE, showcasing it as an independent plug-and-play purifier for GNN classifiers.
Paperid:196
Authors:Hila Chefer · Uriel Singer · Amit Zohar · Yuval Kirstain · Adam Polyak · Yaniv Taigman · Lior Wolf · Shelly Sheynin
Abstract:
Despite tremendous recent progress, generative video models still struggle to capture real-world motion, dynamics, and physics. We show that this limitation arises from the conventional pixel reconstruction objective, which biases models toward appearance fidelity at the expense of motion coherence. To address this, we introduce VideoJAM, a novel framework that instills an effective motion prior into video generators by encouraging the model to learn a joint appearance-motion representation. VideoJAM is composed of two complementary units. During training, we extend the objective to predict both the generated pixels and their corresponding motion from a single learned representation. During inference, we introduce Inner-Guidance, a mechanism that steers the generation toward coherent motion by leveraging the model's own evolving motion prediction as a dynamic guidance signal. Notably, our framework can be applied to any video model with minimal adaptations, requiring no modifications to the training data or scaling of the model. VideoJAM achieves state-of-the-art performance in motion coherence, surpassing highly competitive proprietary models while also enhancing the perceived visual quality of the generations. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation.
Paperid:197
Authors:Ostmeier · Brian Axelrod · Maya Varma · Michael Moseley · Akshay Chaudhari · Curtis Langlotz
Abstract:
Transformer architectures rely on position encodings to capture token dependencies. Rotary Position Encoding (RoPE) has emerged as a popular choice in language models due to its efficient encoding of relative position information through key-query rotations. However, RoPE faces significant limitations beyond language processing: it is constrained to one-dimensional sequence data and, even with learnable phases, offers limited representational capacity. We address these challenges with Lie Relative Encodings (LieRE), which replaces RoPE's block-2D rotation matrix with a learned, dense, high-dimensional rotation matrix of variable density. Through extensive evaluation on three image datasets across 2D and 3D classification tasks, LieRE achieves a 2\% relative improvement over state-of-the-art baselines on 2D tasks and 1.5\% on 3D tasks, while demonstrating superior generalization to higher resolutions. Our implementation is computationally efficient, with results reproducible on 4 A100 GPUs in 30 minutes on CIFAR-100, and we release our code to facilitate further research.
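A learned dense rotation of the kind LieRE describes can be parameterized through the Lie-algebra route: skew-symmetrize a learned matrix and take the matrix exponential. A minimal sketch under that assumption (the paper's actual parameterization may differ); scaling the generator by position preserves RoPE's relative-position property, since R(p) R(q)^T = R(p - q):

```python
import numpy as np

def expm(S, terms=40):
    # Taylor-series matrix exponential with scaling and squaring.
    s = max(0, int(np.ceil(np.log2(max(np.linalg.norm(S, 2), 1e-12)))) + 1)
    X = S / 2.0 ** s
    E, term = np.eye(len(S)), np.eye(len(S))
    for k in range(1, terms):
        term = term @ X / k
        E = E + term
    for _ in range(s):
        E = E @ E
    return E

def lie_rotation(pos, generator):
    # Skew-symmetrizing puts the generator in the Lie algebra so(d);
    # the matrix exponential then lands in the rotation group SO(d).
    skew = 0.5 * (generator - generator.T)
    return expm(pos * skew)

rng = np.random.default_rng(0)
G = rng.normal(size=(8, 8))     # learned generator (here: random stand-in)
R = lie_rotation(2.0, G)        # an 8x8 rotation encoding position 2
```

Because all positions share one generator, the per-position rotations commute, which is what makes the encoding relative rather than absolute.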
Paperid:198
Authors:Ian Gemp · Andreas Haupt · Luke Marris · Siqi Liu · Georgios Piliouras
Abstract:
Behavioral diversity, expert imitation, fairness, safety goals and others give rise to preferences in sequential decision making domains that do not decompose additively across time. We introduce the class of convex Markov games that allow general convex preferences over occupancy measures. Despite infinite time horizon and strictly higher generality than Markov games, pure strategy Nash equilibria exist. Furthermore, equilibria can be approximated empirically by performing gradient descent on an upper bound of exploitability. Our experiments reveal novel solutions to classic repeated normal-form games, find fair solutions in a repeated asymmetric coordination game, and prioritize safe long-term behavior in a robot warehouse environment. In the prisoner's dilemma, our algorithm leverages transient imitation to find a policy profile that deviates from observed human play only slightly, yet achieves higher per-player utility while also being three orders of magnitude less exploitable.
Paperid:199
Authors:Nikita Tsoy · Ivan Kirev · Negin Rahimiyazdi · Nikola Konstantinov
Abstract:
Performativity, the phenomenon where outcomes are influenced by predictions, is particularly prevalent in social contexts where individuals strategically respond to a deployed model. In order to preserve the high accuracy of machine learning models under distribution shifts caused by performativity, Perdomo et al. (2020) introduced the concept of performative risk minimization (PRM). While this framework ensures model accuracy, it overlooks the impact of the PRM on the underlying distributions and the predictions of the model. In this paper, we initiate the analysis of the impact of PRM by studying a sequential performative risk minimization problem with binary random variables and linear performative shifts. We formulate two natural measures of impact. In the case of full information, where the distribution dynamics are known, we derive explicit formulas for the PRM solution and our impact measures. In the case of partial information, we provide performativity-aware statistical estimators, as well as simulations. Our analysis contrasts PRM to alternatives that do not model data shift and indicates that PRM can have amplified side effects compared to such methods.
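The setting described, binary outcomes with a linear performative shift, admits a compact simulation in the style of Perdomo et al.'s repeated risk minimization. A minimal sketch with illustrative constants (`p0` and `eps` are not from the paper); under squared loss the best response to the deployed distribution is its mean, so repeated retraining iterates to a performatively stable point p0 / (1 - eps):

```python
def deployed_mean(theta, p0=0.4, eps=0.2):
    # Linear performative shift: deploying prediction theta moves the
    # Bernoulli mean of the outcome (clipped to stay a probability).
    return min(1.0, max(0.0, p0 + eps * theta))

def repeated_risk_minimization(theta0, rounds=50, p0=0.4, eps=0.2):
    # Retrain against the distribution induced by the last deployed model:
    # under squared loss the optimizer is the current deployed mean.
    theta = theta0
    for _ in range(rounds):
        theta = deployed_mean(theta, p0, eps)
    return theta

theta_stable = repeated_risk_minimization(0.0)   # -> p0 / (1 - eps) = 0.5
```

The gap between this stable point and the baseline mean `p0` is exactly the kind of side effect on the underlying distribution that the paper's impact measures are designed to quantify.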
Paperid:200
Authors:Junyan Liu · Lillian Ratliff
Abstract:
We study the repeated principal-agent bandit game, where the principal indirectly explores an unknown environment by incentivizing an agent to play arms. Unlike prior work that assumes a greedy agent with full knowledge of reward means, we consider a self-interested learning agent who iteratively updates reward estimates and may explore arbitrarily with some probability. As a warm-up, we first consider a self-interested learning agent without exploration. We propose algorithms for both i.i.d. and linear reward settings with bandit feedback in a finite horizon $T$, achieving regret bounds of $\widetilde{O}(\sqrt{T})$ and $\widetilde{O}(T^{\frac{2}{3}})$, respectively. Specifically, these algorithms rely on a novel elimination framework coupled with new search algorithms which accommodate the uncertainty from the agent's learning behavior. We then extend the framework to handle an exploratory learning agent and develop an algorithm that achieves a $\widetilde{O}(T^{\frac{2}{3}})$ regret bound in the i.i.d. reward setup by enhancing the robustness of our elimination framework to potential agent exploration. Finally, when our agent model reduces to that in (Dogan et al., 2023a), we propose an algorithm based on our robust framework which achieves a $\widetilde{O}(\sqrt{T})$ regret bound, significantly improving upon their $\widetilde{O}(T^{\frac{11}{12}})$ bound.
Paperid:201
Authors:Irmak Sivgin · Sara Fridovich-Keil · Gordon Wetzstein · Mert Pilanci
Abstract:
Volume parameterizations abound in recent literature, encompassing methods from classic voxel grids to implicit neural representations. While implicit representations offer impressive capacity and improved memory efficiency compared to voxel grids, they traditionally require training through nonconvex optimization, which can be slow and sensitive to initialization and hyperparameters. We introduce GA-Planes, a novel family of implicit neural volume representations inspired by Geometric Algebra that can be trained using convex optimization, addressing the limitations of nonconvex methods. GA-Planes models generalize many existing representations, including any combination of features stored in tensor basis elements followed by a neural feature decoder, and can be adapted to convex or nonconvex training as needed for various inverse problems. In the 2D setting, we prove GA-Planes models are equivalent to a low-rank plus low-resolution matrix factorization that outperforms the classic low-rank plus sparse decomposition for fitting a natural image. In 3D, GA-Planes models exhibit competitive expressiveness, model size, and optimizability across tasks such as radiance field reconstruction, 3D segmentation, and video segmentation.
Paperid:202
Authors:Chenxi Wang · Linxiao Yang · Zhixian Wang · Liang Sun · Yi Wang
Abstract:
Diffusion models, known for their generative ability, have recently been adapted to time series analysis. Most pioneering works rely on the standard isotropic diffusion, treating each time step and the entire frequency spectrum identically. However, this may not be suitable for time series, which often have more informative low-frequency components. We empirically found that direct application of standard diffusion to time series may cause gradient contradiction during training, due to the rapid decrease of low-frequency information in the diffusion process. To this end, we propose a novel time series diffusion model, MA-TSD, which utilizes the moving average, a natural low-frequency filter, as the forward transition. Its backward process can be accelerated like DDIM and can further be viewed as time series super-resolution. Our experiments on various datasets demonstrate MA-TSD's superior performance in time series forecasting and super-resolution tasks.
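The forward transition described here can be illustrated with a plain centered moving average, which attenuates high frequencies first while leaving the low-frequency trend largely intact. A minimal sketch (the window size and the optional noise term are illustrative, not the paper's parameterization):

```python
import numpy as np

def moving_average_step(x, window=3):
    # One forward step: smooth the series with a centered moving average
    # ("same" mode keeps the length). This is a natural low-pass filter.
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="same")

def forward_process(x, steps, window=3, noise=0.0, rng=None):
    # Iterated smoothing (optionally with small additive noise) plays the
    # role of the isotropic noising process in a standard diffusion model.
    rng = rng or np.random.default_rng(0)
    traj = [x]
    for _ in range(steps):
        x = moving_average_step(x, window)
        if noise > 0:
            x = x + noise * rng.normal(size=x.shape)
        traj.append(x)
    return traj

t = np.linspace(0, 4 * np.pi, 256)
x0 = np.sin(t) + 0.5 * np.sin(8 * t)     # low- plus high-frequency component
traj = forward_process(x0, steps=20)
```

After a few steps the fast oscillation is largely gone while the slow sinusoid survives, which is why reversing this process resembles time series super-resolution.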
Paperid:203
Authors:Lorenzo Lucchese · Mikko Pakkanen · Almut E. D. Veraart
Abstract:
The expected signature maps a collection of data streams to a lower dimensional representation, with a remarkable property: the resulting feature tensor can fully characterize the data generating distribution. This "model-free" embedding has been successfully leveraged to build multiple domain-agnostic machine learning (ML) algorithms for time series and sequential data. The convergence results proved in this paper bridge the gap between the expected signature's empirical discrete-time estimator and its theoretical continuous-time value, allowing for a more complete probabilistic interpretation of expected signature-based ML methods. Moreover, when the data generating process is a martingale, we suggest a simple modification of the expected signature estimator with significantly lower mean squared error and empirically demonstrate how it can be effectively applied to improve predictive performance.
Paperid:204
Authors:junmin zhong · Emiliano Quinones Yumbla · Seyed Yousef Soltanian · Ruofan Wu · Wenlong Zhang · Jennie Si
Abstract:
This study proposes an innovative approach to soft exosuit-assisted human walking. Inspired by this real-life problem involving a wearable robot, we seek to overcome persistent difficulties in ensuring reliable reinforcement learning (RL)-based methods for physical device control. To address the multitude of challenges, including limited data and the lack of a simulator engine for human-robot interaction during walking, the demand for reduced computational overhead for real-time deployment, and time-efficient adaptation to achieve personalized control while ensuring human safety, we propose an online Adaptation from an offline Imitating expert Policy (AIP) approach. Our offline learning mimics human expert actions through real human walking demonstrations without robot assistance. The resulting policy is then used to initialize online actor-critic learning, the goal of which is to optimally personalize robot assistance. In addition to being fast and robust, our online RL method also possesses important properties such as learning convergence, dynamic stability, and solution optimality. We have successfully demonstrated our simple and robust framework for safe robot control on all five tested human participants, without selectively presenting results. The qualitative performance guarantees provided by our online RL, along with the consistent experimental validation of AIP control, represent the first demonstration of online adaptation for soft exosuit control personalization and serve as important evidence for the use of online RL in controlling a physical device to solve a real-life problem.
Paperid:205
Authors:Yu Zhou · Bingxuan Li · Mohan Tang · Xiaomeng Jin · Te-Lin Wu · Kuan-Hao Huang · Heng Ji · Kai-Wei Chang · Nanyun Peng
Abstract:
Large multimodal models (LMMs) often struggle to recognize novel concepts, as they rely on pretrained knowledge and have limited ability to capture subtle visual details. Domain-specific knowledge gaps in training also make them prone to confusing visually similar, commonly misrepresented, or low-resource concepts. To help LMMs better align nuanced visual features with language, improving their ability to recognize and reason about novel or rare concepts, we propose a Contrastive visual Data Augmentation (CoDA) strategy. CoDA extracts key contrastive textual and visual features of target concepts against the known concepts they are misrecognized as, and then uses multimodal generative models to produce targeted synthetic data. Automatic filtering of extracted features and augmented images is implemented to guarantee their quality, as verified by human annotators. We show the effectiveness and efficiency of CoDA on low-resource concept and diverse scene recognition datasets including iNaturalist and SUN. We additionally collect NovelSpecies, a benchmark dataset consisting of newly discovered animal species that are guaranteed to be unseen by LMMs. LLaVA-1.6 1-shot updating results on these three datasets show that CoDA significantly improves over SOTA visual data augmentation strategies, with absolute accuracy gains of 12.3% (NovelSpecies), 5.1% (SUN), and 6.0% (iNat).
Paperid:206
Authors:Weizhi Gao · Zhichao Hou · Junqi Yin · Feiyi Wang · Linyu Peng · Xiaorui Liu
Abstract:
Diffusion models have emerged as powerful generative models, but their high computation cost in iterative sampling remains a significant bottleneck. In this work, we present an in-depth and insightful study of state-of-the-art acceleration techniques for diffusion models, including caching and quantization, revealing their limitations in computation error and generation quality. To break these limits, this work introduces Modulated Diffusion (MoDiff), an innovative, rigorous, and principled framework that accelerates generative modeling through modulated quantization and error compensation. MoDiff not only inherits the advantages of existing caching and quantization methods but also serves as a general framework to accelerate all diffusion models. The advantages of MoDiff are supported by solid theoretical insight and analysis. In addition, extensive experiments on CIFAR-10 and LSUN demonstrate that MoDiff significantly reduces activation quantization from 8 bits to 3 bits without performance degradation in post-training quantization (PTQ). Our implementation is available at https://anonymous.4open.science/r/MoDiff-78DF/ and will be made publicly available.
Paperid:207
Authors:Chengying Fang · Wenke Huang · Guancheng Wan · Yihao Yang · Mang Ye
Abstract:
Federated Prompt Learning (FPL) adapts pretrained Vision-Language Models (VLMs) to federated learning through prompt tuning, leveraging their transferable representations and strong generalization capabilities. Traditional methods often require uniform prompt lengths for federated aggregation, limiting adaptability to clients with diverse prompt lengths and distribution biases. In this paper, we propose Federated Prompt Learning for Heterogeneous Client Adaptation (FedPHA), a novel framework that combines a fixed-length global prompt for efficient aggregation with local prompts of varying lengths to capture client-specific data characteristics. Additionally, FedPHA designs Singular Value Decomposition (SVD) based projection and bidirectional alignment to disentangle global conflicts arising from client heterogeneity, ensuring that personalized client tasks effectively utilize non-harmful global knowledge. This approach ensures that global knowledge improves model generalization while local knowledge preserves local optimization. Experimental results validate the effectiveness of FedPHA in achieving a balance between global and personalized knowledge in federated learning scenarios.
Paperid:208
Authors:Yixin Chen · Wenjing Chen · Alan Kuhnle
Abstract:
With the rapid growth of data in modern applications, parallel combinatorial algorithms for maximizing non-monotone submodular functions have gained significant attention. The state-of-the-art approximation ratio of $1/e$ is currently achieved only by a continuous algorithm (Ene & Nguyen, 2020) with adaptivity $\mathcal{O}(\log(n))$. In this work, we focus on size constraints and propose a $(1/4 - \varepsilon)$-approximation algorithm with high probability for this problem, as well as the first randomized parallel combinatorial algorithm achieving a $(1/e - \varepsilon)$ approximation ratio, which bridges the gap between continuous and combinatorial approaches. Both algorithms achieve $\mathcal{O}(\log(n)\log(k))$ adaptivity and $\mathcal{O}(n\log(n)\log(k))$ query complexity. Empirical results show our algorithms achieve competitive objective values, with the first algorithm particularly efficient in queries.
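For readers new to the area, the flavor of randomized combinatorial algorithms for this problem can be seen in the classical sequential random greedy (this is an illustrative baseline on a toy coverage objective, not the paper's low-adaptivity parallel algorithm):

```python
import random

def coverage(universe_covered_by, S):
    # f(S) = number of universe elements covered; a toy submodular oracle
    covered = set()
    for e in S:
        covered |= universe_covered_by[e]
    return len(covered)

def random_greedy(elements, f, k, rng=random):
    """Maximize f(S) subject to |S| <= k.

    At each step, sample uniformly among the (at most) k candidates with the
    largest marginal gain -- the randomization is what yields 1/e-type
    guarantees for non-monotone objectives in the sequential setting.
    """
    S = []
    for _ in range(k):
        cand = [e for e in elements if e not in S]
        if not cand:
            break
        cand.sort(key=lambda e: f(S + [e]) - f(S), reverse=True)
        S.append(rng.choice(cand[:k]))
    return S
```

The paper's contribution is, roughly, achieving comparable guarantees while replacing this inherently sequential loop with $\mathcal{O}(\log(n)\log(k))$ adaptive rounds.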
Paperid:209
Authors:Hancheng Min · Rene Vidal
Abstract:
Deep learning-based classifiers are known to be vulnerable to adversarial attacks. Existing methods for defending against such attacks require adding a defense mechanism or modifying the learning procedure (e.g., by adding adversarial examples). This paper shows that for certain data distributions one can learn a provably robust classifier using standard learning methods and without adding a defense mechanism. More specifically, this paper addresses the problem of finding a robust classifier for a binary classification problem in which the data comes from an isotropic mixture of Gaussians with orthonormal cluster centers. First, we characterize the largest $\ell_2$-attack any classifier can defend against while maintaining high accuracy, and show the existence of optimal robust classifiers achieving this maximum $\ell_2$-robustness. Next, we show that given data sampled from the orthonormal Gaussian mixture model, gradient flow on a two-layer network with a polynomial ReLU activation and without adversarial examples provably finds an optimal robust classifier.
Paperid:210
Authors:Guangzhi Sun · Yudong Yang · Jimin Zhuang · Changli Tang · Yixuan Li · Wei Li · Zejun MA · Chao Zhang
Abstract:
While recent advancements in reasoning optimization have significantly enhanced the capabilities of large language models (LLMs), existing efforts to improve reasoning have been limited to solving mathematical problems and focusing on visual graphical inputs, neglecting broader applications in general video understanding. This paper proposes video-SALMONN-o1, the first open-source reasoning-enhanced audio-visual LLM designed for general video understanding tasks. To enhance its reasoning abilities, we develop a reasoning-intensive dataset featuring challenging audio-visual questions with step-by-step solutions. We also propose process direct preference optimization (pDPO), which leverages contrastive step selection to achieve efficient step-level reward modelling tailored for multimodal inputs. Additionally, we introduce RivaBench, the first reasoning-intensive video understanding benchmark, featuring over 4,000 high-quality, expert-curated question-answer pairs across scenarios such as standup comedy, academic presentations, and synthetic video detection. video-SALMONN-o1 achieves 3-8% accuracy improvements over the LLaVA-OneVision baseline across different video reasoning benchmarks. Besides, pDPO achieves 6-8% improvements compared to the supervised fine-tuning model on RivaBench. Enhanced reasoning also endows video-SALMONN-o1 with zero-shot synthetic video detection capabilities.
Paperid:211
Authors:Antoine Wehenkel · Juan L. Gamella · Ozan Sener · Jens Behrmann · Guillermo Sapiro · Jörn Jacobsen · Marco Cuturi
Abstract:
Driven by steady progress in deep generative modeling, simulation-based inference (SBI) has emerged as the workhorse for inferring the parameters of stochastic simulators. However, recent work has demonstrated that model misspecification can harm SBI's reliability, preventing its adoption in important applications where only misspecified simulators are available. This work introduces robust posterior estimation (RoPE), a framework that overcomes model misspecification with a small real-world calibration set of ground truth parameter measurements. We formalize the misspecification gap as the solution of an optimal transport (OT) problem between learned representations of real-world and simulated observations, allowing RoPE to learn a model of the misspecification without placing additional assumptions on its nature. RoPE shows how the calibration set and OT together offer a controllable balance between calibrated uncertainty and informative inference even under severely misspecified simulators. Results on four synthetic tasks and two real-world problems with ground-truth labels demonstrate that RoPE outperforms baselines and consistently returns informative and calibrated credible intervals.
Paperid:212
Authors:Yukang Yang · Declan Campbell · Kaixuan Huang · Mengdi Wang · Jonathan Cohen · Taylor Webb
Abstract:
Many recent studies have found evidence for emergent reasoning capabilities in large language models, but debate persists concerning the robustness of these capabilities, and the extent to which they depend on structured reasoning mechanisms. To shed light on these issues, we perform a comprehensive study of the internal mechanisms that support abstract rule induction in an open-source language model (Llama3-70B). We identify an emergent symbolic architecture that implements abstract reasoning via a series of three computations. In early layers, abstraction heads convert input tokens to abstract variables based on the relations between those tokens. In intermediate layers, symbolic induction heads perform sequence induction over these variables. Finally, in later layers, retrieval heads predict the next token by retrieving the value associated with the predicted variable. These results point toward a resolution of the longstanding debate between symbolic and neural network approaches, suggesting that emergent reasoning in neural networks depends on the emergence of symbolic mechanisms.
Paperid:213
Authors:Wenhao Shen · Wanqi Yin · Xiaofeng Yang · Cheng Chen · Chaoyue Song · Zhongang Cai · Lei Yang · Hao Wang · Guosheng Lin
Abstract:
Human mesh recovery (HMR) from a single image is inherently ill-posed due to depth ambiguity and occlusions. Probabilistic methods have tried to solve this by generating numerous plausible 3D human mesh predictions, but they often exhibit misalignment with 2D image observations and weak robustness to in-the-wild images. To address these issues, we propose ADHMR, a framework that Aligns Diffusion-based HMR model in a preference optimization manner. First, we train a human mesh prediction assessment model, HMR-Scorer, capable of evaluating predictions even for in-the-wild images without 3D annotations. We then use HMR-Scorer to create a preference dataset, where each input image has a pair of winner and loser mesh predictions. This dataset is used to fine-tune the base model using direct preference optimization. Moreover, HMR-Scorer also helps improve existing HMR models by data cleaning, even with fewer training samples. Extensive experiments show that ADHMR outperforms current state-of-the-art methods.
Paperid:214
Authors:Antonis Vasileiou · Ben Finkelshtein · Floris Geerts · Ron Levie · Christopher Morris
Abstract:
The expressive power of message-passing graph neural networks (MPNNs) is reasonably well understood, primarily through combinatorial techniques from graph isomorphism testing. However, MPNNs' generalization abilities---making meaningful predictions beyond the training set---remain less explored. Current generalization analyses often overlook graph structure, limit the focus to specific aggregation functions, and assume the impractical, hard-to-optimize $0$-$1$ loss function. Here, we extend recent advances in graph similarity theory to assess the influence of graph structure, aggregation, and loss functions on MPNNs' generalization abilities. Our empirical study supports our theoretical insights, improving our understanding of MPNNs' generalization properties.
Paperid:215
Authors:Anant Khandelwal
Abstract:
Animating clipart images with seamless motion while maintaining visual fidelity and temporal coherence presents significant challenges. Existing methods, such as AniClipart, effectively model spatial deformations but often fail to ensure smooth temporal transitions, resulting in artifacts like abrupt motions and geometric distortions. Similarly, text-to-video (T2V) and image-to-video (I2V) models struggle to handle clipart due to the mismatch in statistical properties between natural video and clipart styles. This paper introduces FlexiClip, a novel approach designed to overcome these limitations by addressing the intertwined challenges of temporal consistency and geometric integrity. FlexiClip extends traditional Bézier curve-based trajectory modeling with key innovations: temporal Jacobians to correct motion dynamics incrementally, continuous-time modeling via probability flow ODEs (pfODEs) to mitigate temporal noise, and a flow matching loss inspired by GFlowNet principles to optimize smooth motion transitions. These enhancements ensure coherent animations across complex scenarios involving rapid movements and non-rigid deformations. Extensive experiments validate the effectiveness of FlexiClip in generating animations that are not only smooth and natural but also structurally consistent across diverse clipart types, including humans and animals. By integrating spatial and temporal modeling with pre-trained video diffusion models, FlexiClip sets a new standard for high-quality clipart animation, offering robust performance across a wide range of visual content. Project Page: https://creative-gen.github.io/flexiclip.github.io/
Paperid:216
Authors:Geigh Zollicoffer · Kenneth Eaton · Jonathan Balloch · Julia Kim · Wei Zhou · Robert Wright · Mark Riedl
Abstract:
Reinforcement learning (RL) using world models has found significant recent successes. However, when a sudden change to world mechanics or properties occurs then agent performance and reliability can dramatically decline. We refer to the sudden change in visual properties or state transitions as novelties. Implementing novelty detection within generated world model frameworks is a crucial task for protecting the agent when deployed. In this paper, we propose straightforward bounding approaches to incorporate novelty detection into world model RL agents by utilizing the misalignment of the world model's hallucinated states and the true observed states as a novelty score. We provide effective approaches to detecting novelties in a distribution of transitions learned by an agent in a world model. Finally, we show the advantage of our work in a novel environment compared to traditional machine learning novelty detection methods as well as currently accepted RL-focused novelty detection algorithms.
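The core idea, scoring novelty by the misalignment between the world model's hallucinated states and the true observed states, can be sketched in a few lines. The L2 distance and the quantile-based threshold below are illustrative assumptions, not the paper's exact bounding approach:

```python
import numpy as np

def novelty_scores(predicted, observed):
    """Per-transition novelty score: distance between the world model's
    predicted next states and the states actually observed."""
    return np.linalg.norm(np.asarray(predicted) - np.asarray(observed), axis=-1)

def novelty_threshold(calibration_scores, q=0.99):
    """Simple bound from in-distribution transitions: flag any transition
    whose misalignment exceeds the q-quantile of calibration scores."""
    return float(np.quantile(calibration_scores, q))
```

At deployment, a transition whose score exceeds the threshold would be treated as a novelty and could trigger a protective fallback policy.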
Paperid:217
Authors:Aro Lee · Ji Oon Lee
Abstract:
We consider a spiked random matrix model obtained by applying a function entrywise to a signal-plus-noise symmetric data matrix. We prove that the largest eigenvalue of this model, which we call a transformed spiked Wigner matrix, exhibits a Baik-Ben Arous-Péché (BBP)-type phase transition. We show that the law of the fluctuation converges to the Gaussian distribution when the effective signal-to-noise ratio (SNR) is above the critical number, and to the GOE Tracy-Widom distribution when the effective SNR is below the critical number. We provide precise formulas for the limiting distributions and also concentration estimates for the largest eigenvalues, both in the supercritical and the subcritical regimes.
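The BBP transition is easy to observe numerically for a plain (untransformed) spiked Wigner matrix, which may help build intuition for the result above; this sketch does not implement the paper's entrywise-transformed model:

```python
import numpy as np

def top_eigenvalue_spiked_wigner(n, snr, rng):
    """Largest eigenvalue of W + snr * x x^T, where W is a symmetric Wigner
    matrix normalized so its bulk edge sits at 2, and x is a unit spike.

    BBP prediction: for snr <= 1 the top eigenvalue sticks to the edge 2;
    for snr > 1 it detaches to approximately snr + 1/snr.
    """
    A = rng.standard_normal((n, n))
    W = (A + A.T) / np.sqrt(2 * n)   # off-diagonal variance 1/n
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)
    return np.linalg.eigvalsh(W + snr * np.outer(x, x))[-1]
```

Running this for SNR values on either side of the critical value 1 shows the eigenvalue detaching from the bulk edge only in the supercritical regime.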
Paperid:218
Authors:William Chen · Jinchuan Tian · Yifan Peng · Brian Yan · Chao-Han Yang · Shinji Watanabe
Abstract:
Neural scaling laws offer valuable insights for designing robust sequence processing architectures. While these laws have been extensively characterized in other modalities, their behavior in speech remains comparatively underexplored. In this work, we introduce OWLS, an open-access, reproducible suite of multilingual speech recognition and translation models spanning 0.25B to 18B parameters, with the 18B version being the largest speech model, to the best of our knowledge. OWLS leverages up to 360K hours of public speech data across 150 languages, enabling a systematic investigation into how data, model, and compute scaling each influence performance in multilingual speech tasks. We use OWLS to derive neural scaling laws, showing how final performance can be reliably predicted when scaling. One of our key findings is that scaling enhances performance on low-resource languages/dialects, helping to mitigate bias and improve the accessibility of speech technologies. Finally, we show how OWLS can be used to power new research directions by discovering emergent abilities in large-scale speech models.
Paperid:219
Authors:Youhe Jiang · Fangcheng Fu · Xiaozhe Yao · Guoliang HE · Xupeng Miao · Ana Klimovic · Bin Cui · Binhang Yuan · Eiko Yoneki
Abstract:
Recent advancements in Large Language Models (LLMs) have led to increasingly diverse requests, accompanied by varying resource (compute and memory) demands to serve them. However, this in turn degrades the cost-efficiency of LLM serving as common practices primarily rely on homogeneous GPU resources. In response to this problem, this work conducts a thorough study about serving LLMs over heterogeneous GPU resources on cloud platforms. The rationale is that different GPU types exhibit distinct compute and memory characteristics, aligning well with the divergent resource demands of diverse requests. Particularly, through comprehensive benchmarking, we discover that the cost-efficiency of LLM serving can be substantially optimized by meticulously determining GPU composition, deployment configurations, and workload assignments. Subsequently, we design a scheduling algorithm via mixed-integer linear programming, aiming at deducing the most cost-efficient serving plan under the constraints of price budget and real-time GPU availability. Remarkably, our approach effectively outperforms homogeneous and heterogeneous baselines under a wide array of scenarios, covering diverse workload traces, varying GPU availabilities, and multi-model serving. This casts new light on more accessible and efficient LLM serving over heterogeneous cloud resources.
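As a toy illustration of the underlying optimization (a brute-force stand-in, not the paper's mixed-integer linear program), one can exhaustively search GPU compositions that meet per-workload demand at minimum cost; the GPU names, prices, and throughput numbers below are entirely made up:

```python
from itertools import product

# Hypothetical catalog: hourly price and requests/sec per workload type,
# assuming each GPU instance is dedicated to a single workload type.
GPUS = {
    "A10": {"price": 1.0, "tput": {"short": 40, "long": 5}},
    "A100": {"price": 4.0, "tput": {"short": 90, "long": 30}},
}

def cheapest_for(workload, need, max_per_type=6):
    """Cheapest GPU counts serving `need` requests/sec of one workload type."""
    best, names = None, list(GPUS)
    for counts in product(range(max_per_type + 1), repeat=len(names)):
        tput = sum(c * GPUS[n]["tput"][workload] for n, c in zip(names, counts))
        if tput < need:
            continue
        price = sum(c * GPUS[n]["price"] for n, c in zip(names, counts))
        if best is None or price < best[0]:
            best = (price, dict(zip(names, counts)))
    return best

def serving_plan(demand, budget):
    """Combine per-workload optima; return None if the price budget is exceeded."""
    plan, total = {}, 0.0
    for w, need in demand.items():
        price, counts = cheapest_for(w, need)
        plan[w], total = counts, total + price
    return (plan, total) if total <= budget else None
```

A real scheduler must additionally handle shared GPUs, deployment configurations, and real-time availability, which is why the paper formulates the problem as an MILP rather than an exhaustive search.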
Paperid:220
Authors:Zheng Lian · Haoyu Chen · Lan Chen · Haiyang Sun · Licai Sun · Yong Ren · Zebang Cheng · Bin Liu · Rui Liu · Xiaojiang Peng · Jiangyan Yi · Jianhua Tao
Abstract:
The emergence of multimodal large language models (MLLMs) advances multimodal emotion recognition (MER) to the next level—from naive discriminative tasks to complex emotion understanding with advanced video understanding abilities and natural language description. However, the current community suffers from a lack of large-scale datasets with intensive, descriptive emotion annotations, as well as a multimodal-centric framework to maximize the potential of MLLMs for emotion understanding. To address this, we establish a new benchmark for MLLM-based emotion understanding with a novel dataset (MER-Caption), and a new model (AffectGPT). Utilizing our model-based crowd-sourcing data collection strategy, we construct by far the largest descriptive emotion dataset to date, featuring over 2K fine-grained emotion categories across 115K samples. We also introduce the AffectGPT model, designed with pre-fusion operations to enhance multimodal integration. Finally, we present MER-UniBench, a unified benchmark with evaluation metrics tailored for both typical MER tasks and the free-form, natural language output style of MLLMs. Extensive experimental results demonstrate AffectGPT's robust performance across various MER tasks. We are publicly releasing both the AffectGPT model and the MER-Caption dataset to foster further research and development in emotion understanding.
Paperid:221
Authors:Honghua Dong · Jiacheng Yang · Xun Deng · Yuhe Jiang · Gennady Pekhimenko · Fan Long · Xujie Si
Abstract:
Type inference for dynamic languages like Python is a persistent challenge in software engineering. While large language models (LLMs) have shown promise in code understanding, their type inference capabilities remain underexplored. We introduce TypyBench, a benchmark designed to evaluate LLMs' type inference across entire Python repositories. TypyBench features two novel metrics: TypeSim, which captures nuanced semantic relationships between predicted and ground truth types, and TypeCheck, which assesses type consistency across codebases. Our evaluation of various LLMs on a curated dataset of 50 high-quality Python repositories reveals that, although LLMs achieve decent TypeSim scores, they struggle with complex nested types and exhibit significant type consistency errors. These findings suggest that future research should shift focus from improving type similarity to addressing repository-level consistency. TypyBench provides a foundation for this new direction, offering insights into model performance across different type complexities and usage contexts.
Paperid:222
Authors:Max Ruiz Luyten · Antonin Berthon · M van der Schaar
Abstract:
Real-world human decision-making often relies on strategic planning, where high-level goals guide the formulation of sub-goals and subsequent actions, as evidenced by domains such as healthcare, business, and urban policy. Despite notable successes in controlled settings, conventional reinforcement learning (RL) follows a bottom-up paradigm, which can struggle to adapt to real-world complexities such as sparse rewards and limited exploration budgets. While methods like hierarchical RL and environment shaping provide partial solutions, they frequently rely on either ad-hoc designs (e.g., choosing the set of high-level actions) or purely data-driven discovery of high-level actions that still requires significant exploration. In this paper, we introduce a Top-Down framework for RL that explicitly leverages human-like strategy to reduce sample complexity, guide exploration, and enable high-level decision-making. We first formalize the Strategy Problem, which frames policy generation as finding distributions over policies that balance specificity and value. Building on this definition, we propose the Strategist agent, an iterative framework that leverages large language models (LLMs) to synthesize domain knowledge into a structured representation of actionable strategies and sub-goals. We further develop a reward shaping methodology that translates these strategies expressed in natural language into quantitative feedback for RL methods. Empirically, we demonstrate a significantly faster convergence than conventional PPO. Taken together, our findings highlight that top-down strategic exploration opens new avenues for enhancing RL on real-world decision problems.
Paperid:223
Authors:Jerry Yao-Chieh Hu · Bo-Yu Chen · Dennis Wu · Feng Ruan · Han Liu
Abstract:
We present a nonparametric interpretation for deep learning compatible modern Hopfield models and utilize this new perspective to debut efficient variants. Our key contribution stems from interpreting the memory storage and retrieval processes in modern Hopfield models as a nonparametric regression problem subject to a set of query-memory pairs. Interestingly, our framework not only recovers the known results from the original dense modern Hopfield model but also fills the void in the literature regarding efficient modern Hopfield models, by introducing \textit{sparse-structured} modern Hopfield models with sub-quadratic complexity. We establish that this sparse model inherits the appealing theoretical properties of its dense analogue --- connection with transformer attention, fixed point convergence and exponential memory capacity. Additionally, we showcase the versatility of our framework by constructing a family of modern Hopfield models as extensions, including linear, random masked, top-$K$ and positive random feature modern Hopfield models. Empirically, we validate our framework in both synthetic and realistic settings for memory retrieval and learning tasks.
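The retrieval dynamics of a dense modern Hopfield model, and the top-$K$ restriction mentioned above, can be sketched as a softmax update over stored patterns. This is an illustrative sketch of the standard update rule, not the paper's exact sparse-structured construction:

```python
import numpy as np

def hopfield_retrieve(memories, query, beta=8.0, top_k=None, steps=3):
    """Modern Hopfield retrieval: x <- M^T softmax(beta * M x).

    memories: (N, d) stored patterns, query: (d,) probe vector.
    With top_k set, the softmax is restricted to the k largest scores,
    a sparse variant that only touches k of the N stored patterns.
    """
    x = query.copy()
    for _ in range(steps):
        scores = beta * memories @ x
        if top_k is not None:
            masked = np.full_like(scores, -np.inf)
            idx = np.argpartition(scores, -top_k)[-top_k:]
            masked[idx] = scores[idx]
            scores = masked
        p = np.exp(scores - scores.max())   # numerically stable softmax
        p /= p.sum()
        x = memories.T @ p                  # convex combination of memories
    return x
```

The same update is the content of the connection with transformer attention: one retrieval step is exactly a single-query attention read over the memory matrix.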
Paperid:224
Authors:Tao Feng · Yexin Wu · Guanyu Lin · Jiaxuan You
Abstract:
World models (WMs) demonstrate strong capabilities in prediction, generation, and planning tasks. Existing WMs primarily focus on unstructured data and cannot leverage the ubiquitous structured data, often represented as graphs, in the digital world. While multiple graph foundation models have been proposed, they focus on graph learning tasks and cannot extend to diverse multimodal data and interdisciplinary tasks. To address these challenges, we propose the Graph World Model (GWM), a world model that supports both unstructured and graph-structured states with multi-modal information and represents diverse tasks as actions. The core of a GWM is a generic message-passing algorithm to aggregate structured information, either over a unified multi-modal token space by converting multi-modal data into text (GWM-T) or a unified multi-modal embedding space by modality-specific encoders (GWM-E). Notably, GWM introduces action nodes to support diverse tasks, where action nodes are linked to other nodes via direct reference or similarity computation. Extensive experiments on 6 tasks from diverse domains, including multi-modal generation and matching, recommendation, graph prediction, multi-agent, retrieval-augmented generation, and planning and optimization, show that the same GWM outperforms or matches domain-specific baselines' performance, benefits from multi-hop structures, and demonstrates strong zero-shot/few-shot capabilities on unseen new tasks.
Paperid:225
Authors:Anle Ke · Xu Zhang · Tong Chen · Ming Lu · Chao Zhou · Jiawen Gu · Zhan Ma
Abstract:
Existing multimodal large model-based image compression frameworks often rely on a fragmented integration of semantic retrieval, latent compression, and generative models, resulting in suboptimal performance in both reconstruction fidelity and coding efficiency. To address these challenges, we propose a residual-guided ultra-low-rate image compression framework named ResULIC, which incorporates residual signals into both semantic retrieval and the diffusion-based generation process. Specifically, we introduce Semantic Residual Coding (SRC) to capture the semantic disparity between the original image and its compressed latent representation. A perceptual fidelity optimizer is further applied for superior reconstruction quality. Additionally, we present the Compression-aware Diffusion Model (CDM), which establishes an optimal alignment between bitrates and diffusion time steps, improving compression-reconstruction synergy. Extensive experiments demonstrate the effectiveness of ResULIC, achieving superior objective and subjective performance compared to state-of-the-art diffusion-based methods, with -80.7\% and -66.3\% BD-rate savings in terms of LPIPS and FID, respectively.
Paperid:226
Authors:Jiachen Guo · Xiaoyu Xie · Chanwook Park · Hantao Zhang · Matthew Politis · Gino Domel · Jiachen Guo
Abstract:
Deep learning has been extensively employed as a powerful function approximator for modeling physics-based problems described by partial differential equations (PDEs). Despite their popularity, standard deep learning models often demand prohibitively large computational resources and yield limited accuracy when scaling to large-scale, high-dimensional physical problems. Their black-box nature further hinders their application in industrial problems where interpretability and high precision are critical. To overcome these challenges, this paper introduces Interpolation Neural Network-Tensor Decomposition (INN-TD), a scalable and interpretable framework that has the merits of both machine learning and finite element methods for modeling large-scale physical systems. By integrating locally supported interpolation functions from finite element methods into the network architecture, INN-TD achieves a sparse learning structure with enhanced accuracy, faster training/solving speed, and reduced memory footprint. This makes it particularly effective for tackling large-scale high-dimensional parametric PDEs in training, solving, and inverse optimization tasks in physical problems where high precision is required.
Paperid:227
Authors:Longlin Yu · Jiajun Zha · Tong Yang · Tianyu Xie · Xiangyu Zhang · Gary Chan · Cheng Zhang
Abstract:
Semi-implicit distributions have shown great promise in variational inference and generative modeling. Hierarchical semi-implicit models, which stack multiple semi-implicit layers, enhance the expressiveness of semi-implicit distributions and can be used to accelerate diffusion models given pretrained score networks. However, their sequential training often suffers from slow convergence. In this paper, we introduce CoSIM, a continuous semi-implicit model that extends hierarchical semi-implicit models into a continuous framework. By incorporating a continuous transition kernel, CoSIM enables efficient, simulation-free training. Furthermore, we show that CoSIM achieves consistency with a carefully designed transition kernel, introducing a novel training scheme for consistency models. Extensive experiments on image generation demonstrate that CoSIM performs on par or better than existing diffusion model acceleration methods, achieving superior performance on FD-DINOv2.
Paperid:228
Authors:Samuel Miserendino · Michele Wang · Tejal Patwardhan · Johannes Heidecke
Abstract:
We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks — ranging from $50 bug fixes to $32,000 feature implementations — and managerial tasks, where models choose between technical implementation proposals. Independent tasks are graded with end-to-end tests triple-verified by experienced software engineers, while managerial decisions are assessed against the choices of the original hired engineering managers. We evaluate model performance and find that frontier models are still unable to solve the majority of tasks. To facilitate future research, we open-source a unified Docker image and a public evaluation split. By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.
Paperid:229
Authors:Marco Cipriano · Moritz Feuerpfeil · Gerard de Melo
Abstract:
Scalable Vector Graphics (SVG) is a popular format on the web and in the design industry. However, despite the great strides made in generative modeling, SVG has remained underexplored due to the discrete and complex nature of such data. We introduce GRIMOIRE, a text-guided SVG generative model that is comprised of two modules: a Visual Shape Quantizer (VSQ) learns to map raster images onto a discrete codebook by reconstructing them as vector shapes, and an Auto-Regressive Transformer (ART) models the joint probability distribution over shape tokens, positions, and textual descriptions, allowing us to generate vector graphics from natural language. Unlike existing models that require direct supervision from SVG data, GRIMOIRE learns shape image patches using only raster image supervision, which opens up vector generative modeling to significantly more data. We demonstrate the effectiveness of our method by fitting GRIMOIRE for closed filled shapes on the MNIST and Emoji datasets, and for outline strokes on icon and font data, surpassing previous image-supervised methods in generative quality and vector-supervised approaches in flexibility.
Paperid:230
Authors:Hai Ci · Yiren Song · Pei Yang · Jinheng Xie · Mike Zheng Shou
Abstract:
Watermarking is essential for protecting the copyright of AI-generated images. We propose WMAdapter, a diffusion model watermark plugin that embeds user-specified watermark information seamlessly during the diffusion generation process. Unlike previous methods that modify diffusion modules to incorporate watermarks, WMAdapter is designed to keep all diffusion components intact, resulting in sharp, artifact-free images. To achieve this, we introduce two key innovations: (1) We develop a contextual adapter that conditions on the content of the cover image to generate adaptive watermark embeddings. (2) We implement an additional finetuning step and a hybrid finetuning strategy that suppresses noticeable artifacts while preserving the integrity of the diffusion components. Empirical results show that WMAdapter provides strong flexibility, superior image quality, and competitive watermark robustness.
Paperid:231
Authors:Jing Xu · Jiazheng Li · Jingzhao Zhang
Abstract:
Model merging offers an effective way to integrate the capabilities of multiple fine-tuned models. However, the performance degradation of the merged model remains a challenge, particularly when little or no data is available. This paper first highlights the necessity of domain-specific data for model merging by proving that data-agnostic algorithms can have arbitrarily bad worst-case performance. Building on this theoretical insight, we explore the relationship between model merging and distillation, introducing a novel few-shot merging algorithm, ProDistill (Progressive Layer-wise Distillation). Contrary to the common belief that layer-wise training hurts performance, we show that layer-wise teacher-student distillation not only enhances scalability but also improves model merging performance. We conduct extensive experiments to show that, compared to existing few-shot merging methods, ProDistill achieves state-of-the-art performance, with up to 6.14\% and 6.61\% improvements in vision and NLU tasks. Furthermore, we extend the experiments to models with over 10B parameters, showcasing the exceptional scalability of ProDistill.
Paperid:232
Authors:Sreyan Ghosh · Zhifeng Kong · Sonal Kumar · S Sakshi · Jaehyeon Kim · Wei Ping · Rafael Valle · Dinesh Manocha · Bryan Catanzaro
Abstract:
Understanding and reasoning over non-speech sounds and music are crucial for both humans and AI agents to interact effectively with their environments. In this paper, we introduce Audio Flamingo 2 (AF2), an Audio-Language Model (ALM) with advanced audio understanding and reasoning capabilities. AF2 leverages (i) a custom CLAP model, (ii) synthetic AQA data for fine-grained audio reasoning, and (iii) a multi-stage curriculum learning strategy. AF2 achieves state-of-the-art performance with only a 3B parameter small language model, surpassing large open-source and proprietary models across 20+ benchmarks. Next, for the first time, we extend audio understanding to long audio segments (30 secs - 5 mins) and propose LongAudio, a large and novel dataset for training ALMs on long audio captioning and question-answering tasks. Fine-tuning AF2 on LongAudio leads to exceptional performance on our proposed LongAudioBench, an expert-annotated benchmark for evaluating ALMs on long audio understanding capabilities. We conduct extensive ablation studies to confirm the efficacy of our approach. All code and data will be open-sourced. Demo: https://sites.google.com/view/audioflamingo2
Paperid:233
Authors:Yeyun Chen
Abstract:
Predicting the solvation free energy of molecules using graph neural networks holds significant potential for advancing drug discovery and the design of novel materials. While previous models have demonstrated success on independent and identically distributed (IID) datasets, their performance in out-of-distribution (OOD) scenarios remains largely unexplored. We propose a novel Relational Invariant Learning framework (RILOOD) to enhance OOD generalization in solvation free energy prediction. RILOOD comprises three key components: (i) a mixup-based conditional modeling module that integrates diverse environments, (ii) a novel multi-granularity refinement strategy that extends beyond core substructures to enable context-aware representation learning for capturing multi-level interactions, and (iii) an invariant learning module that identifies robust patterns generalizable to unseen environments. Extensive experiments demonstrate that RILOOD significantly outperforms state-of-the-art methods across various distribution shifts, highlighting its effectiveness in improving solvation free energy prediction under diverse conditions.
Paperid:234
Authors:Quoc Tran-Dinh
Abstract:
We develop two novel stochastic variance-reduction methods to approximate solutions of a class of nonmonotone [generalized] equations. Our algorithms leverage a new combination of ideas from the forward-reflected-backward splitting method and a class of unbiased variance-reduced estimators. We construct two new stochastic estimators within this class, inspired by the well-known SVRG and SAGA estimators. These estimators significantly differ from existing approaches used in minimax and variational inequality problems. By appropriately choosing parameters, both algorithms achieve state-of-the-art oracle complexity of $\mathcal{O}(n + n^{2/3} \epsilon^{-2})$ for obtaining an $\epsilon$-solution in terms of the operator residual norm for a class of nonmonotone problems, where $n$ is the number of summands and $\epsilon$ signifies the desired accuracy. This complexity aligns with the best-known results in SVRG and SAGA methods for stochastic nonconvex optimization. We test our algorithms on two numerical examples and compare them with existing methods. The results demonstrate promising improvements offered by the new methods compared to their competitors.
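The SVRG-style variance-reduced estimator referenced above can be illustrated on a toy finite sum of operators. The scalar operators and all names below are purely illustrative, not the paper's construction for nonmonotone generalized equations:

```python
def svrg_estimate(ops, i, x, x_ref, full_ref):
    """SVRG-style variance-reduced estimate of the full operator
    (1/n) sum_j G_j(x): v = G_i(x) - G_i(x_ref) + (1/n) sum_j G_j(x_ref)."""
    return ops[i](x) - ops[i](x_ref) + full_ref

# Toy finite sum of scalar operators G_i(x) = a_i * x
a = [1.0, 2.0, 3.0]
ops = [lambda x, ai=ai: ai * x for ai in a]

x_ref = 2.0                                            # snapshot/reference point
full_ref = sum(op(x_ref) for op in ops) / len(ops)     # (1/n) sum_j G_j(x_ref)

x = 1.0
estimates = [svrg_estimate(ops, i, x, x_ref, full_ref) for i in range(len(ops))]
# Unbiasedness: averaging over the sampled index recovers the full operator at x.
mean_est = sum(estimates) / len(estimates)
```

Averaging the per-index estimates reproduces the full operator value exactly, which is the unbiasedness property such estimators share with SVRG.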
Paperid:235
Authors:Runxi Cheng · Feng Xiong · Yongxian Wei · Wanyun Zhu · Chun Yuan
Abstract:
Model merging seeks to integrate task-specific expert models into a unified architecture while preserving multi-task generalization capabilities, yet parameter interference between constituent models frequently induces performance degradation. Although prior work has explored many merging strategies, resolving interference without additional data for retraining or test-time computation remains challenging. In this paper, we theoretically demonstrate that the task vectors of the linear layer constitute an approximate linear subspace for its corresponding input. Therefore, we can minimize interference under the guidance of task vectors. Based on this insight, we propose \textbf{WUDI-Merging} (\textbf{W}hoever started the interference sho\textbf{U}ld en\textbf{D} \textbf{I}t), a simple yet effective model merging method that eliminates interference without any additional data or rescaling coefficients. Comprehensive empirical evaluations across vision and language benchmarks demonstrate our method's superiority, achieving state-of-the-art performance in data-free model merging scenarios (average 10.9\% improvement versus baseline methods) while even outperforming mainstream test-time adaptation approaches by 3.3\%, all while requiring only minimal computational resources. The code will be publicly available soon.
Paperid:236
Authors:Naoki Nishikawa · Yujin Song · Kazusato Oko · Denny Wu · Taiji Suzuki
Abstract:
Pretrained transformers have demonstrated the ability to implement various algorithms at inference time without parameter updates. While theoretical works have established this capability through constructions and approximation guarantees, the optimization and statistical efficiency aspects remain understudied. In this work, we investigate how transformers learn features in-context -- a key mechanism underlying their inference-time adaptivity. We focus on the in-context learning of single-index models $y=\sigma_*(\langle\boldsymbol x,\boldsymbol\beta\rangle)$, which are low-dimensional nonlinear functions parameterized by feature vector $\boldsymbol\beta$. We prove that transformers pretrained by gradient-based optimization can perform *inference-time feature learning*, i.e., extract information of $\boldsymbol\beta$ solely from test prompts (despite $\boldsymbol\beta$ varying across different prompts), hence achieving inference-time statistical efficiency that surpasses any non-adaptive (fixed-basis) algorithms such as kernel methods. Moreover, we show that the inference-time sample complexity surpasses the Correlational Statistical Query (CSQ) lower bound, owing to nonlinear label transformations naturally induced by the self-attention mechanism.
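A single-index task of the form $y=\sigma_*(\langle x,\beta\rangle)$ can be sampled in a few lines. The ReLU link used here is an illustrative choice of $\sigma_*$, not necessarily the paper's setting:

```python
import math
import random

def sample_single_index_task(d, n, seed=0):
    """Draw a task-specific index beta on the unit sphere, then n examples
    y = sigma_*(<x, beta>) with x ~ N(0, I_d). sigma_* is ReLU here,
    purely for illustration."""
    rng = random.Random(seed)
    # Random direction on the unit sphere in R^d
    beta = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(b * b for b in beta))
    beta = [b / norm for b in beta]
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z = sum(xi * bi for xi, bi in zip(x, beta))
        data.append((x, max(0.0, z)))  # ReLU link
    return beta, data

beta, prompt = sample_single_index_task(d=5, n=10)
```

Each prompt drawn this way has its own $\beta$, which is the setup in which the model must extract the index from the test prompt alone.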
Paperid:237
Authors:Jiujia Zhang · Ashok Cutkosky
Abstract:
This paper addresses online learning with ``corrupted'' feedback. Our learner is provided with potentially corrupted gradients $\tilde g_t$ instead of the ``true'' gradients $g_t$. We make no assumptions about how the corruptions arise: they could be the result of outliers, mislabeled data, or even malicious interference. We focus on the difficult ``unconstrained'' setting in which our algorithm must maintain low regret with respect to any comparison point $u \in \mathbb{R}^d$. The unconstrained setting is significantly more challenging, as existing algorithms suffer extremely high regret even with very tiny amounts of corruption (which is not the case for a bounded domain). Our algorithms guarantee regret $\|u\|G(\sqrt{T} + k)$ when $G \ge \max_t \|g_t\|$ is known, where $k$ is a measure of the total amount of corruption. When $G$ is unknown, we incur an extra additive penalty of $(\|u\|^2 + G^2)k$.
Paperid:238
Authors:Xudong Gong · Sen Yang · Feng Dawei · Kele Xu · Bo Ding · Huaimin Wang · Yong Dou
Abstract:
This paper addresses the challenge of discontinuity in goal-achievement capabilities observed in Goal-conditioned Reinforcement Learning (GCRL) algorithms. Through a theoretical analysis, we identify that the reuse of successful trajectories or policies during training can aid in achieving goals adjacent to already-achievable goals. However, the policy discrepancy between achievable and adjacent goals must be carefully managed to avoid both overly trivial and excessively large differences, which can respectively hinder policy performance. To tackle this issue, we propose a margin-based policy self-regularization approach that optimizes the policy discrepancies between adjacent desired goals to a minimal acceptable threshold. This method can be integrated into popular GCRL algorithms, such as GC-SAC, HER, and GC-PPO. Systematic evaluations across two robotic arm control tasks and a complex fixed-wing aircraft control task demonstrate that our approach significantly improves the continuity of goal-achievement abilities of GCRL algorithms, thereby enhancing their overall performance.
Paperid:239
Authors:Paolo Pellizzoni · Simone Moretti · Francesco Silvestri
Abstract:
The weighted Euclidean norm $\lVert x\rVert_w$ of a vector $x\in \mathbb{R}^d$ with weights $w\in \mathbb{R}^d$ is the Euclidean norm where the contribution of each dimension is scaled by a given weight. Approaches to dimensionality reduction that satisfy the Johnson–Lindenstrauss (JL) lemma can be easily adapted to the weighted Euclidean distance if weights are known and fixed: it suffices to scale each dimension of the input vectors according to the weights, and then apply any standard approach. However, this is not the case when weights are unknown during the dimensionality reduction or might dynamically change. In this paper, we address this issue by providing an approach that maps vectors into a smaller complex vector space and allows us to retrieve a JL-like estimate for the weighted Euclidean distance once weights are revealed. Our results are based on the decomposition of the complex dimensionality reduction into several Rademacher chaos random variables, which are studied using novel concentration inequalities for sums of independent Rademacher chaoses.
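The fixed-weights reduction mentioned above (scale each coordinate by $\sqrt{w_i}$, then treat the result as an ordinary Euclidean vector) can be sketched directly. This is a minimal illustration of the known-weights case only, not the paper's complex-valued construction for unknown weights:

```python
import math

def weighted_norm(x, w):
    # ||x||_w = sqrt(sum_i w_i * x_i^2), assuming nonnegative weights
    return math.sqrt(sum(wi * xi * xi for wi, xi in zip(w, x)))

def euclidean_norm(x):
    return math.sqrt(sum(xi * xi for xi in x))

x = [3.0, 4.0, 1.0]
w = [4.0, 1.0, 9.0]

# Scaling each coordinate by sqrt(w_i) turns the weighted norm into a
# plain Euclidean norm; any standard JL transform then applies to `scaled`.
scaled = [math.sqrt(wi) * xi for wi, xi in zip(w, x)]
```

The point of the paper is precisely that this trick is unavailable when $w$ is revealed only after the dimensionality reduction has been performed.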
Paperid:240
Authors:Nuoya Xiong · Aarti Singh
Abstract:
Reinforcement Learning with Human Feedback (RLHF) is a widely used fine-tuning approach that aligns machine learning models, particularly Language Models (LMs), with human preferences. There are typically multiple objectives driving the preference, hence humans find it easier to express per-objective comparisons rather than a global preference between two choices, e.g., comparing two papers on their novelty, clarity, correctness, etc. Multi-Objective RLHF aims to use per-objective preference feedback and achieve a Pareto optimal tradeoff among these objectives by aggregating them into a single unified objective for optimization. However, nearly all prior works rely on linear aggregation, which rules out policies that favor specific objectives such as the worst one. The only existing approach using non-linear aggregation is computationally expensive due to its reward-based nature and the need for retraining whenever the aggregation parameters change. In this work, we address this limitation by transforming the non-linear aggregation maximization problem into a series of sub-problems. Each sub-problem involves only linear aggregation, making it computationally efficient to solve. We further extend our framework to handle multi-group scenarios, where each group has distinct weights for the objectives. Our method enables achieving consensus or maximizing the aggregated objective across all groups. Theoretically, we demonstrate that our algorithmic framework achieves sublinear regret and can be easily adapted to a reward-free algorithm. Empirically, leveraging our theoretical insights, we propose a nearly training-free algorithm once the optimal policies for individual objectives are obtained.
Paperid:241
Authors:Giulio Starace · Oliver Jaffe · Dane Sherburn · James Aung · Jun Shern Chan · Leon Maksin · Rachel Dias · Evan Mays · Benjamin Kinsella · Wyatt Thompson · Johannes Heidecke · Amelia Glaese · Tejal Patwardhan
Abstract:
We introduce PaperBench, a benchmark for evaluating AI agents' ability to replicate state-of-the-art machine learning research. We curate 18 Spotlight and Oral papers from ICML 2024 to build a dataset of papers representing the current frontier of machine learning (ML) research. Replicating these papers requires real-world ML engineering skills such as understanding the literature, preparing datasets, training models, and running experiments. We manually create rubrics with the author(s) of each paper, which hierarchically decompose each replication task into smaller sub-tasks along with clear grading criteria. We develop an LLM-based judge to grade replication attempts against rubrics, and assess the judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing agent -- Claude Sonnet 3.5 with open-source scaffolding -- achieves a replication score of 14.1\%. Finally, we recruit ML PhDs to attempt a 5-paper subset of PaperBench, finding that human participants achieve 22.0\%, compared to Claude 3.5 Sonnet which scores 11.0\% on this subset. We open-source our code to facilitate future research in understanding the ML engineering capabilities of AI agents.
Paperid:242
Authors:Deqian Kong · Minglu Zhao · Dehong Xu · Bo Pang · Shu Wang · Edouardo Honig · Zhangzhang Si · Chuan Li · Jianwen Xie · Sirui Xie · Ying Nian Wu
Abstract:
We propose a novel family of language models, Latent Thought Language Models (LTMs), which incorporate explicit latent thought vectors that guide the autoregressive generation of ground tokens through a Transformer decoder. Training employs a dual-rate optimization process within the classical variational Bayes framework: fast learning of local variational parameters for the posterior distribution of latent vectors, and slow learning of global decoder parameters. Empirical studies reveal that LTMs possess additional scaling dimensions beyond traditional LLMs, yielding a structured design space. Higher sample efficiency can be achieved by increasing training compute per token, with further gains possible by trading model size for more inference steps. Designed based on these scaling properties, LTMs demonstrate superior sample and parameter efficiency compared to conventional autoregressive models and discrete diffusion models. They significantly outperform these counterparts in validation perplexity and zero-shot language modeling. Additionally, LTMs exhibit emergent few-shot in-context reasoning capabilities that scale with model and latent size, and achieve competitive performance in unconditional and conditional text generation.
Paperid:243
Authors:Viacheslav Meshchaninov · Pavel Strashnov · Andrey Shevtsov · Fedor Nikolaev · Nikita Ivanisenko · Olga Kardymon · Dmitry Vetrov
Abstract:
Protein sequence design has seen significant advances through discrete diffusion and autoregressive approaches, yet the potential of continuous diffusion remains underexplored. Here, we present DiMA, a latent diffusion framework that operates on protein language model representations. Through systematic exploration of architectural choices and diffusion components, we develop a robust methodology that generalizes across multiple protein encoders ranging from 8M to 3B parameters. We demonstrate that our framework achieves consistently high performance across sequence-only (ESM-2, ESMc), dual-decodable (CHEAP), and multimodal (SaProt) representations using the same architecture and training approach. We conduct extensive evaluation of existing methods alongside DiMA using multiple metrics across two protein modalities, covering quality, diversity, novelty, and distribution matching of generated proteins. DiMA consistently produces novel, high-quality and diverse protein sequences and achieves strong results compared to baselines such as autoregressive, discrete diffusion and flow matching language models. The model demonstrates versatile functionality, supporting conditional generation tasks including protein family generation, motif scaffolding and infilling, and fold-specific sequence design, despite being trained solely on sequence data. This work provides a universal continuous diffusion framework for protein sequence generation, offering both architectural insights and practical applicability across various protein design scenarios.
Paperid:244
Authors:Yushu Shi · Chao Huang · Jie Wen · Wei Wang · Yong Xu · Xiaochun Cao
Abstract:
With advancements in visual language models (VLMs) and large language models (LLMs), video anomaly detection (VAD) has progressed beyond binary classification to fine-grained categorization and multidimensional analysis. However, existing methods focus mainly on coarse-grained detection, lacking anomaly explanations. To address these challenges, we propose Ex-VAD, an Explainable Fine-grained Video Anomaly Detection approach that combines fine-grained classification with detailed explanations of anomalies. First, we use a VLM to extract frame-level captions, and an LLM converts them to video-level explanations, enhancing the model's explainability. Second, integrating textual explanations of anomalies with visual information greatly enhances the model's anomaly detection capability. Finally, we apply label-enhanced alignment to optimize feature fusion, enabling precise fine-grained detection. Extensive experiments on the UCF-Crime and XD-Violence datasets demonstrate that Ex-VAD significantly outperforms existing state-of-the-art methods and enhances model transparency with detailed explanatory texts.
Paperid:245
Authors:Hao Yu · Weixuan Liang · KE LIANG · Suyuan Liu · Meng Liu · Xinwang Liu
Abstract:
Multi-kernel clustering (MKC) has emerged as a powerful method for capturing diverse data patterns, offering robust and generalized representations of data structures. However, the increasing deployment of MKC in real-world applications raises concerns about its vulnerability to adversarial perturbations. While adversarial robustness has been extensively studied in other domains, its impact on MKC remains largely unexplored. In this paper, we address the challenge of assessing the adversarial robustness of MKC methods in a black-box setting. Specifically, we propose AdvMKC, a novel reinforcement-learning-based adversarial attack framework designed to inject imperceptible perturbations into data and mislead MKC methods. AdvMKC leverages proximal policy optimization with an advantage function to overcome the instability of clustering results during optimization. Additionally, it introduces a generator-clusterer framework, where a generator produces adversarial perturbations, and a clusterer approximates MKC behavior, significantly reducing computational overhead. We provide theoretical insights into the impact of adversarial perturbations on MKC and validate these findings through experiments. Evaluations across seven datasets and eleven MKC methods (seven traditional and four robust) demonstrate AdvMKC's effectiveness, robustness, and transferability.
Paperid:246
Authors:Muleilan Pei · Shaoshuai Shi · Lu Zhang · Peiliang Li · Shaojie Shen
Abstract:
Trajectory prediction for surrounding agents is a challenging task in autonomous driving due to its inherent uncertainty and underlying multimodality. Unlike prevailing data-driven methods that primarily rely on supervised learning, in this paper, we introduce a novel Graph-oriented Inverse Reinforcement Learning (GoIRL) framework, which is an IRL-based predictor equipped with vectorized context representations. We develop a feature adaptor to effectively aggregate lane-graph features into grid space, enabling seamless integration with the maximum entropy IRL paradigm to infer the reward distribution and obtain the policy that can be sampled to induce multiple plausible plans. Furthermore, conditioned on the sampled plans, we implement a hierarchical parameterized trajectory generator with a refinement module to enhance prediction accuracy and a probability fusion strategy to boost prediction confidence. Extensive experimental results showcase that our approach not only achieves state-of-the-art performance on the large-scale Argoverse & nuScenes motion forecasting benchmarks but also exhibits superior generalization abilities compared to existing supervised models.
Paperid:247
Authors:Pierre C Bellec · Takuya Koriyama
Abstract:
This paper studies phase transitions for the existence of unregularized M-estimators under proportional asymptotics, where the sample size $n$ and feature dimension $p$ grow proportionally with $n/p \to \delta \in (1, \infty)$. We study the existence of M-estimators in single-index models where the response $y_i$ depends on covariates $x_i \sim N(0, I_p)$ through an unknown index ${w} \in \mathbb{R}^p$ and an unknown link function. An explicit expression is derived for the critical threshold $\delta_\infty$ that determines the phase transition for the existence of the M-estimator, generalizing the results of Candès & Sur (2020) for binary logistic regression to other single-index models. Furthermore, we investigate the existence of a solution to the nonlinear system of equations governing the asymptotic behavior of the M-estimator when it exists. The existence of a solution to this system for $\delta > \delta_\infty$ remains largely unproven outside the global null in binary logistic regression. We address this gap with a proof that the system admits a solution if and only if $\delta > \delta_\infty$, providing a comprehensive theoretical foundation for proportional asymptotic results that require as a prerequisite the existence of a solution to the system.
Paperid:248
Authors:Gábor Pituk · Vik Shirvaikar · Tom Rainforth
Abstract:
We empirically investigate how well popular approximate inference algorithms for Bayesian Neural Networks (BNNs) respect the theoretical properties of Bayesian belief updating. We find strong evidence on synthetic regression and real-world image classification tasks that common BNN algorithms such as variational inference, Laplace approximation, SWAG, and SGLD fail to update in a consistent manner, forget about old data under sequential updates, and violate the predictive coherence properties that would be expected of Bayesian methods. These observed behaviors imply that care should be taken when treating BNNs as true Bayesian models, particularly when using them beyond static prediction settings, such as for active, continual, or transfer learning.
Paperid:249
Authors:Zehan Wang · Ziang Zhang · Tianyu Pang · Chao Du · Hengshuang Zhao · Zhou Zhao
Abstract:
Orientation is a fundamental attribute of objects, essential for understanding their spatial pose and arrangement. However, practical solutions for estimating the orientation of open-world objects in monocular images remain underexplored. In this work, we introduce Orient Anything, the first foundation model for zero-shot object orientation estimation. A key challenge in this task is the scarcity of orientation annotations for open-world objects. To address this, we propose leveraging the vast resources of 3D models. By developing a pipeline to annotate the front face of 3D objects and render them from random viewpoints, we curate 2 million images with precise orientation annotations across a wide variety of object categories. To fully leverage the dataset, we design a robust training objective that models the 3D orientation as probability distributions over three angles and predicts the object orientation by fitting these distributions. Besides, we propose several strategies to further enhance the synthetic-to-real transfer. Our model achieves state-of-the-art orientation estimation accuracy on both rendered and real images, demonstrating impressive zero-shot capabilities across various scenarios. Furthermore, it shows great potential in enhancing high-level applications, such as understanding complex spatial concepts in images and adjusting 3D object pose.
Paperid:250
Authors:Chi Zhang · REN Lianhai · Jingpu Cheng · Qianxiao Li
Abstract:
The LoRA method has achieved notable success in reducing GPU memory usage by applying low-rank updates to weight matrices. Yet, one simple question remains: can we push this reduction even further? Furthermore, is it possible to achieve this while improving performance and reducing computation time? Answering these questions requires moving beyond the conventional weight-centric approach. In this paper, we present a state-based fine-tuning framework that shifts the focus from weight adaptation to optimizing forward states, with LoRA acting as a special example. Specifically, state-based tuning introduces parameterized perturbations to the states within the computational graph, allowing us to control states across an entire residual block. A key advantage of this approach is the potential to avoid storing large intermediate states in models like transformers. Empirical results across multiple architectures—including ViT, RoBERTa, LLaMA2-7B, and LLaMA3-8B—show that our method further reduces memory consumption and computation time while simultaneously improving performance. Moreover, as a result of memory reduction, we explore the feasibility of training 7B/8B models on consumer-level GPUs like the Nvidia 3090, without model quantization. The code is available at an anonymous GitHub repository.
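The contrast between weight-centric (LoRA-style) updates and perturbing forward states directly can be sketched for a single linear layer. All names, shapes, and the toy adapter below are illustrative assumptions, not the paper's implementation:

```python
def linear(W, x):
    # y = W x for a weight matrix given as a list of rows
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def lora_forward(W, A, B, x):
    # Weight-centric: y = (W + B A) x, realized as Wx + B(Ax)
    h = linear(W, x)
    z = linear(A, x)       # low-rank down-projection
    delta = linear(B, z)   # up-projection back to the output dimension
    return [hi + di for hi, di in zip(h, delta)]

def state_forward(W, adapter, x):
    # State-centric: the frozen layer runs unchanged, and a small
    # parameterized perturbation is applied to its output state.
    h = linear(W, x)
    delta = adapter(h)
    return [hi + di for hi, di in zip(h, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (identity, for illustration)
A = [[1.0, 1.0]]               # rank-1 down-projection
B = [[0.5], [0.5]]             # rank-1 up-projection
x = [1.0, 2.0]

y_lora = lora_forward(W, A, B, x)
y_state = state_forward(W, lambda h: [0.1 * hi for hi in h], x)
```

In the state-centric form the perturbation depends only on the layer's output state, which hints at why such a scheme can avoid storing some intermediate activations that weight-centric updates require.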
Paperid:251
Authors:Niv Buchbinder · Roie Levin · Yue Yang
Abstract:
In *fully-dynamic consistent clustering*, we are given a finite metric space $(M,d)$, and a set $F\subseteq M$ of possible locations for opening centers. Data points arrive and depart, and the goal is to maintain an approximately optimal clustering solution at all times while minimizing the *recourse*, the total number of additions/deletions of centers over time. Specifically, we study fully dynamic versions of the classical $k$-center, facility location, and $k$-median problems. We design algorithms that, given a parameter $\beta\geq 1$, maintain an $O(\beta)$-approximate solution at all times, and whose total recourse is bounded by $O(\log |F| \log \Delta) \cdot OPT_{rec}^{\beta}$. Here $OPT_{rec}^{\beta}$ is the minimal recourse of an offline algorithm that maintains a $\beta$-approximate solution at all times, and $\Delta$ is the metric aspect ratio. We obtain our results via a reduction to the recently proposed *Positive Body Chasing* framework of [Bhattacharya, Buchbinder, Levin, Saranurak, FOCS 2023], which we show gives fractional solutions to our clustering problems online. Our contribution is to round these fractional solutions while preserving the approximation and recourse guarantees. We complement our positive results with logarithmic lower bounds which show that our bounds are nearly tight.
Paperid:252
Authors:Changsheng Wang · Yihua Zhang · jinghan jia · Parikshit Ram · Dennis Wei · Yuguang Yao · Soumyadeep Pal · Nathalie Baracaldo · Sijia Liu
Abstract:
Machine unlearning presents a promising approach to mitigating privacy and safety concerns in large language models (LLMs) by enabling the selective removal of targeted data or knowledge while preserving model utility. However, existing unlearning methods remain oversensitive to downstream fine-tuning, which can rapidly recover what is supposed to be unlearned information even when the fine-tuning task is entirely unrelated to the unlearning objective. To enhance robustness, we introduce the concept of `invariance' into unlearning for the first time from the perspective of invariant risk minimization (IRM), a principle for environment-agnostic training. By leveraging IRM, we develop a new invariance-regularized LLM unlearning framework, termed invariant LLM unlearning (ILU). We show that the proposed invariance regularization, even using only a single fine-tuning dataset during ILU training, can enable unlearning robustness to generalize effectively across diverse and new fine-tuning tasks at test time. A task vector analysis is also provided to further elucidate the rationale behind ILU's effectiveness. Extensive experiments on the WMDP benchmark, which focuses on removing an LLM's hazardous knowledge generation capabilities, reveal that ILU significantly outperforms state-of-the-art unlearning methods, including negative preference optimization (NPO) and representation misdirection for unlearning (RMU). Notably, ILU achieves superior unlearning robustness across diverse downstream fine-tuning scenarios (e.g., math, paraphrase detection, and sentiment analysis) while preserving the fine-tuning performance.
Paperid:253
Authors:Senyu Han · Hongchuan Zeng · Kai Yu · Lu Chen
Abstract:
Large language models (LLMs) consist of numerous Transformer modules, and while the models can perform various functions, it remains an open question how these modules are combined to elicit distinct inherent functionalities. In this paper, we investigate the modules inside LLMs and demonstrate that, by simply masking or retaining specific attention heads during inference, LLMs can exhibit specific task functionalities without requiring explicit instructions or modifications to the model parameters. Experiments across various models and tasks reveal that LLMs inherently encode ``functional pathways'', the structured groups of interdependent attention heads that are crucial for executing specific tasks. These pathways not only govern the model's functional behaviors but also enhance parameter efficiency, as suppressing attention heads outside the pathway can improve task performance. The code is available in this repository: Anonymous GitHub.
Paperid:254
Authors:Ke Fan · Jinpeng Zhang · Xuefeng Zhang · Yunze Wu · Jingyu Cao · Yuan Zhou · Jianzhu Ma
Abstract:
Motivated by the first priority of safety in many real-world applications, we propose \textsc{MaxSafe}, a chance-constrained bi-level optimization framework for safe reinforcement learning. \textsc{MaxSafe} first minimizes the unsafe probability and then maximizes the return among the safest policies. We provide a tailored Q-learning algorithm for the \textsc{MaxSafe} objective, featuring a novel learning process for \emph{optimal action masks} with theoretical convergence guarantees. To enable the application of our algorithm to large-scale experiments, we introduce two key techniques: \emph{safety polarization} and \emph{safety prioritized experience replay}. Safety polarization generalizes the optimal action masking by polarizing the Q-function, which assigns low values to unsafe state-action pairs, effectively discouraging their selection. In parallel, safety prioritized experience replay enhances the learning of optimal action masks by prioritizing samples based on temporal-difference (TD) errors derived from our proposed state-action reachability estimation functions. This approach efficiently addresses the challenges posed by sparse cost signals. Experiments on diverse autonomous driving and safe control tasks show that our methods achieve near-maximal safety and an optimal reward-safety trade-off.
Paperid:255
Authors:Yin Lu · Tong He · Xuening Zhu · David Wipf
Abstract:
Is there really much more to say about sparse autoencoders (SAEs)? Autoencoders in general, and SAEs in particular, represent deep architectures that are capable of modeling low-dimensional latent structure in data. Such structure could reflect, among other things, correlation patterns in large language model activations, or complex natural image manifolds. And yet despite the wide-ranging applicability, there have been relatively few changes to SAEs beyond the original recipe from decades ago, namely, standard deep encoder/decoder layers trained with a classical/deterministic sparse regularizer applied within the latent space. One possible exception is the variational autoencoder (VAE), which adopts a stochastic encoder module capable of producing sparse representations when applied to manifold data. In this work we formalize underappreciated weaknesses with both canonical SAEs, as well as analogous VAEs applied to similar tasks, and propose a hybrid alternative model that circumvents these prior limitations. In terms of theoretical support, we prove that global minima of our proposed model recover certain forms of structured data spread across a union of manifolds. Meanwhile, empirical evaluations on synthetic and real-world datasets substantiate the efficacy of our approach in accurately estimating underlying manifold dimensions and producing sparser latent representations without compromising reconstruction error. In all cases, we are able to exceed the performance of equivalent-capacity SAEs and VAEs, as well as recent diffusion models where applicable.
Paperid:256
Authors:Andrea Cini · Alexander Jenkins · Danilo Mandic · Cesare Alippi · Filippo Maria Bianchi
Abstract:
We address the problem of uncertainty quantification in time series forecasting by exploiting observations from correlated sequences. Relational deep learning methods leveraging graph representations are among the most effective tools for obtaining point estimates from spatiotemporal data and correlated time series. However, the problem of exploiting relational structures to estimate the uncertainty of such predictions has been largely overlooked in the same context. To this end, we propose a novel distribution-free approach based on the conformal prediction framework and quantile regression. Despite the recent applications of conformal prediction to sequential data, existing methods operate independently on each target time series and do not account for relationships among them when constructing the prediction interval. We fill this void by introducing a novel conformal prediction method based on graph deep learning operators. Our method, named Conformal Relational Prediction (CoRel), does not require the relational structure (graph) to be known a priori and can be applied on top of any pre-trained time series predictor. Additionally, CoRel includes an adaptive component to handle non-exchangeable data and changes in the input time series. Our approach provides accurate coverage and achieves state-of-the-art uncertainty quantification on relevant benchmarks.
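For background, the conformal prediction framework the abstract builds on turns point forecasts into calibrated intervals. A generic split-conformal sketch (a textbook baseline, not CoRel itself, and without its relational or adaptive components):

```python
import numpy as np

def split_conformal_interval(cal_resid, y_pred, alpha=0.1):
    """Split conformal prediction: given absolute residuals on a held-out
    calibration set, widen point forecasts y_pred into intervals with
    (1 - alpha) marginal coverage, assuming exchangeability."""
    n = len(cal_resid)
    # Finite-sample-corrected quantile level for the conformity scores.
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(cal_resid, q_level)
    return y_pred - q, y_pred + q

rng = np.random.default_rng(0)
cal_resid = np.abs(rng.normal(0, 1, size=500))   # toy calibration residuals
lo, hi = split_conformal_interval(cal_resid, np.array([2.0, -1.0]), alpha=0.1)
```

The interval half-width `q` is the same for every test point here; methods like CoRel instead share information across related series to tighten it.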
Paperid:257
Authors:Song Bian · Minghao Yan · Shivaram Venkataraman
Abstract:
Scaling laws are powerful tools to predict the performance of large language models. However, current scaling laws fall short of accounting for inference costs. In this work, we first show that model architecture affects inference latency, where models of the same size can have up to $3.5\times$ difference in latency. To tackle this challenge, we modify the Chinchilla scaling laws to co-optimize the model parameter count, the number of training tokens, and the model architecture. Because models with similar training loss can exhibit gaps in downstream evaluation, we also propose a novel method to train inference-efficient models based on the revised scaling laws. We perform extensive empirical studies to fit and evaluate our inference-aware scaling laws. We vary model parameters from 80M to 1B, training tokens from 1.6B to 30B, and model shapes, training a total of 63 models. Guided by our inference-efficient scaling law and model selection method, we release the Morph-1B model, which improves inference latency by $1.8\times$ while maintaining accuracy on downstream tasks compared to open-source models, pushing the Pareto frontier of accuracy-latency tradeoff.
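For context, the Chinchilla-style parametric law that the abstract modifies has the form $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$. The sketch below uses the coefficients from the original Chinchilla fit (Hoffmann et al., 2022) purely for illustration; the paper's revised, inference-aware law additionally models architecture and latency, which this sketch omits.

```python
# Chinchilla-style scaling law: L(N, D) = E + A / N**alpha + B / D**beta,
# where N is parameter count and D is training tokens. Coefficients are the
# original Chinchilla fit (Hoffmann et al., 2022), shown only for illustration.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def best_param_count(C, candidates):
    """Allocate a compute budget C ~ 6*N*D across candidate model sizes and
    return the size minimizing predicted loss (no inference-cost term here)."""
    return min(candidates, key=lambda N: loss(N, C / (6 * N)))

N_star = best_param_count(1e21, [1e8, 4e8, 1.6e9, 6.4e9, 2.56e10])
```

The paper's contribution is, in effect, to replace the unconstrained `min` above with one that also accounts for the latency of each candidate architecture.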
Paperid:258
Authors:Philippe Chlenski · Quentin Chu · Raiyan Khan · Kaizhu Du · Antonio Moretti · Itsik Pe'er
Abstract:
Decision trees (DTs) and their random forest (RF) extensions are workhorses of classification and regression in Euclidean spaces. However, algorithms for learning in non-Euclidean spaces are still limited. We extend DT and RF algorithms to product manifolds: Cartesian products of several hyperbolic, hyperspherical, or Euclidean components. Such manifolds handle heterogeneous curvature while still factorizing neatly into simpler components, making them compelling embedding spaces for complex datasets. Our novel angular reformulation of DTs respects the geometry of the product manifold, yielding splits that are geodesically convex, maximum-margin, and composable. In the special cases of single-component manifolds, our method simplifies to its Euclidean or hyperbolic counterparts, or introduces hyperspherical DT algorithms, depending on the curvature. We benchmark our method on various classification, regression, and link prediction tasks on synthetic data, graph embeddings, mixed-curvature variational autoencoder latent spaces, and empirical data. Compared to 7 other classifiers, product RFs ranked first on 25 out of 57 benchmarks, and placed in the top 2 for 46 out of 57. This highlights the value of product RFs as straightforward yet powerful new tools for data analysis in product manifolds.
Paperid:259
Authors:Yongqiang Chen · QUANMING YAO · Juzheng Zhang · James Cheng · Yatao Bian
Abstract:
Recent years have witnessed growing interest in extending the capabilities of large language models (LLMs) to molecular science. As LLMs are predominantly trained on text data, most existing approaches adopt a graph neural network to represent a molecule as a series of node tokens for molecule-language alignment, which, however, overlooks the inherent hierarchical structures in molecules. Notably, high-order molecular structures contain rich semantics of functional groups, which encode crucial biochemical functionalities of the molecules. We show that neglecting the hierarchical information in tokenization leads to subpar molecule-language alignment and severe hallucination. To address this limitation, we propose a novel strategy called HIerarchical GrapH Tokenization (HIGHT). HIGHT employs a hierarchical graph tokenizer that encodes the hierarchy of atom-, motif-, and molecular-level informative tokens to improve the molecular perception of LLMs. HIGHT also adopts an augmented instruction tuning dataset, enriched with the hierarchical graph information, to further enhance the molecule-language alignment. Extensive experiments on 14 real-world benchmarks verify the effectiveness of HIGHT in reducing hallucination by 40% and achieving significant improvements in various molecule-language downstream tasks.
Paperid:260
Authors:Anvith Thudi · Evianne Rovers · Yangjun Ruan · Tristan Thrush · Chris Maddison
Abstract:
Modern machine learning pipelines are increasingly combining and mixing data from diverse and disparate sources, e.g., pretraining large language models. Yet, finding the optimal data mixture is a challenging and open problem. We formalize this data mixing problem as a bi-level objective: the best mixture is the one that would lead to the best model for a downstream objective. Unfortunately, this objective is generally intractable. In this paper, we make the observation that the bi-level data mixing objective becomes convex as our model class becomes larger. We develop and study a gradient-based approach for optimizing this convex objective, which we call MixMin, and test it on language modeling and chemistry tasks. MixMin was the only method that uniformly improved the data mixture in all our experiments. With MixMin, we improved the data mixture using less than 0.2% additional compute for a pythia-$410M$ model trained on $8.2B$ tokens, resulting in a 1-5% relative improvement in negative log-likelihood on PIQA, ARC Easy, SciQ, and OpenWebMath. Crucially, we found that MixMin mixtures for smaller models improved training of larger models, suggesting that MixMin mixtures may be scale-invariant. When mixing bioassay data to train an XGBoost model, we saw improvements to average precision scores of $0.03-0.15$.
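The mixture weights in such a convex data-mixing objective live on the probability simplex. A generic exponentiated-gradient sketch for minimizing a convex objective over the simplex (a standard solver for this problem class, not MixMin's exact recipe, and with a toy quadratic objective standing in for the real one):

```python
import numpy as np

def exp_grad_simplex(grad_f, w0, lr=0.3, steps=500):
    """Exponentiated-gradient descent: the multiplicative update keeps the
    mixture weights w nonnegative and summing to one at every step."""
    w = w0.copy()
    for _ in range(steps):
        w = w * np.exp(-lr * grad_f(w))  # multiplicative (mirror-descent) step
        w /= w.sum()                     # renormalize onto the simplex
    return w

# Toy convex objective: squared distance to a known target mixture.
target = np.array([0.5, 0.3, 0.2])
grad = lambda w: 2.0 * (w - target)
w_opt = exp_grad_simplex(grad, np.ones(3) / 3)
```

In a real pipeline, `grad_f` would come from per-source model losses rather than a closed-form target.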
Paperid:261
Authors:Jixian Zhou · Fang DONG · Ruijun Huang · Hengjie Cao · Mengyi Chen · Yifeng Yang · Anrui Chen · Mingzhi Dong · Yujiang Wang · Dongsheng Li · David Clifton · Qin Lv · Rui Zhu · Chun Zhang · Fan Yang · Tun Lu · Ning Gu · Li Shang
Abstract:
Mixture-of-Experts (MoE) is widely adopted to deploy Large Language Models (LLMs) on edge devices with limited memory budgets. Although MoE is, in theory, an inherently memory-friendly architecture requiring only a few activated experts to reside in memory for inference, current MoE architectures cannot effectively realize this advantage and yield intolerable inference latencies on memory-constrained devices. Our investigation pinpoints the essential cause as the remarkable temporal inconsistency of inter-token expert activations, which generates overly frequent expert swapping demands that dominate the latencies. To this end, we propose a novel MoE architecture, Oracle-MoE, to fulfill the real on-device potential of MoE-based LLMs. Oracle-MoE routes tokens in a highly compact space suggested by attention scores, termed the oracle space, to effectively maintain semantic locality across consecutive tokens and reduce expert activation variations, eliminating massive swapping demands. Theoretical analysis proves that Oracle-MoE is bound to provide routing decisions with better semantic locality and, therefore, better expert activation consistency. Experiments on pretrained GPT-2 architectures of different sizes (200M, 350M, 790M, and 2B) and downstream tasks demonstrate that, without compromising task performance, Oracle-MoE achieves state-of-the-art inference speeds across varying memory budgets, revealing its substantial potential for LLM deployment in industry.
Paperid:262
Authors:Lele Fu · Bowen Deng · Sheng Huang · Tianchi Liao · Shirui Pan · Chuan Chen
Abstract:
Federated graph learning (FGL) aims to collaboratively train a global graph neural network (GNN) on multiple private graphs while preserving local data privacy. Besides the common data heterogeneity of conventional federated learning, FGL faces the unique challenge of topology heterogeneity. Most existing FGL methods alleviate the negative impact of heterogeneity by introducing global signals. However, the ways such global increments are created may be ineffective and significantly increase the computation cost. In light of this, we propose FedATH, an FGL method for Alleviating Topology Heterogeneity from a causal perspective. Inspired by causal theory, we argue that not all edges in a topology are necessary for the training objective; less topology information may in fact be more useful. With the aid of an edge evaluator, the local graphs are divided into causal and biased subgraphs. A dual-GNN architecture encodes the two subgraphs into corresponding representations, so that the causal representations are drawn closer to the training objective while the biased representations are pulled away from it. Further, the Hilbert-Schmidt Independence Criterion is employed to strengthen the separability of the two subgraphs. Extensive experiments on six real-world graph datasets demonstrate the superiority of the proposed FedATH over the compared approaches.
Paperid:263
Authors:Eric Zhao · Pranjal Awasthi · Sreenivas Gollapudi
Abstract:
Sampling-based search, a simple paradigm for utilizing test-time compute, involves generating multiple candidate responses and selecting the best one---typically by verifying each response for correctness. In this paper, we study the scaling trends governing sampling-based search. Among our findings is that simply scaling up a minimalist implementation that uses only random sampling and direct self-verification results in sustained performance improvements that, for example, elevate the Gemini v1.5 Pro model's reasoning capabilities past that of o1-Preview on popular benchmarks. We partially attribute the scalability of sampling-based search to a phenomenon of implicit scaling, where sampling a larger pool of responses in turn improves verification accuracy. We further identify two useful principles for improving self-verification capabilities with test-time compute: (1) comparing across responses provides helpful signals about the locations of errors and hallucinations, and (2) different model output styles are useful for different contexts---chains of thought are useful for reasoning but harder to verify. We also find that, though accurate verification can be elicited, frontier models demonstrate remarkably weak out-of-the-box verification capabilities and introduce a benchmark to measure progress on these deficiencies.
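The minimalist implementation the abstract describes reduces to "sample k candidates, score each with a verifier, return the best." A toy sketch with entirely hypothetical interfaces (integers standing in for model responses, a hand-written scorer standing in for self-verification):

```python
import random

def sample_and_verify(generate, verify, prompt, k=16, seed=0):
    """Minimal sampling-based search: draw k candidate responses and
    return the one the verifier scores highest (interfaces hypothetical)."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(k)]
    return max(candidates, key=verify)

# Toy stand-ins: "responses" are integers and the "verifier" prefers
# even answers, breaking ties by magnitude.
toy_generate = lambda prompt, rng: rng.randint(0, 100)
toy_verify = lambda r: (r % 2 == 0, r)
best = sample_and_verify(toy_generate, toy_verify, "what is 2+2?", k=64)
```

The implicit-scaling phenomenon in the abstract corresponds to `verify` itself becoming more discriminative as `k` grows.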
Paperid:264
Authors:Quentin Fruytier · Aryan Mokhtari · Sujay Sanghavi
Abstract:
Classical Mixtures of Experts (MoE) are Machine Learning models that involve partitioning the input space, with a separate "expert" model trained on each partition. Recently, MoE-based model architectures have become popular as a means to reduce training and inference costs. There, the partitioning function and the experts are both learnt jointly via gradient descent-type methods on the log-likelihood. In this paper we study theoretical guarantees of the Expectation Maximization (EM) algorithm for the training of MoE models. We first rigorously analyze EM for MoE where the conditional distribution of the target and latent variable conditioned on the feature variable belongs to an exponential family of distributions and show its equivalence to projected Mirror Descent with unit step size and a Kullback-Leibler Divergence regularizer. This perspective allows us to derive new convergence results and identify conditions for local linear convergence; in the special case of mixtures of $2$ linear or logistic experts, we additionally provide guarantees for linear convergence based on the signal-to-noise ratio. Experiments on synthetic and (small-scale) real-world data support that EM outperforms the gradient descent algorithm both in terms of convergence rate and the achieved accuracy.
Paperid:265
Authors:Seung Lee · Hojoon Kim · Yutack Park · Dawoon Jeong · Seungwu Han · Yeonhong Park · Jae W. Lee
Abstract:
Machine Learning Interatomic Potentials (MLIPs) enable efficient molecular dynamics (MD) simulations with high accuracy. While equivariant MLIPs achieve state-of-the-art accuracy, they face significant computational bottlenecks centered around their Tensor-Product layer, which accounts for up to 75\% of training time and causes substantial memory overhead. We present FlashTP, a highly optimized tensor-product library that addresses these inefficiencies through kernel fusion, sparse computation, and path-aggregated execution. FlashTP achieves up to 29.1$\times$ and 24.3$\times$ kernel speedups over e3nn and NVIDIA cuEquivariance, respectively. For SevenNet-l3i5, it delivers 4.2$\times$ and 3.5$\times$ speedup while reducing peak memory usage by 6.3$\times$ and 6.2$\times$ for inference and training, respectively.
Paperid:266
Authors:Aya Kayal · Sattar Vakili · Laura Toni · Da-shan Shiu · Alberto Bernacchia
Abstract:
Bayesian optimization (BO) with preference-based feedback has recently garnered significant attention due to its emerging applications. We refer to this problem as Bayesian Optimization from Human Feedback (BOHF), which differs from conventional BO by learning the best actions from a reduced feedback model, where only the preference between two actions is revealed to the learner at each time step. The objective is to identify the best action using a limited number of preference queries, typically obtained through costly human feedback. Existing work, which adopts the Bradley-Terry-Luce (BTL) feedback model, provides regret bounds for the performance of several algorithms. In this work, we develop tighter performance guarantees within the same framework. Specifically, we derive regret bounds of $\tilde{\mathcal{O}}(\sqrt{\Gamma(T)T})$, where $\Gamma(T)$ represents the maximum information gain—a kernel-specific complexity term—and $T$ is the number of queries. Our results significantly improve upon existing bounds. Notably, for common kernels, we show that the order-optimal sample complexities of conventional BO—achieved with richer feedback models—are recovered. In other words, the same number of preferential samples as scalar-valued samples is sufficient to find a nearly optimal solution.
Paperid:267
Authors:Jiajun Zhu · Peihao Wang · Ruisi Cai · Jason Lee · Pan Li · Zhangyang “Atlas” Wang
Abstract:
Transformers rely on both content-based and position-based addressing mechanisms to make predictions, but existing positional encoding techniques often diminish the effectiveness of position-based addressing. Many current methods enforce rigid patterns in attention maps, limiting the ability to model long-range dependencies and adapt to diverse tasks. Additionally, most positional encodings are learned as general biases, lacking the specialization required for different instances within a dataset. To address this, we propose conTextualized equivariAnt Position Encoding (TAPE), a novel framework that enhances positional embeddings by incorporating sequence content across layers. TAPE introduces dynamic, context-aware positional encodings, overcoming the constraints of traditional fixed patterns. By enforcing permutation and orthogonal equivariance, TAPE ensures the stability of positional encodings during updates, improving robustness and adaptability. Our method can be easily integrated into pre-trained transformers, offering parameter-efficient fine-tuning with minimal overhead. Extensive experiments show that TAPE achieves superior performance in language modeling, arithmetic reasoning, and long-context retrieval tasks compared to existing positional embedding techniques.
Paperid:268
Authors:Mohamed Ali Souibgui · Changkyu Choi · Andrey Barsky · Kangsoo Jung · Ernest Valveny · Dimosthenis Karatzas
Abstract:
We propose DocVXQA, a novel framework for visually self-explainable document question answering, where the goal is not only to produce accurate answers to questions but also to learn visual heatmaps that highlight critical regions, offering interpretable justifications for the model's decision. To integrate explanations into the learning process, we quantitatively formulate explainability principles as explicit learning criteria. Unlike conventional relevance map methods that solely emphasize regions relevant to the answer, our context-aware DocVXQA delivers explanations that are contextually sufficient yet representation-efficient. This fosters user trust while achieving a balance between predictive performance and interpretability in document visual question answering applications. Extensive experiments, including human evaluation, provide strong evidence supporting the effectiveness of our method.
Paperid:269
Authors:Chen Zhang · Weixin Bu · Zeyi Ren · Zhengwu Liu · Yik-Chung WU · Ngai Wong
Abstract:
Inferring properties of graph-structured data, e.g., the solubility of molecules, essentially involves learning the implicit mapping from graphs to their properties. This learning process is often costly for graph property learners like Graph Convolutional Networks (GCNs). To address this, we propose a paradigm called Graph Nonparametric Teaching (GraNT) that reinterprets the learning process through a novel nonparametric teaching perspective. Specifically, the latter offers a theoretical framework for teaching implicitly defined (i.e., nonparametric) mappings via example selection. Such an implicit mapping is realized by a dense set of graph-property pairs, with the GraNT teacher selecting a subset of them to promote faster convergence in GCN training. By analytically examining the impact of graph structure on parameter-based gradient descent during training, and recasting the evolution of GCNs—shaped by parameter updates—through functional gradient descent in nonparametric teaching, we show, for the first time, that teaching graph property learners (i.e., GCNs) is consistent with teaching structure-aware nonparametric learners. These findings enable GraNT to enhance the learning efficiency of graph property learners, showing significant reductions in training time for graph-level regression (-36.62\%), graph-level classification (-38.47\%), node-level regression (-30.97\%), and node-level classification (-47.30\%), all while maintaining generalization performance.
Paperid:270
Authors:Yuxuan Wang · Mingzhou Liu · Xinwei Sun · Wei Wang · Yizhou Wang
Abstract:
Determining the direction of relationships between variables is fundamental for understanding complex systems across scientific domains. While observational data can uncover relationships between variables, it cannot distinguish between cause and effect without experimental interventions. To effectively uncover causality, previous works have proposed intervention strategies that sequentially optimize the intervention values. However, most of these approaches primarily maximize information-theoretic gains that may not effectively measure the reliability of direction determination. In this paper, we formulate causal direction identification as a hypothesis-testing problem, and propose a Bayes factor-based intervention strategy, which can quantify the evidence strength of one hypothesis (e.g., causal) over the other (e.g., non-causal). To balance the immediate and future gains of testing strength, we propose a sequential intervention objective over intervention values in multiple steps. By analyzing the objective function, we develop a dynamic programming algorithm that reduces the complexity from non-polynomial to polynomial. Experimental results on bivariate systems, tree-structured graphs, and an embodied AI environment demonstrate the effectiveness of our framework in direction determination and its extensibility to both multivariate settings and real-world applications.
Paperid:271
Authors:Jonggeon Park · Giung Nam · Hyunsu Kim · Jongmin Yoon · Juho Lee
Abstract:
Neural network ensembles have proven effective in improving performance across a range of tasks; however, their high computational cost limits their applicability in resource-constrained environments or for large models. Ensemble distillation, the process of transferring knowledge from an ensemble teacher to a smaller student model, offers a promising solution to this challenge. The key is to ensure that the student model is both cost-efficient and achieves performance comparable to the ensemble teacher. With this in mind, we propose a novel ensemble distribution distillation method, which leverages flow matching to effectively transfer the diversity from the ensemble teacher to the student model. Our extensive experiments demonstrate the effectiveness of our proposed method compared to existing ensemble distillation approaches.
Paperid:272
Authors:Yu Sun · Xinhao Li · Karan Dalal · Jiarui Xu · Arjun Vikram · Genghan Zhang · Yann Dubois · Xinlei Chen · Xiaolong Wang · Sanmi Koyejo · Tatsunori Hashimoto · Carlos Guestrin
Abstract:
Self-attention performs well in long context but has quadratic complexity. Existing RNN layers have linear complexity, but their performance in long context is limited by the expressive power of their hidden state. Inspired by prior work (Schlag et al., 2021), we present a practical framework to instantiate sequence modeling layers with linear complexity and expressive hidden states. The key idea is to make the hidden state a machine learning model itself, and the update rule a step of self-supervised learning. Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training (TTT) layers. We consider two instantiations: TTT-Linear and TTT-MLP, whose hidden state is a linear model and a two-layer MLP respectively. We evaluate our instantiations at the scale of 125M to 1.3B parameters, comparing with a strong Transformer and Mamba, a modern RNN. Both TTT-Linear and TTT-MLP match or exceed the baselines. Similar to Transformer, they can keep reducing perplexity by conditioning on more tokens, while Mamba cannot after 16k context. With preliminary systems optimization, TTT-Linear is already faster than Transformer at 8k context and matches Mamba in wall-clock time. TTT-MLP still faces challenges in memory I/O, but shows larger potential in long context, pointing to a promising direction for future research.
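The key idea above can be sketched in a few lines: the layer's hidden state is itself a linear model `W`, and processing a token performs one self-supervised gradient step on `W`. The corruption and loss below are toy stand-ins (the paper learns its self-supervised views), so this is an illustration of the mechanism, not the paper's layer.

```python
import numpy as np

def ttt_linear(tokens, lr=0.1):
    """Sketch of a Test-Time Training layer: the hidden state is a linear
    model W, updated by one self-supervised gradient step per token."""
    d = tokens.shape[1]
    W = np.zeros((d, d))             # hidden state = weights of a linear model
    outputs = []
    for x in tokens:
        z = 0.5 * x                  # toy corrupted view of the token
        err = W @ z - x              # self-supervised reconstruction error
        W -= lr * np.outer(err, z)   # one gradient step = the state update
        outputs.append(W @ x)        # output rule: apply the updated model
    return np.stack(outputs)

rng = np.random.default_rng(0)
out = ttt_linear(rng.standard_normal((16, 4)))
```

Because the update is just training, it runs on test sequences too, which is what gives the layer its name.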
Paperid:273
Authors:Sheheryar Mehmood · Peter Ochs
Abstract:
Numerous optimization algorithms have a time-varying update rule thanks to, for instance, a changing step size, momentum parameter, or Hessian approximation. In this paper, we apply unrolled or automatic differentiation to a time-varying iterative process and provide convergence (rate) guarantees for the resulting derivative iterates. We adapt these convergence results and apply them to proximal gradient descent with variable step size and FISTA when solving partly-smooth problems. We test the convergence (rates) of these algorithms numerically through several experiments. Our theoretical and numerical results show that the convergence rate of the algorithm is reflected in its derivative iterates.
Paperid:274
Authors:Hyeong Kyu Choi · Maxim Khanov · Hongxin Wei · Sharon Li
Abstract:
Dataset contamination, where evaluation datasets overlap with pretraining corpora, inflates performance metrics and undermines the reliability of model evaluations. Quantifying dataset contamination thus becomes essential to ensure that performance evaluations genuinely reflect a model's ability to generalize to unseen data, rather than relying on memorized examples. To address this problem, we propose Kernel Divergence Score (KDS), a novel method that quantifies dataset contamination by computing the divergence between kernel similarity matrices of sample embeddings obtained before and after fine-tuning on the benchmark dataset. Leveraging the insight that fine-tuning affects unseen samples more significantly than seen ones, KDS provides a reliable measure of contamination. Through extensive experiments on controlled contamination scenarios, KDS demonstrates a near-perfect correlation with contamination levels and outperforms existing baselines. Additionally, we perform comprehensive ablation studies to analyze the impact of key design choices, providing deeper insights into the components and effectiveness of KDS. These ablations highlight the importance of leveraging fine-grained kernel-based information and confirm the reliability of the proposed framework across diverse datasets and settings.
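A rough sketch of the idea of a kernel-divergence-style score: build kernel similarity matrices over sample embeddings before and after fine-tuning and compare them. The RBF kernel, row-normalization, and row-wise KL below are illustrative choices, not necessarily the paper's exact divergence.

```python
import numpy as np

def kernel_divergence_score(emb_before, emb_after, gamma=0.1, eps=1e-12):
    """Compare RBF kernel similarity structure of embeddings before vs.
    after fine-tuning; a larger score means fine-tuning reshaped the
    sample-similarity geometry more (illustrative divergence choice)."""
    def row_stochastic_kernel(E):
        sq = ((E[:, None, :] - E[None, :, :]) ** 2).sum(-1)
        K = np.exp(-gamma * sq)
        return K / K.sum(axis=1, keepdims=True)  # rows become distributions
    P = row_stochastic_kernel(emb_before)
    Q = row_stochastic_kernel(emb_after)
    # Mean row-wise KL divergence between the two similarity structures.
    return float((P * np.log((P + eps) / (Q + eps))).sum(axis=1).mean())

rng = np.random.default_rng(0)
E0 = rng.standard_normal((32, 8))
score_unchanged = kernel_divergence_score(E0, E0)
score_shifted = kernel_divergence_score(E0, E0 + rng.normal(0, 0.5, E0.shape))
```

Identical embeddings give a score of zero; the more fine-tuning perturbs the embedding geometry, the larger the score.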
Paperid:275
Authors:Shibo Jie · Yehui Tang · Kai Han · Yitong Li · Duyu Tang · Zhi-Hong Deng · Yunhe Wang
Abstract:
Mixture-of-Experts (MoE) activates only a subset of experts during inference, allowing the model to maintain low inference FLOPs and latency even as the parameter count scales up. However, since MoE dynamically selects the experts, all the experts need to be loaded into VRAM. Their large parameter size still limits deployment, and offloading, which loads experts into VRAM only when needed, significantly increases inference latency. To address this, we propose Mixture of Lookup Experts (MoLoE), a new MoE architecture that is efficient in both communication and VRAM usage. In MoLoE, the experts are Feed-Forward Networks (FFNs) during training, taking the output of the embedding layer as input. Before inference, these experts can be re-parameterized as lookup tables (LUTs) that retrieve expert outputs based on input IDs, and offloaded to storage devices. Therefore, we do not need to perform expert computations during inference. Instead, we directly retrieve each expert's computation results based on input IDs and load them into VRAM, and thus the resulting communication overhead is negligible. Experiments show that, with the same FLOPs and VRAM usage, MoLoE achieves inference speeds comparable to dense models and significantly outperforms MoE with offloaded experts, while maintaining performance on par with MoE.
Paperid:276
Authors:Ren-Biao Liu · Anqi Li · ChaodingYang · Hui Sun · Ming Li
Abstract:
Large Language Models (LLMs) have demonstrated exceptional performance in code generation, becoming increasingly vital for software engineering and development. Recently, Chain-of-Thought (CoT) has proven effective for complex reasoning tasks by prompting LLMs to reason step-by-step before providing a final answer. However, research on how LLMs learn to reason with CoT data for code generation remains limited. In this work, we revisit classic CoT training, which typically places reasoning steps before the final answer. We synthesize a dataset that separates the CoT process from the code solution and then conduct extensive experiments to study empirically how CoT works in code generation. We observe counterintuitive findings, indicating that the classic training paradigm does not benefit code generation. Instead, training LLMs to generate code first and then generate the CoT explaining the reasoning steps is more effective. Specifically, our results indicate that a 9.9\% relative performance improvement can be achieved simply by swapping the order of CoT and code. Our findings offer meaningful insights into leveraging CoT to improve the reasoning ability of code LLMs and enhance code generation.
Paperid:277
Authors:Muhammed Ustaomeroglu · Guannan Qu
Abstract:
Self-attention has emerged as a core component of modern neural architectures, yet its theoretical underpinnings remain elusive. In this paper, we study self-attention through the lens of *interacting entities*, ranging from agents in multi-agent reinforcement learning to alleles in genetic sequences, and show that a single-layer linear self-attention can *efficiently* represent, learn, and generalize functions capturing pairwise interactions, including out-of-distribution scenarios. Our analysis reveals that self-attention acts as a *mutual interaction learner* under minimal assumptions on the diversity of interaction patterns observed during training, thereby encompassing a wide variety of real-world domains. In addition, we validate our theoretical insights through experiments demonstrating that self-attention, along with our proposed extensions, effectively learns interaction functions and generalizes across both population distributions and out-of-distribution scenarios. Extending self-attention, we introduce *HyperFeatureAttention*, a novel neural network module designed to learn *couplings of different feature-level interactions* between entities. Furthermore, we propose *HyperAttention*, a new mechanism that extends beyond pairwise interactions to capture multi-entity dependencies, such as three-way, four-way, or general $n$-way interactions.
Paperid:278
Authors:Ben Dai
Abstract:
Empirical risk minimization (ERM) with a computationally feasible surrogate loss is a widely accepted approach for classification. Notably, the convexity and calibration (CC) properties of a loss function ensure consistency of ERM in maximizing accuracy, thereby offering a wide range of options for surrogate losses. In this article, we propose a novel ensemble method, namely EnsLoss, which extends the ensemble learning concept to combine loss functions within the ERM framework. A key feature of our method is the consideration of preserving the ``legitimacy'' of the combined losses, i.e., ensuring the CC properties. Specifically, we first transform the CC conditions of losses into loss-derivatives, thereby bypassing the need for explicit loss functions and directly generating calibrated loss-derivatives. Therefore, inspired by Dropout, EnsLoss enables loss ensembles through one training process with doubly stochastic gradient descent (i.e., random batch samples and random calibrated loss-derivatives). We theoretically establish the statistical consistency of our approach and provide insights into its benefits. The numerical effectiveness of EnsLoss compared to fixed loss methods is demonstrated through experiments on a broad range of 45 pairs of CIFAR-10 datasets, the PCam image dataset, and 14 OpenML tabular datasets, with various deep learning architectures. Python repository and source code are available on anonymous GitHub at (https://anonymous.4open.science/r/ensLoss-53D1).
Paperid:279
Authors:Debangshu Banerjee · Tarun Suresh · Shubham Ugare · Sasa Misailovic · Gagandeep Singh
Abstract:
Code generation, symbolic math reasoning, and other tasks require LLMs to produce outputs that are both syntactically and semantically correct. Constrained LLM generation is a promising direction to enforce adherence to formal grammar, but prior works have empirically observed that strict enforcement of formal constraints often diminishes the reasoning capabilities of LLMs. In this work, we first provide a theoretical explanation for why constraining LLM outputs to very restrictive grammars that only allow syntactically valid final answers reduces the reasoning capabilities of the model. Second, we demonstrate that by augmenting the output grammar with carefully designed additional rules, it is always possible to preserve the reasoning capabilities of the LLM while ensuring syntactic and semantic correctness in its outputs. Building on these theoretical insights, we propose a reasoning-augmented constrained decoding algorithm, CRANE, which effectively balances the correctness of constrained generation with the flexibility of unconstrained generation. Experiments on multiple open-source LLMs and benchmarks show that CRANE significantly outperforms both state-of-the-art constrained decoding strategies and standard unconstrained decoding, showing up to a 9\% improvement over baselines on the challenging symbolic mathematical benchmarks GSM-symbolic and FOLIO.
Paperid:280
Authors:EMANUELE SANSONE · Tim Lebailly · Tinne Tuytelaars
Abstract:
We present a principled and simplified design of the projector and loss function for non-contrastive self-supervised learning based on hyperdimensional computing. We theoretically demonstrate that this design introduces an inductive bias that encourages representations to be simultaneously decorrelated and clustered, without explicitly enforcing these properties. This bias provably enhances generalization and suffices to avoid known training failure modes, such as representation, dimensional, cluster, and intra-cluster collapses. We validate our theoretical findings on image datasets, including SVHN, CIFAR-10, CIFAR-100, and ImageNet-100. Our approach effectively combines the strengths of feature decorrelation and cluster-based self-supervised learning methods, overcoming training failure modes while achieving strong generalization in clustering and linear classification tasks.
Paperid:281
Authors:Abhijeet Mulgund · Chirag Pabbaraju
Abstract:
The paradigm of weak-to-strong generalization constitutes the training of a strong AI model on data labeled by a weak AI model, with the goal that the strong model nevertheless outperforms its weak supervisor on the target task of interest. For the setting of real-valued regression with the squared loss, recent work quantitatively characterizes the gain in performance of the strong model over the weak model in terms of the misfit between the strong and weak model. We generalize such a characterization to learning tasks whose loss functions correspond to arbitrary Bregman divergences when the strong class is convex. This extends the misfit-based characterization of performance gain in weak-to-strong generalization to classification tasks, as the cross-entropy loss can be expressed in terms of a Bregman divergence. In most practical scenarios, however, the strong model class may not be convex. We therefore weaken this assumption and study weak-to-strong generalization for convex combinations of $k$ strong models in the strong class, in the concrete setting of classification. This allows us to obtain a similar misfit-based characterization of performance gain, up to an additional error term that vanishes as $k$ gets large. Our theoretical findings are supported by thorough experiments on synthetic as well as real-world datasets.
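Since the characterization above runs through Bregman divergences, a minimal numerical sketch (illustrative helper names, not the authors' code) of the definition $B_F(p, q) = F(p) - F(q) - \langle \nabla F(q), p - q \rangle$, showing how the squared loss and the KL divergence arise as the two special cases mentioned in the abstract:

```python
import numpy as np

def bregman(F, grad_F, p, q):
    # B_F(p, q) = F(p) - F(q) - <grad F(q), p - q>
    return F(p) - F(q) - np.dot(grad_F(q), p - q)

# Generator F(x) = ||x||^2 recovers the squared loss: B_F(p, q) = ||p - q||^2.
sq = lambda x: np.dot(x, x)
grad_sq = lambda x: 2 * x

# Negative entropy recovers KL(p || q) for probability vectors.
negent = lambda x: np.sum(x * np.log(x))
grad_negent = lambda x: np.log(x) + 1

p = np.array([0.2, 0.8])
q = np.array([0.5, 0.5])
print(bregman(sq, grad_sq, p, q))          # ||p - q||^2 = 0.18
print(bregman(negent, grad_negent, p, q))  # KL(p || q)
```

Verifying both identities on a small example makes it concrete why a single misfit-based bound can cover regression (squared loss) and classification (cross-entropy) at once.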
Paperid:282
Authors:Guoqiang Zhang · John Lewis · W. Bastiaan Kleijn
Abstract:
Various reversible deep neural network (DNN) models have been proposed to reduce memory consumption in the training process. However, almost all existing reversible DNNs either require special non-standard architectures or are constructed by modifying existing DNN architectures considerably to enable reversibility. In this work, we present the BDIA-transformer, which is an exact bit-level reversible transformer that uses an unchanged standard architecture for inference. The basic idea is to first treat each transformer block as the Euler integration approximation for solving an ordinary differential equation (ODE) and then incorporate the technique of bidirectional integration approximation (BDIA) (originally designed for diffusion inversion) into the neural architecture, together with activation quantization to make it exactly bit-level reversible. In the training process, we let a hyper-parameter $\gamma$ in BDIA-transformer randomly take one of the two values $\{0.5, -0.5\}$ per training sample per transformer block for averaging every two consecutive integration approximations. As a result, BDIA-transformer can be viewed as training an ensemble of ODE solvers parameterized by a set of binary random variables, which regularizes the model and results in improved validation accuracy. Lightweight side information per transformer block is required to be stored in the forward process to account for binary quantization loss to enable exact bit-level reversibility. In the inference procedure, the expectation $\mathbb{E}(\gamma)=0$ is taken to make the resulting architecture of BDIA-transformer identical to transformers up to activation quantization.
Our experiments in image classification, natural language generation, and language translation show that BDIA-transformers outperform their conventional counterparts significantly in terms of validation performance due to the regularization effect of the set of $\gamma$ random variables while also requiring considerably less training memory.
Paperid:283
Authors:Qiang Li · Michal Yemini · Hoi To Wai
Abstract:
This paper studies the convergence of clipped stochastic gradient descent (SGD) algorithms with decision-dependent data distribution. Our setting is motivated by privacy-preserving optimization algorithms that interact with performative data, where the prediction models can influence future outcomes. This challenging setting involves the non-smooth clipping operator and non-gradient dynamics due to distribution shifts. We make two contributions in pursuit of a performatively stable solution with these algorithms. First, we characterize the stochastic bias of the projected clipped SGD (PCSGD) algorithm, caused by the clipping operator, which prevents PCSGD from reaching a stable solution. When the loss function is strongly convex, we quantify lower and upper bounds for this stochastic bias and demonstrate a bias amplification phenomenon driven by the sensitivity of the data distribution. When the loss function is non-convex, we bound the magnitude of the stationarity bias. Second, we propose remedies to mitigate the bias, either by utilizing an optimal step-size design for PCSGD or by applying the recent DiceSGD algorithm [Zhang et al., 2024]. Our analysis also shows that the latter algorithm is free from stochastic bias in the performative setting. Numerical experiments verify our findings.
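The non-smooth operator at the heart of PCSGD can be sketched as a generic clip-then-project update (an illustrative box constraint and hypothetical function names; not the authors' implementation):

```python
import numpy as np

def pcsgd_step(theta, grad, step_size, clip_c, lo, hi):
    # Norm-clip the stochastic gradient to have norm at most clip_c.
    g_norm = np.linalg.norm(grad)
    g_clipped = grad * min(1.0, clip_c / g_norm) if g_norm > 0 else grad
    # Gradient step followed by projection onto the box [lo, hi]^d.
    return np.clip(theta - step_size * g_clipped, lo, hi)

theta = np.array([2.0, -3.0])
grad = np.array([30.0, 40.0])  # norm 50, clipped down to unit norm
theta_next = pcsgd_step(theta, grad, step_size=0.1, clip_c=1.0, lo=-2.5, hi=2.5)
print(theta_next)  # [1.94, -2.5]
```

The clipping factor `min(1, c/||g||)` is exactly the non-smoothness that induces the stochastic bias studied in the paper: whenever it is active, the update direction is no longer an unbiased gradient estimate.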
Paperid:284
Authors:Wenbo Pan · Zhichao Liu · Qiguang Chen · Xiangyang Zhou · Yu Haining · Xiaohua Jia
Abstract:
Large Language Models' safety-aligned behaviors, such as refusing harmful queries, can be represented by linear directions in activation space. Previous research modeled safety behavior with a single direction, limiting mechanistic understanding to an isolated safety feature. In this work, we discover that safety-aligned behavior is jointly controlled by multi-dimensional directions. Namely, we study the vector space of representation shifts during safety fine-tuning on Llama 3 8B for refusing jailbreaks. By studying orthogonal directions in the space, we first find that a dominant direction governs the model's refusal behavior, while multiple smaller directions represent distinct and interpretable features like hypothetical narrative and role-playing. We then measure how different directions promote or suppress the dominant direction, showing the important role of secondary directions in shaping the model's refusal representation. Finally, we demonstrate that removing certain trigger tokens in harmful queries can mitigate these directions to bypass the learned safety capability, providing new insights into safety-alignment vulnerability from a multi-dimensional perspective.
Paperid:285
Authors:GUOGUO AI · Guansong Pang · Hezhe Qiao · Yuan Gao · Hui Yan
Abstract:
Graph Transformers (GTs) have demonstrated remarkable performance in graph representation learning over popular graph neural networks (GNNs). However, self-attention, the core module of GTs, preserves only low-frequency signals in graph features, leading to ineffectiveness in capturing other important signals like high-frequency ones. Some recent GT models help alleviate this issue, but their flexibility and expressiveness are still limited since the filters they learn are fixed to a predefined graph spectrum or order. To tackle this challenge, we propose a Graph Fourier Kolmogorov-Arnold Transformer (GrokFormer), a novel GT model that learns highly expressive spectral filters with adaptive graph spectrum and order through a Fourier series modeling over learnable activation functions. We demonstrate theoretically and empirically that the proposed GrokFormer filter offers better expressiveness than other spectral methods. Comprehensive experiments on 10 real-world node classification datasets across various domains, scales, and graph properties, as well as 5 graph classification datasets, show that GrokFormer outperforms state-of-the-art GTs and GNNs. Our code is available at https://anonymous.4open.science/r/GrokFormer-1DCE.
Paperid:286
Authors:Feiyang Wang · Xingquan Zuo · Hai Huang · Gang Chen
Abstract:
A key challenge in black-box adversarial attacks is the high query complexity in hard-label settings, where only the top-1 predicted label from the target deep model is accessible. In this paper, we propose a novel normal-vector-based method called Two-third Bridge Attack (TtBA). An innovative bridge direction is introduced, defined as a weighted combination of the current unit perturbation direction and its unit normal vector, controlled by a weight parameter $k$. We further use binary search to identify $k=k_\text{bridge}$, which yields the same decision boundary as the current direction. Notably, we observe that $k=2/3 k_\text{bridge}$ yields a near-optimal perturbation direction, ensuring the stealthiness of the attack. In addition, we investigate the critical importance of local optima during the perturbation-direction optimization process and propose a simple and effective approach to detect and escape such local optima. Experimental results on MNIST, FASHION-MNIST, CIFAR10, CIFAR100, and ImageNet datasets demonstrate the strong performance and scalability of our approach. Compared to state-of-the-art non-targeted and targeted attack methods, TtBA consistently delivers superior performance across most of the evaluated datasets and deep learning models. Code is available at https://anonymous.4open.science/r/TtBA-6ECF.
Paperid:287
Authors:Yuefan Cao · Xiaoyu Li · Yingyiu Liang · Zhizhou Sha · Zhenmei Shi · Zhao Song · Jiahao Zhang
Abstract:
As AI research surges in both impact and volume, conferences have imposed submission limits to maintain paper quality and alleviate organizational pressure. In this work, we examine the fairness of desk-rejection systems under submission limits and reveal that existing practices can result in substantial inequities. Specifically, we formally define the paper submission limit problem and identify a critical dilemma: when the number of authors exceeds three, it becomes impossible to reject papers solely based on excessive submissions without negatively impacting innocent authors. Thus, this issue may unfairly affect early-career researchers, as their submissions may be penalized due to co-authors with significantly higher submission counts, while senior researchers with numerous papers face minimal consequences. To address this, we propose an optimization-based fairness-aware desk-rejection mechanism and formally define two fairness metrics: individual fairness and group fairness. We prove that optimizing individual fairness is NP-hard, whereas group fairness can be efficiently optimized via linear programming. Through case studies, we demonstrate that our proposed system ensures greater equity than existing methods, including those used in CVPR 2025, offering a more socially just approach to managing excessive submissions in AI conferences.
Paperid:288
Authors:Atefeh Gilani · Felipe Gomez · Shahab Asoodeh · Flavio Calmon · Oliver Kosut · Lalitha Sankar
Abstract:
We propose a unified optimization framework for designing continuous and discrete noise distributions that ensure differential privacy (DP) by minimizing R\'enyi DP, a variant of DP, under a cost constraint. R\'enyi DP has the advantage that, by considering different values of the R\'enyi parameter $\alpha$, we can tailor our optimization for any number of compositions. To solve the optimization problem, we reduce it to a finite-dimensional convex formulation and perform preconditioned gradient descent. The resulting noise distributions are then compared to their Gaussian and Laplace counterparts. Numerical results demonstrate that our optimized distributions are consistently better, with significant improvements in $(\varepsilon, \delta)$-DP guarantees in the moderate composition regimes, compared to Gaussian and Laplace distributions with the same variance.
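The objective being minimized is built on the R\'enyi divergence $D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \log \sum_i p_i^\alpha q_i^{1-\alpha}$; a small sketch of evaluating it for discrete noise distributions, checked against the standard closed form for two shifted Gaussians (a generic sanity check, not the paper's optimizer):

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    # D_alpha(P || Q) = log( sum_i p_i^alpha * q_i^(1 - alpha) ) / (alpha - 1)
    return np.log(np.sum(p ** alpha * q ** (1.0 - alpha))) / (alpha - 1.0)

# Sanity check against the closed form for Gaussians:
# D_alpha( N(0, s^2) || N(d, s^2) ) = alpha * d^2 / (2 s^2).
x = np.arange(-20.0, 21.0, 1e-3)
p = np.exp(-x ** 2 / 2);        p /= p.sum()   # discretized N(0, 1)
q = np.exp(-(x - 1) ** 2 / 2);  q /= q.sum()   # discretized N(1, 1)
print(renyi_divergence(p, q, alpha=2.0))       # ~ 2 * 1 / (2 * 1) = 1.0
```

Sweeping $\alpha$ in such a routine is what lets the framework tailor the noise distribution to a target number of compositions.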
Paperid:289
Authors:Binchi Zhang · Zaiyi Zheng · Zhengzhang Chen · Jundong Li
Abstract:
Symmetry in the parameter space of deep neural networks (DNNs) has proven beneficial for various deep learning applications. A well-known example is the permutation symmetry in Multi-Layer Perceptrons (MLPs), where permuting the rows of weight matrices in one layer and applying the inverse permutation to adjacent layers yields a functionally equivalent model. While permutation symmetry fully characterizes the equivalence set for MLPs, its discrete nature limits its utility for transformers. In this paper, we introduce rotation symmetry, a novel form of parameter space symmetry for transformers that generalizes permutation symmetry by rotating parameter matrices in self-attention layers. Unlike permutation symmetry, rotation symmetry operates in a continuous domain, thereby significantly expanding the equivalence set for transformers. Based on this property, we propose a theoretically optimal parameter matching algorithm as a plug-and-play module to enhance model fusion. We evaluate our approach using pre-trained transformers across diverse natural language and vision tasks. Experimental results demonstrate that our rotation symmetry-based matching algorithm substantially improves model fusion, highlighting the potential of parameter space symmetry to facilitate model fusion. Our code is available at https://anonymous.4open.science/r/TransformerPermutation-8800.
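The rotation symmetry can be checked numerically on the attention logits: inserting an orthogonal matrix $R$ into both the query and key projections cancels out, since $(XW_QR)(XW_KR)^\top = XW_Q R R^\top W_K^\top X^\top = XW_Q W_K^\top X^\top$. A minimal sketch (generic matrices, not the authors' matching algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))     # token representations
Wq = rng.normal(size=(d, d))    # query projection
Wk = rng.normal(size=(d, d))    # key projection

# Random orthogonal R via QR decomposition of a Gaussian matrix.
R, _ = np.linalg.qr(rng.normal(size=(d, d)))

logits = (X @ Wq) @ (X @ Wk).T
logits_rot = (X @ Wq @ R) @ (X @ Wk @ R).T  # rotate both projections

print(np.allclose(logits, logits_rot))  # True: attention is unchanged
```

Because $R$ ranges over a continuous group rather than a finite set of permutations, the equivalence set this symmetry generates is much larger, which is what the matching algorithm exploits.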
Paperid:290
Authors:Alex Vitvitskyi · João Madeira Araujo · Marc Lackenby · Petar Veličković
Abstract:
As implied by the plethora of literature on graph rewiring, the choice of computational graph employed by a neural network can make a significant impact on its downstream performance. Certain effects related to the computational graph, such as under-reaching and over-squashing, may even render the model incapable of learning certain functions. Most of these effects have only been thoroughly studied in the domain of undirected graphs; however, recent years have seen a significant rise in interest in feedforward computational graphs: directed graphs without any back edges. In this paper, we study the desirable properties of a feedforward computational graph, discovering two important complementary measures: fidelity and mixing time, and evaluating a few popular choices of graphs through the lens of these measures. Our study is backed both by theoretical analyses of the metrics' asymptotic behaviour for various graphs and by correlating these metrics with the performance of trained neural network models using the corresponding graphs.
Paperid:291
Authors:Ang Cao · Sergio Arnaud · Oleksandr Maksymets · Jianing Yang · Ayush Jain · Ada Martin · Vincent-Pierre Berges · Paul McVay · Ruslan Partsey · Aravind Rajeswaran · Franziska Meier · Justin Johnson · Jeong Joon Park · Alexander Sax
Abstract:
Our approach to training 3D vision-language understanding models is to train a feedforward model that makes predictions in 3D but never requires 3D labels: it is supervised only in 2D, using 2D losses and differentiable rendering. The approach is new for vision-language understanding. By treating the reconstruction as a ``latent variable'', we can render the outputs without placing unnecessary constraints on the network architecture (e.g., it can be used with decoder-only models). Training requires only images, camera poses, and 2D labels; we show that the need for 2D labels can even be removed by using pseudo-labels from pretrained 2D models. We use this approach to pretrain a network and then finetune it for 3D vision-language understanding tasks. We show that this approach outperforms state-of-the-art baselines for 3D vision-language grounding, and also outperforms other 3D pretraining techniques.
Paperid:292
Authors:Ihab Bendidi · Yassir El Mesbahi · Alisandra Denton · Karush Suri · Kian Kenyon-Dean · Auguste Genovesio · Emmanuel Noutahi
Abstract:
Understanding cellular responses to stimuli is crucial for biological discovery and drug development. Transcriptomics provides interpretable, gene-level insights, while microscopy imaging offers rich predictive features but is harder to interpret. Weakly paired datasets, where samples share biological states, enable multimodal learning but are scarce, limiting their utility for training and multimodal inference. We propose a framework to enhance transcriptomics by distilling knowledge from microscopy images. Using weakly paired data, our method aligns and binds modalities, enriching gene expression representations with morphological information. To address data scarcity, we introduce (1) Semi-Clipped, an adaptation of CLIP for cross-modal distillation using pretrained foundation models, achieving state-of-the-art results, and (2) PEA (Perturbation Embedding Augmentation), a novel augmentation technique that enhances transcriptomics data while preserving inherent biological information. These strategies improve the predictive power and retain the interpretability of transcriptomics, enabling rich unimodal representations for complex biological tasks.
Paperid:293
Authors:Thomas Massena · Léo Andéol · Thibaut Boissin · Franck Mamalet · Corentin FRIEDRICH · Mathieu Serrurier · Sébastien Gerchinovitz
Abstract:
Conformal Prediction (CP) has proven to be an effective post-hoc method for improving the trustworthiness of neural networks by providing prediction sets with finite-sample guarantees. However, under adversarial attacks, classical conformal guarantees do not hold anymore: this problem is addressed in the field of Robust Conformal Prediction. Several methods have been proposed to provide robust CP sets with guarantees under adversarial perturbations, but, for large-scale problems, these sets are either too large or the methods are too computationally demanding to be deployed in real-life scenarios. In this work, we propose a new method that leverages Lipschitz-bounded networks to precisely and efficiently estimate robust CP sets. When combined with a 1-Lipschitz robust network, we demonstrate that our lip-rcp method outperforms state-of-the-art results in both the size of the robust CP sets and computational efficiency in medium and large-scale scenarios such as ImageNet. Taking a different angle, we also study vanilla CP under attack, and derive new worst-case coverage bounds of vanilla CP sets, which are valid simultaneously for all adversarial attack levels. Our lip-rcp method makes this second approach very efficient as well, as shown in the experiments.
Paperid:294
Authors:Zheng Gong · Ying Sun
Abstract:
Discrete Graph Diffusion Models (DGDMs) mark a pivotal advancement in graph generation, effectively preserving sparsity and structural integrity, thereby enhancing the learning of graph data distributions for diverse generative applications. Despite their potential, DGDMs are computationally intensive due to the numerous low-parameter yet high-computation operations, thereby increasing the need for inference acceleration. A promising solution to mitigate this issue is model quantization. However, existing quantization techniques for Image Diffusion Models (IDMs) face limitations in DGDMs due to differing diffusion processes, while Large Language Model (LLM) quantization focuses on reducing the memory access latency of loading large parameters; in DGDMs, by contrast, the inference bottleneck is computation, owing to smaller model sizes. To fill this gap, we introduce Bit-DGDM, a post-training quantization framework for DGDMs which incorporates two novel ideas: (i) sparse-dense activation quantization, which sparsely models activation outliers in full precision through adaptively selected, data-free thresholds and quantizes the rest to low-bit; and (ii) ill-conditioned low-rank decomposition, which decomposes the weights into a low-rank component that enables faster inference and an $\alpha$-sparsity matrix that models outliers. Extensive experiments demonstrate that Bit-DGDM not only reduces memory usage from the FP32 baseline by up to $2.8\times$ and achieves up to $2.5\times$ speedup, but also achieves performance comparable to ultra-low precision of up to 4-bit.
Paperid:295
Authors:Yufei Guo · Yuhan Zhang · Zhou Jie · Xiaode Liu · Xin Tong · Yuanpei Chen · Weihang Peng · Zhe Ma
Abstract:
The Spiking Neural Network (SNN), a biologically inspired neural network infrastructure, has garnered significant attention recently. SNNs utilize binary spike activations for efficient information transmission, replacing multiplications with additions, thereby enhancing energy efficiency. However, binary spike activation maps often fail to capture sufficient data information, resulting in reduced accuracy. To address this challenge, we advocate reversing the bit widths of the weight and activation, called \textbf{ReverB}, inspired by recent findings that highlight greater accuracy degradation from quantizing activations compared to weights. Specifically, our method employs real-valued spike activations alongside binary weights in SNNs. This preserves the event-driven and multiplication-free advantages of standard SNNs while enhancing the information capacity of activations. Additionally, we introduce a trainable factor within binary weights to adaptively learn suitable weight amplitudes during training, thereby increasing network capacity. To maintain efficiency akin to vanilla \textbf{ReverB}, our trainable binary-weight SNNs are converted back to standard form using a re-parameterization technique during inference. Extensive experiments across various network architectures and datasets, both static and dynamic, demonstrate that our approach consistently outperforms state-of-the-art methods.
Paperid:296
Authors:Harit Vishwakarma · Yi Chen · Satya Sai Srinath Namburi GNVV · Sui Jiet Tay · Ramya Vinayak · Frederic Sala
Abstract:
Modern semi-supervised learning (SSL) methods rely on pseudolabeling and consistency regularization. Pseudolabeling is typically performed by comparing the model's confidence scores to a predefined threshold. While several heuristics have been proposed to improve threshold selection, the underlying issues of overconfidence and miscalibration in confidence scores remain largely unaddressed, leading to inaccurate pseudolabels, degraded test accuracy, and prolonged training. We take a first-principles approach to learn confidence scores and thresholds with an explicit knob for error. This flexible framework addresses the fundamental question of optimal scores and threshold selection in pseudolabeling. Moreover, it gives practitioners a principled way to control the quality and quantity of pseudolabels. Such control is vital in SSL, where balancing pseudolabel quality and quantity directly affects model performance and training efficiency. Our experiments show that, by integrating this framework with modern SSL methods, we achieve significant improvements in accuracy and training efficiency. In addition, we provide novel insights on the trade-offs between the choices of the error parameter and the end model's performance.
Paperid:297
Authors:Ansgar Lößer · Max Schlecht · Florian Schintke · Joel Witzke · Matthias Weidlich · Björn Scheuermann
Abstract:
Segmented regression is a statistical method that approximates a function $f$ by a piecewise function $\hat{f}$ using noisy data samples. *Min-$\epsilon$* approaches aim to reduce the regression function's mean squared error (MSE) for a given number $k$ of segments. An optimal solution for *min-$\epsilon$* segmented regression is found in $\mathcal{O}(n^2)$ time (Bai & Perron, 1998; Yamamoto & Perron, 2013) for $n$ samples. For large datasets, current heuristics improve time complexity to $\mathcal{O}(n\log{n})$ (Acharya et al., 2016) but can result in large errors, especially when exactly $k$ segments are used. We present a method for *min-$\epsilon$* segmented regression that combines the scalability of top existing heuristic solutions with a statistical efficiency similar to the optimal solution. This is achieved by using a new method to merge an initial set of segments using matrices precomputed from the samples, allowing both merging and error calculation in constant time. Our approach, using the same samples and parameter $k$, produces segments with up to 1,000 times lower MSE compared to Acharya et al. (2016) in about 100 times less runtime on datasets with over $10^4$ samples.
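The constant-time merge-and-error claim rests on a standard idea: each segment carries precomputed sufficient statistics of a least-squares line fit, which are additive under merging. A sketch of that idea (illustrative code, not the authors' exact matrices):

```python
import numpy as np

def stats(x, y):
    # Sufficient statistics of a segment: (n, Sx, Sy, Sxx, Sxy, Syy).
    # Two segments merge by simple addition, i.e. in O(1) time.
    return np.array([len(x), x.sum(), y.sum(),
                     (x * x).sum(), (x * y).sum(), (y * y).sum()])

def sse(s):
    # Least-squares SSE of a line fit, from the statistics alone (O(1)).
    n, sx, sy, sxx, sxy, syy = s
    cxx = sxx - sx * sx / n
    cxy = sxy - sx * sy / n
    cyy = syy - sy * sy / n
    return cyy - (cxy * cxy / cxx if cxx > 0 else 0.0)

rng = np.random.default_rng(1)
x = np.arange(100.0)
y = 2.0 * x + 1.0 + rng.normal(size=100)

left, right = stats(x[:60], y[:60]), stats(x[60:], y[60:])
merged = left + right          # O(1) merge of two segments
direct = sse(stats(x, y))      # error computed from all samples at once
print(abs(sse(merged) - direct) < 1e-6)  # True
```

Merging candidate segments and scoring the resulting error thus costs $\mathcal{O}(1)$ per operation, which is what keeps the overall heuristic scalable.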
Paperid:298
Authors:Yi Yu · Song Xia · SIYUAN YANG · Chenqi KONG · Wenhan Yang · Shijian Lu · Yap-peng Tan · Alex Kot
Abstract:
Most existing unlearnable strategies focus on preventing unauthorized users from training single-task learning (STL) models with personal data. Nevertheless, the paradigm has recently shifted towards multi-task data and multi-task learning (MTL), targeting generalist and foundation models that can handle multiple tasks simultaneously. Despite their growing importance, MTL data and models have been largely neglected while pursuing unlearnable strategies. This paper presents MTL-UE, the first unified framework for generating unlearnable examples for multi-task data and MTL models. Instead of optimizing perturbations for each sample, we design a generator-based structure that introduces label priors and class-wise feature embeddings, which leads to much better attacking performance. In addition, MTL-UE incorporates intra-task and inter-task embedding regularization to increase inter-class separation and suppress intra-class variance, which greatly enhances the attack robustness. Furthermore, MTL-UE is versatile, with good support for dense prediction tasks in MTL. It is also plug-and-play, allowing existing surrogate-dependent unlearnable methods to be integrated with little adaptation. Extensive experiments show that MTL-UE achieves superior attacking performance consistently across 4 MTL datasets, 3 base UE methods, 5 model backbones, and 5 MTL task-weighting strategies.
Paperid:299
Authors:Linxi Zhao · Yihe Deng · Weitong Zhang · Quanquan Gu
Abstract:
The advancement of Large Vision-Language Models (LVLMs) has increasingly highlighted the critical issue of their tendency to hallucinate non-existing objects in the images. To address this issue, previous works focused on using specially curated datasets or powerful LLMs (e.g., GPT-3.5) to rectify the outputs of LVLMs. However, these approaches require either expensive training/fine-tuning or API access to advanced LLMs for post-generation correction. In response to these limitations, we propose Mitigating hallucinAtion via image-gRounded guIdaNcE (MARINE), a framework that is both training-free and API-free. MARINE effectively and efficiently reduces object hallucinations during inference by introducing image-grounded guidance to LVLMs. This is achieved by leveraging open-source vision models to extract object-level information, thereby enhancing the precision of LVLM-generated content. Our framework's flexibility further allows for the integration of multiple vision models, enabling more reliable and robust object-level guidance. Through comprehensive evaluations across popular LVLMs with diverse evaluation metrics and benchmarks, we demonstrate the effectiveness of MARINE, which even outperforms existing fine-tuning-based methods. Remarkably, it reduces hallucinations consistently in GPT-4V-assisted evaluation while maintaining the detailedness of LVLMs' generations.
Paperid:300
Authors:Laura Zheng · Wenjie Wei · Tony Wu · Jacob Clements · Shreelekha Revankar · Andre Harrison · Yu Shen · Ming Lin
Abstract:
Achieving robustness in image segmentation models is challenging due to the fine-grained nature of pixel-level classification. These models, which are crucial for many real-time perception applications, particularly struggle when faced with natural corruptions in the wild for autonomous systems. While sensitivity analysis can help us understand how input variables influence model outputs, its application to natural and uncontrollable corruptions in training data is computationally expensive. In this work, we present an adaptive, sensitivity-guided augmentation method to enhance robustness against natural corruptions. Our sensitivity analysis on average runs 10 times faster and requires about 200 times less storage than previous sensitivity analysis, enabling practical, on-the-fly estimation during training for a model-free augmentation policy. With minimal fine-tuning, our sensitivity-guided augmentation method achieves improved robustness on both real-world and synthetic datasets compared to state-of-the-art data augmentation techniques in image segmentation.
Paperid:301
Authors:Martino Bernasconi · Matteo Castiglioni · Andrea Celli
Abstract:
In the bandits with knapsacks framework (BwK) the learner has $m$ resource-consumption (i.e., packing) constraints. We focus on the generalization of BwK in which the learner has a set of general long-term constraints. The goal of the learner is to maximize their cumulative reward, while at the same time achieving small cumulative constraints violations. In this scenario, there exist simple instances where conventional methods for BwK fail to yield sublinear violations of constraints. We show that it is possible to circumvent this issue by requiring the primal and dual algorithm to be weakly adaptive. Indeed, even without any information on the Slater's parameter $\rho$ characterizing the problem, the interaction between weakly adaptive primal and dual regret minimizers leads to a ``self-bounding'' behavior of dual variables. In particular, their norm remains suitably upper bounded across the entire time horizon even without explicit projection steps. By exploiting this property, we provide best-of-both-worlds guarantees for stochastic and adversarial inputs. In the first case, we show that the algorithm guarantees sublinear regret. In the latter case, we establish a tight competitive ratio of $\rho/(1+\rho)$. In both settings, constraints violations are guaranteed to be sublinear in time. Finally, these results allow us to obtain new results for the problem of contextual bandits with linear constraints, providing the first no-$\alpha$-regret guarantees for adversarial contexts.
Paperid:302
Authors:XINMIN QIU · Gege Chen · Bonan Li · Congying Han · Tiande Guo · Zicheng Zhang
Abstract:
Blind face restoration (BFR), which involves converting low-quality (LQ) images into high-quality (HQ) images, remains challenging due to complex and unknown degradations. While previous diffusion-based methods utilize feature extractors from LQ images as guidance, using raw LQ images directly as the starting point for the reverse diffusion process offers a theoretically optimal solution. In this work, we propose Pseudo-Hashing Image-to-image Schrödinger Bridge (P-I2SB), a novel framework inspired by optimal mass transport problems, which enhances the restoration potential of Schrödinger Bridge (SB) by correcting data distributions and effectively learning the optimal transport path between any two data distributions. Notably, we theoretically explore and identify that existing methods are limited by the optimality and reversibility of solutions in SB, leading to suboptimal performance. Our approach involves preprocessing HQ images during training by hashing them into pseudo-samples according to a rule related to LQ images, ensuring structural similarity in distribution. This guarantees optimal and reversible solutions in SB, enabling the inference process to learn effectively and allowing P-I2SB to achieve state-of-the-art results in BFR, with more natural textures and retained inference speed compared to previous methods.
Paperid:303
Authors:Daiheng Gao · Shilin Lu · Wenbo Zhou · Jiaming Chu · Jie Zhang · Mengxi Jia · Bang Zhang · Zhaoxin Fan · Weiming Zhang
Abstract:
Removing unwanted concepts from large-scale text-to-image (T2I) diffusion models while maintaining their overall generative quality remains an open challenge. This difficulty is especially pronounced in emerging paradigms, such as Stable Diffusion (SD) v3 and Flux, which incorporate flow matching and transformer-based architectures. These advancements limit the transferability of existing concept-erasure techniques that were originally designed for the previous T2I paradigm (e.g., SD v1.4). In this work, we introduce EraseAnything, the first method specifically developed to address concept erasure within the latest flow-based T2I framework. We formulate concept erasure as a bi-level optimization problem, employing LoRA-based parameter tuning and an attention map regularizer to selectively suppress undesirable activations. Furthermore, we propose a self-contrastive learning strategy to ensure that removing unwanted concepts does not inadvertently harm performance on unrelated ones. Experimental results demonstrate that EraseAnything successfully fills the research gap left by earlier methods in this new T2I paradigm, achieving state-of-the-art performance across a wide range of concept erasure tasks.
Paperid:304
Authors:Maximilian Müller · Matthias Hein
Abstract:
Detecting out-of-distribution (OOD) examples is an important task for deploying reliable machine learning models in safety-critical applications. While post-hoc methods based on the Mahalanobis distance applied to pre-logit features are among the most effective for ImageNet-scale OOD detection, their performance varies significantly across models. We connect this inconsistency to strong variations in feature norms, indicating severe violations of the Gaussian assumption underlying the Mahalanobis distance estimation. We show that simple $\ell_2$-normalization of the features mitigates this problem effectively, aligning better with the premise of normally distributed data with a shared covariance matrix. Extensive experiments on 44 models across diverse architectures and pretraining schemes show that $\ell_2$-normalization improves the conventional Mahalanobis distance-based approaches significantly and consistently, and outperforms other recently proposed OOD detection methods.
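The fix described in this abstract is easy to prototype: fit class means and a shared (tied) covariance on pre-logit features, optionally after $\ell_2$-normalization, and score test points by their minimum squared Mahalanobis distance. The sketch below is illustrative (the function name and synthetic setup are ours, not the authors' code), assuming integer class labels 0..K-1:

```python
import numpy as np

def mahalanobis_ood_scores(train_feats, train_labels, test_feats, normalize=True):
    """OOD score: minimum squared Mahalanobis distance to any class mean,
    using a shared covariance matrix; lower means more in-distribution."""
    if normalize:  # the l2-normalization step advocated in the abstract
        train_feats = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
        test_feats = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    classes = np.unique(train_labels)
    means = np.stack([train_feats[train_labels == c].mean(axis=0) for c in classes])
    centered = train_feats - means[train_labels]          # assumes labels are 0..K-1
    prec = np.linalg.pinv(centered.T @ centered / len(train_feats))
    diff = test_feats[:, None, :] - means[None, :, :]     # shape (n_test, K, d)
    dists = np.einsum('nkd,de,nke->nk', diff, prec, diff)
    return dists.min(axis=1)
```

On toy 2D clusters, in-distribution points score well below points drawn far from every class mean.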
Paperid:305
Authors:Guanghui Wang · Zhiyong Yang · Zitai Wang · Shi Wang · Qianqian Xu · Qingming Huang
Abstract:
Knowledge Distillation (KD) transfers knowledge from a large teacher model to a smaller student model by minimizing the divergence between their output distributions, typically using forward Kullback-Leibler divergence (FKLD) or reverse KLD (RKLD). It has become an effective training paradigm due to the broader supervision information provided by the teacher distribution compared to one-hot labels. We identify that the core challenge in KD lies in balancing two mode-concentration effects: the \textbf{\textit{Hardness-Concentration}} effect, which refers to focusing on modes with large errors, and the \textbf{\textit{Confidence-Concentration}} effect, which refers to focusing on modes with high student confidence. Through an analysis of how probabilities are reassigned during gradient updates, we observe that these two effects are entangled in FKLD and RKLD, but in extreme forms. Specifically, both are too weak in FKLD, causing the student to fail to concentrate on the target class. In contrast, both are too strong in RKLD, causing the student to overly emphasize the target class while ignoring the broader distributional information from the teacher. To address this imbalance, we propose ABKD, a generic framework with $\alpha$-$\beta$-divergence. Our theoretical results show that ABKD offers a smooth interpolation between FKLD and RKLD, achieving a better trade-off between these effects. Extensive experiments on 17 language/vision datasets with 12 teacher-student settings confirm its efficacy.
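The two endpoint divergences that ABKD interpolates between are simple to write down. A minimal numpy sketch (ours, for illustration; the $\alpha$-$\beta$-divergence family itself is omitted):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fkld(p_teacher, q_student, eps=1e-12):
    """Forward KL(teacher || student): the standard distillation objective."""
    return np.sum(p_teacher * (np.log(p_teacher + eps) - np.log(q_student + eps)), axis=-1)

def rkld(p_teacher, q_student, eps=1e-12):
    """Reverse KL(student || teacher): concentrates on high-confidence modes."""
    return np.sum(q_student * (np.log(q_student + eps) - np.log(p_teacher + eps)), axis=-1)
```

Both divergences are nonnegative, vanish when student and teacher agree, and generally disagree with each other, which is exactly the asymmetry the paper's analysis exploits.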
Paperid:306
Authors:Yiding Chen · Yiyi Zhang · Owen Oertell · Wen Sun
Abstract:
Diffusion models achieve remarkable success in data generation tasks across various domains. However, the iterative sampling process is computationally expensive. Consistency models are proposed to learn consistency functions that map from noise to data directly, which allows one-step fast data generation and multistep sampling to improve sample quality. In this paper, we study the convergence of consistency models when the self-consistency property holds approximately under the training distribution. Our analysis requires only mild data assumptions and applies to a family of forward processes. When the target data distribution has bounded support or has tails that decay sufficiently fast, we show that the samples generated by the consistency model are close to the target distribution in Wasserstein distance; when the target distribution satisfies some smoothness assumption, we show that with an additional perturbation step for smoothing, the generated samples are close to the target distribution in total variation distance. We provide two case studies with commonly chosen forward processes to demonstrate the benefit of multistep sampling.
Paperid:307
Authors:Shiba Biswal · Karthik Elamvazhuthi · Rishi Sonthalia
Abstract:
This paper investigates the use of transformers to approximate the mean-field dynamics of interacting particle systems exhibiting collective behavior. Such systems are fundamental in modeling phenomena across physics, biology, and engineering, including opinion formation, biological networks, and swarm robotics. The key characteristic of these systems is that the particles are indistinguishable, leading to permutation-equivariant dynamics. First, we empirically demonstrate that transformers are well-suited for approximating a variety of mean-field models, including the Cucker-Smale model for flocking and milling, and the mean-field system for training two-layer neural networks. We validate our numerical experiments via mathematical theory. Specifically, we prove that if a finite-dimensional transformer effectively approximates the finite-dimensional vector field governing the particle system, then the $L_\infty$ distance between the \textit{expected transformer} and the infinite-dimensional mean-field vector field can be bounded by a function of the number of particles observed during training. Leveraging this result, we establish theoretical bounds on the distance between the true mean-field dynamics and those obtained using the transformer.
Paperid:308
Authors:Bryan Lincoln Marques de Oliveira · Brandão · Murilo L. da Luz · Luana Martins · telma de lima soares · Luckeciano Melo
Abstract:
Learning effective visual representations enables agents to extract meaningful information from raw sensory inputs, which is essential for generalizing across different tasks. However, evaluating representation learning separately from policy learning remains a challenge with most reinforcement learning (RL) benchmarks. To address this gap, we introduce the Sliding Puzzles Gym (SPGym), a novel benchmark that reimagines the classic 8-tile puzzle with a visual observation space of images sourced from arbitrarily large datasets. SPGym provides precise control over representation complexity through visual diversity, allowing researchers to systematically scale the representation learning challenge while maintaining consistent environment dynamics. Despite the apparent simplicity of the task, our experiments with both model-free and model-based RL algorithms reveal fundamental limitations in current methods. As we increase visual diversity by expanding the pool of possible images, all tested algorithms show significant performance degradation, with even state-of-the-art methods struggling to generalize across different visual inputs while maintaining consistent puzzle-solving capabilities. These results highlight critical gaps in visual representation learning for RL and provide clear directions for improving robustness and generalization in decision-making systems.
Paperid:309
Authors:Pengchong Hu · Zhizhong Han
Abstract:
Jointly estimating camera poses and mapping scenes are fundamental tasks in simultaneous localization and mapping (SLAM) from RGBD images. State-of-the-art methods employ 3D Gaussians to represent a scene, and render these Gaussians through splatting for higher efficiency and better rendering. However, these methods cannot scale up to extremely large scenes, due to inefficient tracking and mapping strategies that need to optimize all 3D Gaussians in the limited GPU memory throughout the training to maintain geometry and color consistency with all previous RGBD observations. To resolve this issue, we propose novel tracking and mapping strategies to work with a novel 3D representation, dubbed view-tied 3D Gaussians, for RGBD SLAM systems. View-tied 3D Gaussians are simplified Gaussians that are tied to depth pixels, without needing to learn locations, rotations, and multi-dimensional variances. Tying Gaussians to views not only significantly saves storage but also allows us to employ many more Gaussians to represent local details in the limited GPU memory. Moreover, our strategies remove the need to keep all Gaussians learnable throughout the training, improving rendering quality and tracking and mapping accuracy. We justify the effectiveness of these designs, and report better performance over the latest methods on widely used benchmarks in terms of rendering and tracking accuracy and scalability.
Paperid:310
Authors:Matthew Wicker · Philip Sosnin · Igor Shilov · Adrianna Janik · Mark Müller · Yves-Alexandre de Montjoye · Adrian Weller · Calvin Tsay
Abstract:
We study private prediction where differential privacy is achieved by adding noise to the outputs of a non-private model. Existing methods rely on noise proportional to the global sensitivity of the model, often resulting in sub-optimal privacy-utility trade-offs compared to private training. We introduce a novel approach for computing dataset-specific upper bounds on prediction sensitivity by leveraging convex relaxation and bound propagation techniques. By combining these bounds with the smooth sensitivity mechanism, we significantly improve the privacy analysis of private prediction compared to global sensitivity-based approaches. Experimental results across real-world datasets in medical image classification and natural language processing demonstrate that our sensitivity bounds can be orders of magnitude tighter than global sensitivity. Our approach provides a strong basis for the development of novel privacy-preserving technologies.
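For context, the global-sensitivity baseline that this line of work improves upon is the classic output-perturbation mechanism: release the score plus Laplace noise of scale sensitivity/ε. A minimal sketch (names ours; the paper's smooth-sensitivity and bound-propagation machinery is not shown) makes the payoff of tighter bounds concrete, since the injected noise shrinks in direct proportion to the sensitivity bound:

```python
import numpy as np

def private_prediction(score, sensitivity, epsilon, rng):
    """Release score + Laplace(sensitivity / epsilon) noise.
    A tighter sensitivity bound directly shrinks the injected noise."""
    return score + rng.laplace(scale=sensitivity / epsilon)
```

At the same privacy budget ε, a bound that is 100x tighter yields noise with 100x smaller standard deviation.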
Paperid:311
Authors:David K Park · Xihaier Luo · Guang Zhao · Seungjun Lee · Miruna Oprescu · Shinjae Yoo
Abstract:
Spatiotemporal learning is challenging due to the intricate interplay between spatial and temporal dependencies, the high dimensionality of the data, and scalability constraints. These challenges are further amplified in scientific domains, where data is often irregularly distributed (e.g., missing values from sensor failures) and high-volume (e.g., high-fidelity simulations), posing additional computational and modeling difficulties. In this paper, we present SCENT, a novel framework for scalable and continuity-informed spatiotemporal representation learning. SCENT unifies interpolation, reconstruction, and forecasting within a single architecture. Built on a transformer-based encoder-processor-decoder backbone, SCENT introduces learnable queries to enhance generalization and a query-wise cross-attention mechanism to effectively capture multi-scale dependencies. To ensure scalability in both data size and model complexity, we incorporate a sparse attention mechanism, enabling flexible output representations and efficient evaluation at arbitrary resolutions. We validate SCENT through extensive simulations and real-world experiments, demonstrating state-of-the-art performance across multiple challenging tasks while achieving superior scalability.
Paperid:312
Authors:Piotr Kubaty · Bartosz Wójcik · Bartłomiej Krzepkowski · Monika Michaluk · Tomasz Trzcinski · Jary Pomponi · Kamil Adamczewski
Abstract:
Early exits enable the network's forward pass to terminate early by attaching trainable internal classifiers to the backbone network. Existing early-exit methods typically adopt either a joint training approach, where the backbone and exit heads are trained simultaneously, or a disjoint approach, where the heads are trained separately. However, the implications of this choice are often overlooked, with studies typically adopting one approach without adequate justification. This choice influences training dynamics, and its impact remains largely unexplored. In this paper, we introduce a set of metrics to analyze early-exit training dynamics and guide the choice of training strategy. We demonstrate that the conventionally used joint and disjoint regimes yield suboptimal performance. To address these limitations, we propose a mixed training strategy: the backbone is trained first, followed by the training of the entire multi-exit network. Through comprehensive evaluations of training strategies across various architectures, datasets, and early-exit methods, we present the strengths and weaknesses of early-exit training strategies. In particular, we show consistent improvements in performance and efficiency using the proposed mixed strategy.
Paperid:313
Authors:Amber Xie · Oleh Rybkin · Dorsa Sadigh · Chelsea Finn
Abstract:
Recent progress in imitation learning has been enabled by policy architectures that scale to complex visuomotor tasks, multimodal distributions, and large datasets. However, these methods often rely on learning from large amounts of expert demonstrations. To address these shortcomings, we propose Latent Diffusion Planning (LDP), a modular approach consisting of a planner which can leverage action-free demonstrations, and an inverse dynamics model which can leverage suboptimal data, both operating over a learned latent space. First, we learn a compact latent space through a variational autoencoder, enabling effective forecasting of future states in image-based domains. Then, we train a planner and an inverse dynamics model with diffusion objectives. By separating planning from action prediction, LDP can benefit from the denser supervision signals of suboptimal and action-free data. On simulated visual robotic manipulation tasks, LDP outperforms state-of-the-art imitation learning approaches, as they cannot leverage such additional data.
Paperid:314
Authors:Shuqing Luo · Pingzhi Li · Jie Peng · Yang Zhao · Yu Cao · Yu Cheng · Tianlong Chen
Abstract:
Mixture-of-experts (MoE) architectures could achieve impressive computational efficiency with expert parallelism, which relies heavily on all-to-all communication across devices. Unfortunately, such communication overhead typically constitutes a significant portion of the total runtime, hampering the scalability of distributed training and inference for modern MoE models (consuming over 40% runtime in large-scale training). In this paper, we first define $\textit{collaborative communication}$ to illustrate this intrinsic limitation, and then propose system- and algorithm-level innovations to reduce communication costs. Specifically, given a pair of experts co-activated by one token, we call them $\textit{collaborated}$, which comprises two cases, $\textit{intra-}$ and $\textit{inter-collaboration}$, depending on whether they are kept on the same device. Our pilot investigations reveal that augmenting the proportion of intra-collaboration can accelerate expert parallelism at scale. It motivates us to strategically $\underline{\texttt{o}}$ptimize $\underline{\texttt{c}}$ollaborative $\underline{\texttt{c}}$omm$\underline{\texttt{u}}$nication for acce$\underline{\texttt{l}}$era$\underline{\texttt{t}}$ed MoE training and inference, dubbed $\textbf{\texttt{Occult}}$. Our designs are capable of $\underline{either}$ delivering exact results with reduced communication cost, $\underline{or}$ controllably minimizing the cost with collaboration pruning, materialized by modified fine-tuning. Comprehensive experiments on various MoE-LLMs demonstrate that $\texttt{Occult}$ can be faster than popular state-of-the-art inference or training frameworks (over 50% speedup across multiple tasks and models) with comparable or superior quality compared to standard fine-tuning. Code will be available upon acceptance.
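The intra-/inter-collaboration dichotomy defined above can be made concrete with a small counting routine over the router's per-token expert assignments (a sketch with hypothetical names, not the $\texttt{Occult}$ implementation):

```python
from itertools import combinations

def collaboration_stats(routed_experts, expert_device):
    """Count collaborated expert pairs: intra- if the co-activated pair lives
    on the same device, inter- if the pair spans devices (and thus needs
    all-to-all communication)."""
    intra = inter = 0
    for experts in routed_experts:               # experts co-activated by one token
        for a, b in combinations(sorted(set(experts)), 2):
            if expert_device[a] == expert_device[b]:
                intra += 1
            else:
                inter += 1
    return intra, inter
```

Raising the intra count relative to the inter count is exactly what reduces the all-to-all traffic the abstract targets.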
Paperid:315
Authors:Flavio Petruzzellis · Cristina Cornelio · Pietro Lió
Abstract:
Large Language Models (LLMs) have shown promise as robotic planners but often struggle with long-horizon and complex tasks, especially in specialized environments requiring external knowledge. While hierarchical planning and Retrieval-Augmented Generation (RAG) address some of these challenges, they remain insufficient on their own, and a deeper integration is required for achieving more reliable systems. To this end, we propose a neuro-symbolic approach that enhances LLM-based planners with Knowledge Graph-based RAG for hierarchical plan generation. This method decomposes complex tasks into manageable subtasks, further expanded into executable atomic action sequences. To ensure formal correctness and proper decomposition, we integrate a symbolic validator, which also functions as a failure detector by aligning expected and observed world states. Our evaluation against baseline methods demonstrates the consistent, significant advantages of integrating hierarchical planning, symbolic verification, and RAG across tasks of varying complexity and different LLMs. Additionally, our experimental setup and novel metrics not only validate our approach for complex planning but also serve as a tool for assessing LLMs' reasoning and compositional capabilities.
Paperid:316
Authors:Sunil Madhow · Dheeraj Baby · Yu-Xiang Wang
Abstract:
In order to attain optimal rates, state-of-the-art algorithms for non-parametric regression require that a hyperparameter be tuned according to the smoothness of the ground truth (Tibshirani, 2014). This amounts to an assumption of oracle access to certain features of the data-generating process. We present a parameter-free algorithm for offline non-parametric regression over $TV_1$-bounded functions. By feeding offline data into an optimal online denoising algorithm styled after (Baby et al., 2021), we are able to use change-points to adaptively select knots that respect the geometry of the underlying ground truth. We call this procedure AKORN (Adaptive Knots generated Online for RegressioN splines). By combining forward and backward passes over the data, we obtain an estimator whose empirical performance is close to Trend Filtering (Kim et al., 2009; Tibshirani, 2014), even when we provide the latter with oracle knowledge of the ground truth's smoothness.
Paperid:317
Authors:Lokesh Veeramacheneni · Moritz Wolter · Hilde Kuehne · Juergen Gall
Abstract:
Modern methods for fine-tuning a Vision Transformer (ViT), like Low-Rank Adaptation (LoRA) and its variants, demonstrate impressive performance. However, these methods ignore the high-dimensional nature of Multi-Head Attention (MHA) weight tensors. To address this limitation, we propose Canonical Rank Adaptation (CaRA). CaRA leverages tensor mathematics, first by tensorising the transformer into two different tensors: one for projection layers in MHA and the other for feed-forward layers. Second, the tensorised formulation is fine-tuned using low-rank adaptation in Canonical-Polyadic Decomposition (CPD) form. Employing CaRA efficiently minimizes the number of trainable parameters. Experimentally, CaRA outperforms existing Parameter-Efficient Fine-Tuning (PEFT) methods on visual classification benchmarks such as the Visual Task Adaptation Benchmark (VTAB)-1k and Fine-Grained Visual Categorization (FGVC).
Paperid:318
Authors:Suryanarayana Sankagiri · Jalal Etesami · Matthias Grossglauser
Abstract:
This paper provides a theoretical analysis of a new learning problem for recommender systems where users provide feedback by comparing pairs of items instead of rating them individually. We assume that comparisons stem from latent user and item features, which reduces the task of predicting preferences to learning these features from comparison data. Similar to the classical matrix completion problem, the main challenge in this learning task is that the resulting loss function is nonconvex. Our analysis shows that the loss function exhibits (restricted) strong convexity near the true solution, which ensures gradient-based methods converge exponentially, given a warm start. Importantly, this result holds in a sparse data regime, where each user compares only a few pairs of items. Our main technical contribution is to extend certain concentration inequalities commonly used in matrix completion to our model. Our work demonstrates that learning personalized recommendations from comparison data is computationally and statistically efficient.
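A toy version of this learning problem fits in a few lines: latent user and item vectors, a logistic likelihood on pairwise comparisons, and plain gradient descent. This is our illustrative reconstruction of the setup (model and names assumed, not the paper's exact loss or algorithm):

```python
import numpy as np

def nll(U, V, comps):
    """Mean negative log-likelihood: user u prefers item i to item j
    with probability sigmoid(<U[u], V[i] - V[j]>)."""
    s = np.array([U[u] @ (V[i] - V[j]) for u, i, j in comps])
    return np.mean(np.log1p(np.exp(-s)))

def gd_step(U, V, comps, lr=0.1):
    """One gradient-descent step on the comparison loss."""
    gU, gV = np.zeros_like(U), np.zeros_like(V)
    for u, i, j in comps:
        s = U[u] @ (V[i] - V[j])
        w = -1.0 / (1.0 + np.exp(s))   # d/ds log(1 + e^{-s}) = -sigmoid(-s)
        gU[u] += w * (V[i] - V[j])
        gV[i] += w * U[u]
        gV[j] -= w * U[u]
    n = len(comps)
    return U - lr * gU / n, V - lr * gV / n
```

Although this loss is jointly nonconvex in (U, V), gradient descent from a reasonable initialization steadily decreases it, which is the behavior the paper's restricted-strong-convexity analysis explains near the true solution.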
Paperid:319
Authors:Ariel Procaccia · Benjamin Schiffer · Shirley Zhang
Abstract:
A key challenge in training Large Language Models (LLMs) is properly aligning them with human preferences. Reinforcement Learning with Human Feedback (RLHF) uses pairwise comparisons from human annotators to train reward functions and has emerged as a popular alignment method. However, input datasets in RLHF can be unbalanced due to adversarial manipulation or inadvertent repetition. Therefore, we want RLHF algorithms to perform well even when the set of alternatives is not uniformly distributed. Drawing on insights from social choice theory, we introduce robustness to approximate clones, a desirable property of RLHF algorithms which requires that adding near-duplicate alternatives does not significantly change the learned reward function. We first demonstrate that the standard RLHF algorithm based on regularized maximum likelihood estimation (MLE) fails to satisfy this property. We then propose the weighted MLE, a new RLHF algorithm that modifies the standard regularized MLE by weighting alternatives based on their similarity to other alternatives. This new algorithm guarantees robustness to approximate clones while preserving desirable theoretical properties.
Paperid:320
Authors:Laure Ciernik · Lorenz Linhardt · Marco Morik · Jonas Dippel · Simon Kornblith · Lukas Muttenthaler
Abstract:
The Platonic Representation Hypothesis claims that recent foundation models are converging to a shared representation space as a function of their downstream task performance, irrespective of the objectives and data modalities used to train these models. Representational similarity is generally measured for individual datasets and is not necessarily consistent across datasets. Thus, one may wonder whether this convergence of model representations is confounded by the datasets commonly used in machine learning. Here, we propose a systematic way to measure how representational similarity between models varies with the set of stimuli used to construct the representations. We find that the objective function is a crucial factor in determining the consistency of representational similarities across datasets. Specifically, self-supervised vision models learn representations whose relative pairwise similarities generalize better from one dataset to another compared to those of image classification or image-text models. Moreover, the correspondence between representational similarities and the models' task behavior is dataset-dependent, being most strongly pronounced for single-domain datasets. Our work provides a framework for analyzing similarities of model representations across datasets and linking those similarities to differences in task behavior.
Paperid:321
Authors:Erica Zhang · Fangzhao Zhang · Mert Pilanci
Abstract:
Active learning methods aim to improve sample complexity in machine learning. In this work, we investigate an active learning scheme via a novel gradient-free cutting-plane training method for ReLU networks of arbitrary depth and develop a convergence theory. We demonstrate, for the first time, that cutting-plane algorithms, traditionally used in linear models, can be extended to deep neural networks despite their nonconvexity and nonlinear decision boundaries. Moreover, this training method induces the first deep active learning scheme known to achieve convergence guarantees, revealing a geometric contraction rate of the feasible set. We demonstrate the effectiveness of our proposed active learning method against popular deep active learning baselines via both synthetic data experiments and sentiment classification tasks on real datasets.
Paperid:322
Authors:Sidak Pal Singh · Hossein Mobahi · Atish Agarwala · Yann Nicolas Dauphin
Abstract:
Curvature regularization techniques like Sharpness Aware Minimization (SAM) have shown great promise in improving generalization on vision tasks. However, we find that SAM performs poorly in domains like natural language processing (NLP), often degrading performance even with twice the compute budget. We investigate the discrepancy across domains and find that in the NLP setting, SAM is dominated by regularization of the logit statistics --- instead of improving the geometry of the function itself. We use this observation to develop an alternative algorithm we call Functional SAM, which regularizes curvature only through modification of the statistics of the overall function implemented by the neural network, and avoids spurious minimization through logit manipulation. Furthermore, we argue that preconditioning the SAM perturbation also prevents spurious minimization, and when combined with Functional SAM, it gives further improvements. Our proposed algorithms show improved performance over AdamW and SAM baselines when trained for an equal number of steps, in both fixed-length and Chinchilla-style training settings, at various model scales (including billion-parameter scale). On the whole, our work highlights the importance of more precise characterizations of sharpness in broadening the applicability of curvature regularization to large language models (LLMs).
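For reference, the base SAM update that the abstract's Functional SAM modifies takes an ascent step of radius ρ along the normalized gradient, then descends using the gradient computed at that perturbed point. A minimal sketch of standard SAM (not the proposed Functional SAM):

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM update: perturb toward the (first-order) worst nearby point,
    then apply the gradient computed there."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent step of radius rho
    return w - lr * grad_fn(w + eps)             # descend with the perturbed gradient
```

On a simple quadratic, iterating this update drives the parameters toward the minimum, with the extra gradient evaluation per step being the source of SAM's doubled compute cost.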
Paperid:323
Authors:Yunfeng Liao · Jiawen Guan · Xiucheng Li
Abstract:
Deep models have recently achieved remarkable performance in solving partial differential equations (PDEs). Previous methods mostly focus on PDEs arising in Euclidean spaces, with less emphasis on general manifolds with rich geometry. Several proposals attempt to account for the geometry by exploiting the spatial coordinates but overlook the underlying intrinsic geometry of manifolds. In this paper, we propose a Curvature-aware Graph Attention for PDEs on manifolds by exploring important intrinsic geometric quantities such as curvature and the discrete gradient operator. It is realized via parallel transport and tensor fields on manifolds. To accelerate computation, we present three curvature-oriented graph embedding approaches and derive closed-form parallel transport equations, and a sub-tree partition method is also developed to promote parameter sharing. Our proposed curvature-aware attention can be used as a replacement for vanilla attention, and experiments show that it significantly improves the performance of existing methods for solving PDEs on manifolds. Our code is available at https://anonymous.4open.science/r/icml2025-5376/.
Paperid:324
Authors:Ayça Takmaz · Cristiano Saltori · Neehar Peri · Tim Meinhardt · Riccardo de Lutio · Laura Leal-Taixé · Aljosa Osep
Abstract:
Panoptic scene completion methods estimate dense object and scene geometry from sparse Lidar data. However, contemporary methods can only complete and recognize objects from a closed vocabulary, labeled in existing Lidar datasets. To overcome this limitation, we introduce CAL (Complete Anything in Lidar). Our zero-shot approach uses the temporal context from multi-modal sensor sequences to mine object shapes and semantic features that are then distilled into a Lidar-only instance-level completion and recognition model. Although we only mine partial shape completions, we find that our distilled model learns to infer full object shapes from multiple such partial observations across the dataset. We show that our model can be prompted on standard benchmarks for Semantic and Panoptic Scene Completion, localize objects as (amodal) 3D bounding boxes, and recognize objects beyond fixed class vocabularies.
Paperid:325
Authors:Louis Béthune · David Grangier · Dan Busbridge · Eleonora Gualdoni · Marco Cuturi · Pierre Ablin
Abstract:
A widespread strategy to obtain a language model that performs well on a target domain is to fine-tune a pretrained model to perform unsupervised next-token prediction on data from that target domain. Fine-tuning presents two challenges: (i) if the amount of target data is limited, as in most practical applications, the model will quickly overfit, and (ii) the model will drift away from the original model, forgetting the pretraining data and the generic knowledge that comes with it. Our goal is to derive scaling laws that quantify these two phenomena for various target domains, amounts of available target data, and model scales. We measure the efficiency of injecting pretraining data into the fine-tuning data mixture to avoid forgetting and mitigate overfitting. A key practical takeaway from our study is that injecting as little as 1% of pretraining data into the fine-tuning data mixture prevents the model from forgetting the pretraining set.
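The 1%-injection takeaway corresponds to a trivially simple data-loading change: with small probability, draw each training example from the pretraining pool instead of the fine-tuning pool. An illustrative sketch (names ours):

```python
import random

def mixed_stream(finetune_pool, pretrain_pool, pretrain_frac=0.01, n=10000, seed=0):
    """Sample a fine-tuning stream in which roughly `pretrain_frac` of the
    examples come from the pretraining pool, to mitigate forgetting."""
    rng = random.Random(seed)
    def pick():
        pool = pretrain_pool if rng.random() < pretrain_frac else finetune_pool
        return rng.choice(pool)
    return [pick() for _ in range(n)]
```

Over a long stream, the realized pretraining fraction concentrates around the requested 1%.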
Paperid:326
Authors:Ekaterina Borodich · Alexander Gasnikov · Dmitry Kovalev
Abstract:
We revisit the smooth convex-concave bilinearly-coupled saddle-point problem of the form $\min_x\max_y f(x) + \langle y,\mathbf{B} x\rangle - g(y)$. In the highly specific case where function $f(x)$ is strongly convex and function $g(y)$ is affine, or both functions are affine, there exist lower bounds on the number of gradient evaluations and matrix-vector multiplications required to solve the problem, as well as matching optimal algorithms. A notable aspect of these algorithms is that they are able to attain linear convergence, i.e., the number of iterations required to solve the problem is proportional to $\log(1/\epsilon)$. However, the class of bilinearly-coupled saddle-point problems for which linear convergence is possible is much wider and can involve general smooth non-strongly convex functions $f(x)$ and $g(y)$. Therefore, *we develop the first lower complexity bounds and matching optimal linearly converging algorithms for this problem class*. Our lower complexity bounds are much more general, but they cover and unify the existing results in the literature. On the other hand, our algorithm implements the separation of complexities, which, for the first time, enables the simultaneous achievement of both optimal gradient evaluation and matrix-vector multiplication complexities, resulting in the best theoretical performance to date.
Paperid:327
Authors:Zecheng Tang · Zechen Sun · Juntao Li · Zhu Qiaoming · Min Zhang
Abstract:
Long-context models (LCMs) have shown great potential in processing long input sequences (even more than 100M tokens) conveniently and effectively. With significant progress, recent research has pointed out that LCMs can accurately locate token-level salient information within the context. Yet, the generation performance of these LCMs is far from satisfactory and can result in misaligned responses, such as hallucinations. To enhance the generation capability of LCMs, existing works have investigated the effects of data size and quality for both pre-training and instruction tuning. Though achieving meaningful improvements, previous methods fall short in either effectiveness or efficiency. In this paper, we introduce LOGO (Long cOntext aliGnment via efficient preference Optimization), the first training strategy to introduce preference optimization for long-context alignment. To overcome the GPU memory-bound issue caused by long sequences, LOGO employs a reference-free preference optimization strategy and adopts a position synthesis method to construct the training data. By training with only 0.3B data on a single 8 x A800 GPU machine for 16 hours, LOGO allows the Llama-3-8B-Instruct-80K model to achieve performance comparable to GPT-4 in real-world long-context tasks while preserving the model's original capabilities on other tasks, e.g., language modeling and MMLU. Moreover, LOGO can extend the model's context window size while enhancing its generation performance.
Paperid:328
Authors:Zhiwei XU · Kun Hu · Xin Xin · Weiliang Meng · Yiwei Shi · Hangyu Mao · Bin Zhang · dapeng Li · Jiangjin Yin
Abstract:
Generalizing multi-agent reinforcement learning (MARL) to accommodate variations in problem configurations remains a critical challenge in real-world applications, where even subtle differences in task setups can cause pre-trained policies to fail. To address this, we propose Context-Aware Identity Generation (CAID), a novel framework to enhance MARL performance under the Contextual MARL (CMARL) setting. CAID dynamically generates unique agent identities through an agent identity decoder built on a causal Transformer architecture. These identities provide contextualized representations that align corresponding agents across similar problem variants, facilitating policy reuse and improving sample efficiency. Furthermore, the action regulator in CAID incorporates these agent identities into the action-value space, enabling seamless adaptation to varying contexts. Extensive experiments on CMARL benchmarks demonstrate that CAID significantly outperforms existing approaches by enhancing both sample efficiency and generalization across diverse context variants.
Paperid:329
Authors:Tianjie Ju · Yi Hua · Hao Fei · Zhenyu Shao · Yubin Zheng · Haodong Zhao · Mong-Li Lee · Wynne Hsu · Zhuosheng Zhang · Gongshen Liu
Abstract:
Multimodal Large Language Models (MLLMs) have exhibited remarkable performance on various vision-language tasks such as Visual Question Answering (VQA). Despite accumulating evidence of privacy concerns associated with task-relevant content, it remains unclear whether MLLMs inadvertently memorize private content that is entirely irrelevant to the training tasks. In this paper, we investigate how randomly generated task-irrelevant private content can become spuriously correlated with downstream objectives due to partial mini-batch training dynamics, thus causing inadvertent memorization. Concretely, we randomly insert task-irrelevant watermarks into VQA fine-tuning images at varying probabilities and propose a novel probing framework to determine whether MLLMs have inadvertently encoded such content. Our experiments reveal that MLLMs exhibit notably different training behaviors in partial mini-batch settings with task-irrelevant watermarks embedded. Furthermore, through layer-wise probing, we demonstrate that MLLMs trigger distinct representational patterns when encountering previously seen task-irrelevant knowledge, even if this knowledge does not influence their output during prompting. Our code is available at anonymity.
Paperid:330
Authors:Nadav Timor · Jonathan Mamou · Daniel Korat · Moshe Berchansky · Oren Pereg · Gaurav Jain · Moshe Wasserblat · David Harel
Abstract:
Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass. However, existing SD approaches require the drafter and target models to share the same vocabulary, thus limiting the pool of possible drafters and often necessitating the training of a drafter from scratch. We present three new SD methods that remove this shared-vocabulary constraint. All three methods preserve the target distribution (i.e., they are lossless) and work with off-the-shelf models without requiring additional training or modifications. Empirically, on summarization and long-context tasks, our algorithms achieve significant speedups over standard autoregressive decoding. By enabling any off-the-shelf model to serve as a drafter and requiring no retraining, this work substantially broadens the applicability of the SD framework in practice.
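For context, the standard shared-vocabulary verification step that makes speculative decoding lossless can be sketched as follows. This is the classical accept/resample rule the new methods generalize beyond, not the paper's vocabulary-mismatch algorithms themselves.

```python
import random

def verify_draft_token(p, q, x, rng=None):
    """Classical lossless speculative-decoding verification step.

    p, q: target and drafter distributions over the same token ids (dicts);
    x: token proposed by the drafter. Accept x with probability
    min(1, p[x]/q[x]); otherwise resample from the residual distribution
    max(p - q, 0), renormalized. This preserves the target distribution
    exactly, which is what "lossless" means above. Sketch only."""
    rng = rng or random.Random(0)
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    residual = {t: max(p[t] - q[t], 0.0) for t in p}
    z = sum(residual.values())
    r = rng.random() * z
    for t, w in residual.items():
        r -= w
        if r <= 0:
            return t
    return t  # fall through on floating-point rounding
```

When the drafter already matches the target (p == q), every draft token is accepted, so the speedup is largest for well-aligned drafters.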
Paperid:331
Authors:Chenguang Wang · Kaiyuan Cui · Weichen Zhao · Tianshu Yu
Abstract:
Sampling from binary quadratic distributions (BQDs) is a fundamental but challenging problem in discrete optimization and probabilistic inference. Previous work established theoretical guarantees for stochastic localization (SL) in continuous domains, where MCMC methods efficiently estimate the required posterior expectations during SL iterations. However, achieving similar convergence guarantees for discrete MCMC samplers in posterior estimation presents unique theoretical challenges. In this work, we present the first application of SL to general BQDs, proving that after a certain number of iterations, the external field of the posterior distributions constructed by SL tends to infinity almost everywhere; the posteriors therefore satisfy Poincaré inequalities with probability close to 1, leading to polynomial-time mixing. This theoretical breakthrough enables efficient sampling from general BQDs, even those that may not originally possess fast mixing properties. Furthermore, our analysis, covering a broad range of discrete MCMC samplers based on Glauber dynamics and Metropolis-Hastings algorithms, demonstrates the wide applicability of our theoretical framework. Experiments on instances with quadratic unconstrained binary objectives, including maximum independent set, maximum cut, and maximum clique problems, demonstrate consistent improvements in sampling efficiency across different discrete MCMC samplers.
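For readers unfamiliar with the samplers mentioned above, a single Glauber-dynamics update for a binary quadratic distribution looks like this. It is a generic textbook sketch of the inner sampler, not the paper's stochastic-localization procedure; the parameterization (spins in {-1, +1}, symmetric coupling matrix) is an assumption.

```python
import math
import random

def glauber_step(x, J, h, rng):
    """One Glauber-dynamics update for the binary quadratic distribution
        pi(x) ∝ exp( x^T J x / 2 + h^T x ),  x in {-1, +1}^n.
    Picks a random site and resamples it from its exact conditional.
    J is assumed symmetric with zero diagonal. Generic sketch only."""
    n = len(x)
    i = rng.randrange(n)
    # local field at site i; a large field pins x_i, which is why a
    # divergent external field (as in the abstract) yields fast mixing
    field = h[i] + sum(J[i][j] * x[j] for j in range(n) if j != i)
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * field))  # P(x_i = +1 | rest)
    x[i] = 1 if rng.random() < p_plus else -1
    return x
```

With a strong positive external field and no coupling, the chain pins every spin to +1 after a few sweeps, illustrating the mechanism behind the mixing guarantee.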
Paperid:332
Authors:Yaolun Zhang · Xiaogeng Liu · Chaowei Xiao
Abstract:
Large Language Models (LLMs) can solve various practical tasks via a multi-agent system. However, existing human-designed multi-agent systems can only adapt to a limited number of pre-defined scenarios. Current auto-designed methods also have several drawbacks, including no tool support, reliance on external training data, and inflexible communication structures. Therefore, we propose \textbf{MetaAgent}, a novel framework to automatically generate a multi-agent system based on a \textbf{finite state machine}. Given a task description, MetaAgent will design a multi-agent system and polish it through an optimization algorithm that merges redundant states. When the multi-agent system is deployed, the finite state machine controls the agents' actions and the state transitions. To evaluate our framework, we conduct experiments on both text-based tasks and practical tasks; the results indicate that the generated multi-agent system surpasses other auto-designed methods and achieves performance comparable to human-designed multi-agent systems polished for those specific tasks.
Paperid:333
Authors:Tianci Liu · Ruirui Li · Zihan Dong · Hui Liu · Xianfeng Tang · Qingyu Yin · Linjun Zhang · Haoyu Wang · Jing Gao
Abstract:
Large language models (LLMs) have achieved remarkable performance on various natural language tasks. However, they are trained on static corpora, and their knowledge can become outdated quickly in the fast-changing world. This motivates the development of knowledge editing (KE) to update specific knowledge in LLMs without changing unrelated knowledge or compromising their pre-trained capabilities. Previous efforts sought to update a small number of parameters of an LLM and proved effective for making selective updates. Nonetheless, the edited LLM often exhibits a degraded ability to reason about the new knowledge. In this work, we identify a key issue: \textit{heterogeneous token overfitting} (HTO), where the LLM overfits different tokens in the provided knowledge at varying rates. To tackle this, we propose OVERTONE, a token-level smoothing method that mitigates HTO by adaptively refining the target distribution. Theoretically, OVERTONE offers better parameter updates with negligible computation overhead. It also induces an implicit DPO but does not require preference data pairs. Extensive experiments across four editing methods, two LLMs, and diverse scenarios demonstrate the effectiveness and versatility of our method.
Paperid:334
Authors:Yizi Zhang · Yanchen Wang · Mehdi Azabou · Alexandre Andre · Zixuan Wang · Hanrui Lyu · International Brain Laboratory · Eva Dyer · Liam Paninski · Cole Hurwitz
Abstract:
Recent work has demonstrated that large-scale, multi-animal models are powerful tools for characterizing the relationship between neural activity and behavior. Current large-scale approaches, however, focus exclusively on either predicting neural activity from behavior (encoding) or predicting behavior from neural activity (decoding), limiting their ability to capture the bidirectional relationship between neural activity and behavior. To bridge this gap, we introduce a multimodal, multi-task model that enables simultaneous Neural Encoding and Decoding at Scale (NEDS). Central to our approach is a novel multi-task-masking strategy, which alternates between neural, behavioral, within-modality, and cross-modality masking. We pretrain our method on the International Brain Laboratory (IBL) repeated site dataset, which includes recordings from 83 animals performing the visual decision-making task. In comparison to other large-scale modeling approaches, we demonstrate that NEDS achieves state-of-the-art performance for both encoding and decoding when pretrained on multi-animal data and then fine-tuned on new animals. Surprisingly, NEDS's learned embeddings exhibit emergent properties: even without explicit training, they are highly predictive of the brain regions in each recording. Altogether, our approach is a step towards a foundation model of the brain that enables seamless translation between neural activity and behavior.
Paperid:335
Authors:Ni Mu · Hao Hu · Xiao Hu · Yiqin Yang · Bo XU · Qing-Shan Jia
Abstract:
Preference-based reinforcement learning (PbRL) bypasses explicit reward engineering by inferring reward functions from human preference comparisons, enabling better alignment with human intentions. However, humans often struggle to label a clear preference between similar segments, reducing label efficiency and limiting PbRL's real-world applicability. To address this, we propose an offline PbRL method: Contrastive LeArning for ResolvIng Ambiguous Feedback (CLARIFY), which learns a trajectory embedding space that incorporates preference information, ensuring clearly distinguished segments are spaced apart, thus facilitating the selection of more unambiguous queries. Extensive experiments demonstrate that CLARIFY outperforms baselines in settings with both non-ideal teachers and real human feedback. Our approach not only selects more distinguishable queries but also learns meaningful trajectory embeddings.
Paperid:336
Authors:Chenyin Gao · Shu Yang · Mingyang Shan · Wenyu Ye · Ilya Lipkovich · Douglas Faries
Abstract:
Censored survival data are common in clinical trials, but small control groups can pose challenges, particularly in rare diseases or where balanced randomization is impractical. Recent approaches leverage external controls from historical studies or real-world data to strengthen treatment evaluation for survival outcomes. However, using external controls directly may introduce biases due to data heterogeneity. We propose a doubly protected estimator for the treatment-specific restricted mean survival time difference that is more efficient than trial-only estimators and mitigates biases from external data. Our method adjusts for covariate shifts via doubly robust estimation and addresses outcome drift using the DR-Learner for selective borrowing. The approach incorporates machine learning to approximate survival curves and detect outcome drifts without strict parametric assumptions, borrowing only comparable external controls. Extensive simulation studies and a real-data application evaluating the efficacy of Galcanezumab in mitigating migraine headaches have been conducted to illustrate the effectiveness of our proposed framework.
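As background for the estimand above, the restricted mean survival time (RMST) up to a horizon tau is the area under the survival curve on [0, tau]. A minimal Kaplan-Meier-based sketch follows; it illustrates the estimand only, not the proposed doubly protected estimator, and the handling of ties is simplified.

```python
def km_rmst(times, events, tau):
    """Kaplan-Meier survival estimate and RMST(tau) = integral_0^tau S(t) dt.

    times: observed follow-up times; events: 1 if the event occurred,
    0 if the observation was censored. Simplified sketch (no tie handling)."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    s, rmst, last_t = 1.0, 0.0, 0.0
    at_risk = len(times)
    for i in order:
        t = min(times[i], tau)
        rmst += s * (t - last_t)      # accumulate area under the step curve
        last_t = t
        if times[i] > tau:
            break
        if events[i]:                 # event: Kaplan-Meier drop
            s *= 1.0 - 1.0 / at_risk
        at_risk -= 1                  # censored or event: leaves risk set
    rmst += s * (tau - last_t) if last_t < tau else 0.0
    return s, rmst
```

The treatment-specific RMST difference the abstract targets is simply RMST(tau) under treatment minus RMST(tau) under control.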
Paperid:337
Authors:Yanxiang Ma · Zixuan Huang · Minjing Dong · Shan You · Chang Xu
Abstract:
Random defense represents a promising strategy to protect neural networks from adversarial attacks. Most of these methods enhance robustness by injecting randomness into the data, increasing uncertainty for attackers. However, this randomness can reduce the generalization capacity of the defense, as defense performance can be sensitive to the hyperparameters of the noise added to the data, making it difficult to generalize across different datasets. Additionally, injecting randomness typically reduces natural accuracy, creating a delicate trade-off between robustness and natural accuracy that is seldom studied in random defense. In this work, we propose incorporating randomness into the network structure instead of the data input by designing a stochastic deformable convolution, where a random mask replaces the convolutional offset. This process promotes data independence, enhancing generalization across datasets. To study the trade-off, we conduct a theoretical analysis of both robust and clean accuracy from the perspective of gradient cosine similarity and natural inference. Based on the analysis, we reformulate adversarial training in our random defense framework. Extensive experiments show that our method achieves SOTA adversarial robustness and clean accuracy compared with other random defense methods.
Paperid:338
Authors:Tao Tang · Lijun Zhou · Pengkun Hao · Zihang He · Kalok Ho · Shuo Gu · Zhihui Hao · Haiyang Sun · Kun Zhan · Peng Jia · XianPeng Lang · Xiaodan Liang
Abstract:
3D multiple object tracking (MOT) plays a crucial role in autonomous driving perception. Recent end-to-end query-based trackers simultaneously detect and track objects, which have shown promising potential for the 3D MOT task. However, existing methods are still in the early stages of development and lack systematic improvements, failing to track objects in certain complex scenarios, such as occlusions and small target objects. In this paper, we first summarize the current end-to-end 3D MOT framework by decomposing it into three constituent parts: query initialization, query propagation, and query matching. Then we propose corresponding improvements, which lead to a strong yet simple tracker: S2-Track. Specifically, for query initialization, we present 2D-Prompted Query Initialization, which leverages predicted 2D object and depth information to prompt an initial estimate of the object's 3D location. For query propagation, we introduce an Uncertainty-aware Probabilistic Decoder to capture the uncertainty of complex environments in object prediction with probabilistic attention. For query matching, we propose a Hierarchical Query Denoising strategy to enhance training robustness and convergence. As a result, our S2-Track achieves state-of-the-art performance on the nuScenes benchmark, i.e., 66.3% AMOTA on the test split, surpassing the previous best end-to-end solution by a significant margin of 8.9% AMOTA.
Paperid:339
Authors:Yafei YANG · Zihui Zhang · Bo Yang
Abstract:
We study the challenging problem of unsupervised multi-object segmentation on single images. Existing methods, which rely on image reconstruction objectives to learn objectness or leverage pretrained image features to group similar pixels, often succeed only in segmenting simple synthetic objects or discovering a limited number of real-world objects. In this paper, we introduce OCN, a novel two-stage pipeline designed to identify many complex objects in real-world images. The key to our approach involves explicitly learning three levels of carefully defined object-centric representations in the first stage. Subsequently, our multi-object reasoning module utilizes these learned object priors to discover multiple objects in the second stage. Notably, this reasoning module is entirely network-free and does not require human labels. Extensive experiments demonstrate that OCN significantly outperforms all existing unsupervised methods across 6 real-world benchmark datasets, including the challenging COCO dataset, achieving state-of-the-art object segmentation results. Remarkably, our method excels in crowded images where all baselines collapse.
Paperid:340
Authors:Zhengyu Zhou · Weiwei Liu
Abstract:
Continuous Normalizing Flows (CNFs) have proven to be a highly efficient technique for generative modeling of complex data since the introduction of Flow Matching (FM). The core of FM is to learn the constructed velocity fields of CNFs through deep least squares regression. Despite its empirical effectiveness, theoretical investigations of FM remain limited. In this paper, we present the first end-to-end error analysis of CNFs built upon FM. Our analysis shows that for general target distributions with bounded support, the generated distribution of FM is guaranteed to converge to the target distribution in the sense of the Wasserstein-2 distance. Furthermore, the convergence rate is significantly improved under an additional mild Lipschitz condition of the target score function.
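The "constructed velocity fields" and "deep least squares regression" mentioned above can be made concrete with the standard conditional flow-matching construction on a linear path. This is a generic FM sketch under common conventions (Gaussian source, linear interpolation path), not the specific setup analyzed in the paper.

```python
import numpy as np

def cfm_batch(x1, rng):
    """Build one conditional flow-matching regression batch on the linear path
        x_t = (1 - t) x_0 + t x_1,  x_0 ~ N(0, I),  t ~ U[0, 1].
    The regression target is the constructed velocity u = x_1 - x_0; a model
    v_theta(x_t, t) is then fit to u by least squares. Generic sketch."""
    x0 = rng.standard_normal(x1.shape)
    t = rng.uniform(0.0, 1.0, size=(x1.shape[0], 1))
    xt = (1.0 - t) * x0 + t * x1
    u = x1 - x0
    return xt, t, u

def fm_loss(v_pred, u):
    # the deep least-squares regression objective E ||v_theta(x_t, t) - u||^2
    return float(np.mean((v_pred - u) ** 2))
```

A useful sanity check: following the target velocity from x_t for the remaining time 1 - t recovers the data point exactly, i.e. x_t + (1 - t) u = x_1.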
Paperid:341
Authors:Mang Ning · Mingxiao Li · Jianlin Su · Jia Haozhe · Lanmiao Liu · Martin Benes · Wenshuo Chen · Albert Ali Salah · Itir Onal Ertugrul
Abstract:
This paper explores image modeling from the frequency space and introduces DCTdiff, an end-to-end diffusion generative paradigm that efficiently models images in the discrete cosine transform (DCT) space. We investigate the design space of DCTdiff and reveal the key design factors. Experiments on different frameworks (UViT, DiT), generation tasks, and various diffusion samplers demonstrate that DCTdiff outperforms pixel-based diffusion models regarding generative quality and training efficiency. Remarkably, DCTdiff can seamlessly scale up to 512$\times$512 resolution without using the latent diffusion paradigm and beats latent diffusion with only 1/4 of the training cost. Finally, we illustrate several intriguing properties of DCT image modeling. For example, we provide a theoretical proof of why `image diffusion can be seen as spectral autoregression', bridging the gap between diffusion and autoregressive models. The effectiveness of DCTdiff and the introduced properties suggest a promising direction for image modeling in the frequency space. The code is available.
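For intuition about the DCT space the method operates in, the transform is an orthonormal, exactly invertible change of basis. The sketch below builds the 2-D DCT from the orthonormal DCT-II matrix; DCTdiff's actual block layout and frequency weighting are design choices described in the paper and are not reproduced here.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C (so C @ C.T = I)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0] /= np.sqrt(2.0)  # DC row rescaled for orthonormality
    return C

def dct2(img):
    """2-D DCT of a square image: transform rows and columns."""
    C = dct_matrix(img.shape[0])
    return C @ img @ C.T

def idct2(coef):
    """Exact inverse: orthonormality makes the transpose the inverse."""
    C = dct_matrix(coef.shape[0])
    return C.T @ coef @ C
```

Because the transform is lossless, modeling images in DCT coefficients loses no information; the appeal is that natural-image energy concentrates in low frequencies.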
Paperid:342
Authors:Peipeng Yu · Hui Gao · Xuan Feng · Jianwei Fei · Zhihua Xia · Chip Hong Chang
Abstract:
Current vision-language models (VLMs) have demonstrated remarkable capabilities in understanding multimodal data, but their potential remains underexplored for deepfake detection due to the misalignment between their knowledge and forensic patterns. To this end, we present a novel paradigm that unlocks VLMs' potential capabilities through three components: (1) a knowledge-guided forgery adaptation module that aligns the VLM's semantic space with forensic features through contrastive learning with external manipulation knowledge; (2) a multi-modal prompt tuning framework that jointly optimizes visual-textual embeddings for both localization and explainability; (3) an iterative refinement strategy enabling multi-turn dialog for evidence-based reasoning. Our framework includes a VLM-based Knowledge-guided Forgery Detector (KFD), a VLM image encoder, and a Large Language Model (LLM). The VLM image encoder extracts visual prompt embeddings from images, while the LLM receives visual and question prompt embeddings for inference. The KFD is used to calculate correlations between image features and pristine/deepfake class embeddings, enabling forgery classification and localization. The outputs from these components are used to construct forgery prompt embeddings. Finally, we feed these prompt embeddings into the LLM to generate textual detection responses to assist judgment. Extensive experiments on multiple benchmarks, including FF++, CDF2, DFD, DFDCP, and DFDC, demonstrate that our scheme surpasses state-of-the-art methods in generalization performance, while also supporting multi-turn dialogue capabilities.
Paperid:343
Authors:yu chen · Jing Lian · Zhaofei Yu · Jizhao Liu · Jisheng Dang · Gang Wang
Abstract:
Event cameras are bio-inspired vision sensors that encode visual information with high dynamic range, high temporal resolution, and low latency. Current state-of-the-art event stream processing methods rely on end-to-end deep learning techniques. However, these models are heavily dependent on data structures, limiting their stability and generalization capabilities across tasks, thereby hindering their deployment in real-world scenarios. To address this issue, we propose a chaotic dynamics event signal processing framework inspired by the dorsal visual pathway of the brain. Specifically, we utilize a Continuous-coupled Neural Network (CCNN) to encode the event stream. The CCNN encodes polarity-invariant event sequences as periodic signals and polarity-changing event sequences as chaotic signals. We then use continuous wavelet transforms to analyze the dynamical states of CCNN neurons and establish the high-order mappings of the event stream. The effectiveness of our method is validated through integration with conventional classification networks, achieving state-of-the-art classification accuracy on the N-Caltech101 and N-CARS datasets, with results of 84.3% and 99.9%, respectively. Our method improves the accuracy of event camera-based object classification while significantly enhancing the generalization and stability of event representation.
Paperid:344
Authors:Bing Liu · wenjun Miao · Boyu Zhang · Qiankun Zhang · Bin Yuan · Wang · Shenghao Liu · Xianjun Deng
Abstract:
Network quantization, one of the most widely studied model compression methods, effectively quantizes a floating-point model to obtain a fixed-point one with negligible accuracy loss. Although great success has been achieved in reducing model size, quantization may exacerbate unfairness in model accuracy across different groups of a dataset. This paper considers two widely used algorithms: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT), in an attempt to understand how they cause this critical issue. Theoretical analysis with empirical verification reveals two responsible factors, as well as how they influence a metric of fairness in depth. A comparison between PTQ and QAT is then made, explaining the observation that QAT behaves even worse than PTQ in fairness, although it often preserves higher accuracy at lower bit-widths. Finally, the paper finds that several simple data augmentation methods can be adopted to alleviate the disparate impacts of quantization, based on the further observation that class imbalance produces distinct values of the aforementioned factors among different attribute classes. We experiment on both imbalanced (UTK-Face and FER2013) and balanced (CIFAR-10 and MNIST) datasets using ResNet and VGG models for empirical evaluation.
Paperid:345
Authors:Zhenyu Sun · Ermin Wei
Abstract:
Unlike its vanilla counterpart with i.i.d. samples, stochastic optimization with Markovian sampling allows the sampling scheme to follow a Markov chain. This problem encompasses various applications, ranging from asynchronous distributed optimization to reinforcement learning. In this work, we lower bound the sample complexity of finding $\epsilon$-approximate critical solutions for any first-order method when sampling is Markovian. We show that for samples drawn from stationary Markov processes with countable state space, any algorithm that accesses smooth, non-convex functions through queries to a stochastic gradient oracle requires at least $\Omega(\epsilon^{-4})$ samples. Moreover, for finite Markov chains, we show an $\Omega(\epsilon^{-2})$ lower bound and propose a new algorithm, called MaC-SAGE, that is proven to (nearly) match our lower bound.
Paperid:346
Authors:Thomas Y.L. Lin · Jerry Yao-Chieh Hu · Wan-Jiun Paul Chiou · Peter Lin
Abstract:
We propose a Bayesian reformulation of the Black-Litterman model for portfolio optimization without the need for subjective investor views and corresponding uncertainties $(q, \Omega)$ from humans. Classical Black-Litterman models treat $(q, \Omega)$ as user-provided factors but offer no rigorous mechanism for deriving these views, limiting their practicality. By recasting the Black-Litterman model as a Bayesian network, we incorporate asset-related features, treat $(q, \Omega)$ as latent variables, and estimate posterior distributions over both asset returns $r$ and their parameters $\theta$ directly from data. This approach unifies feature integration and parameter inference into a single framework, mitigating error propagation and ensuring coherent estimation. We explore two modes of feature integration: (1) features sharing latent parameters with returns and (2) features influencing investor views. Importantly, our formulation subsumes the standard Black-Litterman formulas as special cases when $(q, \Omega)$ are given. Numerically, empirical tests on real-world datasets consistently outperform both the Markowitz and market-index baselines without heuristic views.
Paperid:347
Authors:Harry Zhang · Luca Carlone
Abstract:
We introduce CUPS, a novel method for learning sequence-to-sequence 3D human shapes and poses from RGB videos with uncertainty quantification. To improve on prior work, we develop a method to generate and score multiple hypotheses during training, effectively integrating uncertainty quantification into the learning process. This process results in a deep uncertainty function that is trained end-to-end with the 3D pose estimator. Post-training, the learned deep uncertainty model is used as the conformity score, which can be used to calibrate a conformal predictor in order to assess the quality of the output prediction. Since the data in human pose-shape learning is not fully exchangeable, we also present two practical bounds for the coverage gap in conformal prediction, developing theoretical backing for the uncertainty bound of our model. Our results indicate that by taking advantage of deep uncertainty with conformal prediction, our method achieves state-of-the-art performance across various metrics and datasets while inheriting the probabilistic guarantees of conformal prediction. Interactive 3D visualization, code, and data will be available at https://sites.google.com/view/champpp.
Paperid:348
Authors:Anshuman Chhabra · Bo Li · Jian Chen · Prasant Mohapatra · Hongfu Liu
Abstract:
A core data-centric learning challenge is the identification of training samples that are detrimental to model performance. Influence functions serve as a prominent tool for this task and offer a robust framework for assessing training data influence on model predictions. Despite their widespread use, the high computational cost associated with calculating the inverse of the Hessian matrix poses constraints, particularly when analyzing large deep models. In this paper, we establish a bridge between identifying detrimental training samples via influence functions and outlier gradient detection. This transformation not only presents a straightforward and Hessian-free formulation but also provides insights into the role of the gradient in sample impact. Through systematic empirical evaluations, we first validate the hypothesis of our proposed outlier gradient analysis approach on synthetic datasets. We then demonstrate its effectiveness in detecting mislabeled samples in vision models and in selecting data samples that improve the performance of natural language processing transformer models. We also extend its use to influential sample identification for fine-tuning Large Language Models.
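To make the Hessian-free idea concrete, one simple instantiation scores each sample by how much its gradient deviates from the rest of the batch. This is an illustrative proxy under an assumed z-score rule, not necessarily the paper's exact scoring function.

```python
import numpy as np

def outlier_gradient_scores(grads, k=2.0):
    """Hessian-free proxy for sample influence: flag samples whose
    per-sample gradient norm is an outlier relative to the batch.

    grads: (n_samples, dim) array of per-sample loss gradients.
    Returns z-scores of the gradient norms and a boolean outlier mask
    (z > k). Simplified sketch of the outlier-gradient idea."""
    norms = np.linalg.norm(grads, axis=1)
    mu, sigma = norms.mean(), norms.std()
    z = (norms - mu) / (sigma + 1e-12)  # guard against zero spread
    return z, z > k
```

A mislabeled example typically produces an unusually large loss gradient, so it stands out under this test without ever touching the Hessian.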
Paperid:349
Authors:Mingyi Li · Michael Metel · Akiko Takeda
Abstract:
The K-means algorithm is one of the most widely studied clustering algorithms in machine learning. While extensive research has focused on trying to achieve global optima, no algorithm has been developed with even a rigorous local optimality guarantee. In this paper, we first present conditions under which the K-means algorithm converges to a locally optimal solution. Based on this, we propose simple modifications to the K-means algorithm which ensure local optimality in both the continuous and discrete senses, with the same computational complexity as the ordinary K-means algorithm. As the dissimilarity measure, we consider a general Bregman divergence, which is an extension of the squared Euclidean distance often used in the K-means algorithm. Numerical experiments on synthetic data confirm that the ordinary K-means algorithm does not always find local minima in practice, while our proposed method provides improved locally optimal solutions with smaller objective function values.
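For reference, the ordinary (Lloyd's) K-means iteration the paper modifies looks like this, using the squared Euclidean distance (the Bregman divergence mentioned above as the usual special case). This baseline sketch only converges to a fixed point; it carries none of the local-optimality certificates the paper adds.

```python
import numpy as np

def kmeans(X, centers, iters=100):
    """Plain Lloyd's K-means with squared Euclidean distance.

    X: (n, d) data; centers: (k, d) initial centers. Alternates assignment
    and centroid updates until the centers stop moving. Baseline sketch."""
    for _ in range(iters):
        # squared distances (n, k), then nearest-center assignment
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        new = np.array([X[labels == k].mean(0) if (labels == k).any()
                        else centers[k] for k in range(len(centers))])
        if np.allclose(new, centers):  # fixed point reached
            break
        centers = new
    return centers, labels
```

A fixed point of this loop need not be a local minimum of the clustering objective, which is precisely the gap the paper's modifications close.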
Paperid:350
Authors:Ze'ev Zukerman · Bassel Hamoud · Kfir Levy
Abstract:
Distributed learning methods have gained substantial momentum in recent years, with communication overhead often emerging as a critical bottleneck. Gradient compression techniques alleviate communication costs but involve an inherent trade-off between the empirical efficiency of biased compressors and the theoretical guarantees of unbiased compressors. In this work, we introduce a novel Multilevel Monte Carlo (MLMC) compression scheme that leverages biased compressors to construct statistically unbiased estimates. This approach effectively bridges the gap between biased and unbiased methods, combining the strengths of both. To showcase the versatility of our method, we apply it to top-$k$ and bit-wise compressors, resulting in enhanced variants. Furthermore, we derive an adaptive version of our approach to further improve its performance. We validate our method empirically on distributed learning tasks.
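The core MLMC trick of turning a biased compressor into an unbiased estimator can be sketched with top-k levels and a randomized telescoping sum. The level probabilities and the choice of an exact top level are assumptions of this illustration, not necessarily the paper's scheme.

```python
import random

def topk(x, k):
    """Biased top-k compressor: keep the k largest-magnitude entries."""
    idx = sorted(range(len(x)), key=lambda i: -abs(x[i]))[:k]
    out = [0.0] * len(x)
    for i in idx:
        out[i] = x[i]
    return out

def mlmc_topk(x, ks, rng):
    """MLMC-style unbiased estimate built from biased top-k levels.

    ks: increasing sparsity levels with ks[-1] == len(x) (exact top level).
    Returns C_{ks[0]}(x) + (C_{ks[L]}(x) - C_{ks[L-1]}(x)) / p for a
    uniformly random level L; the telescoping sum over levels makes the
    estimator unbiased for x. Sketch of the idea, not the paper's scheme."""
    L = rng.randrange(1, len(ks))
    p = 1.0 / (len(ks) - 1)  # uniform level probability
    base = topk(x, ks[0])
    hi, lo = topk(x, ks[L]), topk(x, ks[L - 1])
    return [b + (h - l) / p for b, h, l in zip(base, hi, lo)]
```

Averaging the estimator over all levels telescopes to the exact top level, i.e. to x itself, which is exactly the unbiasedness property that biased top-k alone lacks.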
Paperid:351
Authors:Zhangyi Liu · Feng Liu · Rui Gao · Shuang Li
Abstract:
We study convergence properties of the discrete-time Mean-Field Langevin Stochastic Descent-Ascent (MFL-SDA) algorithm for solving distributional minimax optimization. These problems arise in various applications, such as zero-sum games, generative adversarial networks, and distributionally robust learning. Despite the significance of MFL-SDA in these contexts, its discrete-time convergence rate remains underexplored. To address this gap, we establish a last-iterate convergence rate of $O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$ for MFL-SDA. This rate is nearly optimal when compared to the complexity lower bound of its Euclidean counterpart. It also matches the complexity of mean-field Langevin stochastic gradient descent for distributional minimization and the outer-loop iteration complexity of an existing double-loop algorithm for distributional minimax problems. By leveraging an elementary analysis framework that avoids PDE-based techniques, we overcome previous limitations and achieve a faster convergence rate.
Paperid:352
Authors:Zitao Wang · Ziyuan Wang · Molei Liu · Nian Si
Abstract:
Transfer learning is a popular strategy to leverage external knowledge and improve statistical efficiency, particularly with a limited target sample. We propose a novel knowledge-guided Wasserstein Distributionally Robust Optimization (KG-WDRO) framework that incorporates multiple sources of external knowledge to overcome the conservativeness of traditional WDRO, which often results in overly pessimistic shrinkage toward zero. Our method constructs smaller Wasserstein ambiguity sets by controlling the transportation along directions informed by the source knowledge. This strategy can alleviate perturbations on the predictive projection of the covariates and protect against information loss. Theoretically, we establish the equivalence between our WDRO formulation and the knowledge-guided shrinkage estimation based on collinear similarity, ensuring tractability and geometrizing the feasible set. This also reveals a novel and general interpretation for recent shrinkage-based transfer learning approaches from the perspective of distributional robustness. In addition, our framework can adjust for scaling differences in the regression models between the source and target and accommodates general types of regularization such as lasso and ridge. Extensive simulations demonstrate the superior performance and adaptivity of KG-WDRO in enhancing small-sample transfer learning.
Paperid:353
Authors:Min Zhao · Guande He · Yixiao Chen · Hongzhou Zhu · Chongxuan Li · Jun Zhu
Abstract:
Recent advancements in video generation have enabled models to synthesize high-quality, minute-long videos. However, generating even longer videos with temporal coherence remains a major challenge, and existing length extrapolation methods lead to \textit{temporal repetition} or \textit{motion deceleration}. In this work, we systematically analyze the role of frequency components in positional embeddings and identify an \textit{intrinsic frequency} that primarily governs extrapolation behavior. Based on this insight, we propose \textbf{RIFLEx}, a \textit{minimal yet effective} approach that reduces the intrinsic frequency to suppress repetition while preserving motion consistency, without requiring any additional modifications. RIFLEx offers a true \textit{free lunch}, achieving high-quality $2\times$ extrapolation on state-of-the-art video diffusion transformers in a completely \textit{training-free} manner. Moreover, it enhances quality and enables $3\times$ extrapolation by minimal fine-tuning without long videos.
Paperid:354
Authors:Hui Dai · Ryan Teehan · Mengye Ren
Abstract:
Many existing evaluation benchmarks for Large Language Models (LLMs) quickly become outdated due to the emergence of new models and training data. These benchmarks also fall short in assessing how LLM performance changes over time, as they consist of a static set of questions without a temporal dimension. To address these limitations, we propose using future event prediction as a continuous evaluation method to assess LLMs' temporal generalization and forecasting abilities. Our benchmark, Daily Oracle, automatically generates question-answer (QA) pairs from daily news, challenging LLMs to predict "future" event outcomes. Our findings reveal that as pre-training data becomes outdated, LLM performance degrades over time. While Retrieval Augmented Generation (RAG) has the potential to enhance prediction accuracy, the performance degradation pattern persists, highlighting the need for continuous model updates.
Paperid:355
Authors:Andrea Testa · Søren Hauberg · Tamim Asfour · Leonel Rozo
Abstract:
Accurately modeling and predicting complex dynamical systems, particularly those involving contact and interaction, is crucial for applications ranging from fluid dynamics to robotics, but presents significant challenges due to the intricate interplay of geometric constraints and energy exchange. This paper introduces Geometric Contact Flows (GCF), a novel framework leveraging Riemannian and contact geometry as inductive biases to learn such systems. GCF constructs a latent contact Hamiltonian model encoding desirable properties like stability or energy conservation. An ensemble of contactomorphisms then adapts this model to the target dynamics, preserving these properties while enabling robust generalization and adaptation to unseen scenarios. This ensemble allows for uncertainty-aware geodesics that attract the system's behavior toward the data support. Experiments on handwriting dynamics tasks and a robotic interaction task demonstrate the effectiveness of our approach.
Paperid:356
Authors:Daoyu Wang · Mingyue Cheng · Zhiding Liu · Qi Liu
Abstract:
Self-supervised learning has garnered increasing attention in time series analysis for benefiting various downstream tasks and reducing reliance on labeled data. Despite its effectiveness, existing methods often struggle to comprehensively capture both long-term dynamic evolution and subtle local patterns in a unified manner. In this work, we propose \textbf{TimeDART}, a novel self-supervised time series pre-training framework that unifies two powerful generative paradigms to learn more transferable representations. Specifically, we first employ a \textit{causal} Transformer encoder, accompanied by a patch-based embedding strategy, to model the evolving trends from left to right. Building on this global modeling, we further introduce a denoising diffusion process to capture fine-grained local patterns through forward diffusion and reverse denoising. Finally, we optimize the model in an autoregressive manner. As a result, TimeDART effectively accounts for both global and local sequence features in a coherent way. We conduct extensive experiments on public datasets for time series forecasting and classification. The experimental results demonstrate that TimeDART consistently outperforms previous compared methods, validating the effectiveness of our approach. Our code is available at \url{https://anonymous.4open.science/r/TimeDART-2025/}.
Paperid:357
Authors:Yunzhuo Hao · Jiawei Gu · Huichen Wang · Linjie Li · Zhengyuan Yang · Lijuan Wang · Yu Cheng
Abstract:
The ability to organically reason over and with both text and images is a pillar of human intelligence, yet the ability of Multimodal Large Language Models (MLLMs) to perform such multimodal reasoning remains underexplored. Existing benchmarks often emphasize text-dominant reasoning or rely on shallow visual cues, failing to adequately assess integrated visual and textual reasoning. We introduce EMMA (Enhanced MultiModal reAsoning), a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding. EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality, offering an enhanced test suite for MLLMs' reasoning capabilities. Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks; even advanced techniques like Chain-of-Thought prompting and test-time compute scaling underperform. These findings underscore the need for improved multimodal architectures and training paradigms to close the gap between human and model reasoning in multimodality.
Paperid:358
Authors:Jiaqi Leng · Bin Shi
Abstract:
With rapid advancements in machine learning, first-order algorithms have emerged as the backbone of modern optimization techniques, owing to their computational efficiency and low memory requirements. Recently, the connection between accelerated gradient methods and damped heavy-ball motion, particularly within the framework of Hamiltonian dynamics, has inspired the development of innovative quantum algorithms for continuous optimization. One such algorithm, Quantum Hamiltonian Descent (QHD), leverages quantum tunneling to escape saddle points and local minima, facilitating the discovery of global solutions in complex optimization landscapes. However, QHD faces several challenges, including slower convergence rates compared to classical gradient methods and limited robustness in highly non-convex problems due to the non-local nature of quantum states. Furthermore, the original QHD formulation primarily relies on function value information, which limits its effectiveness. Inspired by insights from high-resolution differential equations that have elucidated the acceleration mechanisms in classical methods, we propose an enhancement to QHD by incorporating gradient information, leading to what we call gradient-based QHD. This gradient-based QHD achieves faster convergence and significantly increases the likelihood of identifying global solutions. Numerical simulations on challenging problem instances demonstrate that this gradient-based QHD outperforms existing quantum and classical methods by at least an order of magnitude.
Paperid:359
Authors:Yuqin Cao · Xiongkuo Min · Yixuan Gao · Wei Sun · Guangtao Zhai
Abstract:
Many video-to-audio (VTA) methods have been proposed for dubbing silent AI-generated videos. An efficient quality assessment method for AI-generated audio-visual content (AGAV) is crucial for ensuring audio-visual quality. Existing audio-visual quality assessment methods struggle with unique distortions in AGAVs, such as unrealistic and inconsistent elements. To address this, we introduce AGAVQA, the first large-scale AGAV quality assessment dataset, comprising 3,382 AGAVs from 16 VTA methods. AGAVQA includes two subsets: AGAVQA-MOS, which provides multi-dimensional scores for audio quality, content consistency, and overall quality, and AGAVQA-Pair, designed for optimal AGAV pair selection. We further propose AGAV-Rater, an LMM-based model that can score AGAVs, as well as audio and music generated from text, across multiple dimensions, and select the best AGAV generated by VTA methods to present to the user. AGAV-Rater achieves state-of-the-art performance on AGAVQA, Text-to-Audio, and Text-to-Music datasets. Subjective tests also confirm that AGAV-Rater enhances VTA performance and user experience.
Paperid:360
Authors:Yudong Hu · Haoran Li · Congying Han · Tiande Guo · Bonan Li · Mingqiang Li
Abstract:
Solving the Nash equilibrium in normal-form games with large-scale strategy spaces presents significant challenges. Open-ended learning frameworks, such as PSRO and its variants, have emerged as effective solutions. However, these methods often lack an efficient metric for evaluating strategy improvement, which limits their effectiveness in approximating equilibria. In this paper, we introduce a novel evaluative metric called Advantage, which possesses desirable properties inherently connected to the Nash equilibrium, ensuring that each strategy update approaches equilibrium. Building upon this, we propose the Advantage Policy Space Response Oracle (A-PSRO), an innovative unified open-ended learning framework applicable to both zero-sum and general-sum games. A-PSRO leverages the Advantage as a refined evaluation metric, leading to a consistent learning objective for agents in normal-form games. Experiments showcase that A-PSRO significantly reduces exploitability in zero-sum games and improves rewards in general-sum games, outperforming existing algorithms and validating its practical effectiveness.
Paperid:361
Authors:Honghua Zhang · Meihua Dang · Benjie Wang · Stefano Ermon · Nanyun Peng · Guy Van den Broeck
Abstract:
Probabilistic Circuits (PCs) are tractable representations of probability distributions allowing for exact and efficient computation of likelihoods and marginals. Recent advancements have improved the scalability of PCs by leveraging their sparse properties or tensorized operations. However, no existing method fully exploits both aspects simultaneously. In this paper, we propose a novel sparse and structured parameterization for the sum blocks in PCs. By replacing dense matrices with sparse Monarch matrices, we significantly reduce the memory and computation costs, scaling PCs to the next level. From a theoretical perspective, our construction arises naturally from circuit multiplication; from a practical perspective, compared to previous efforts on scaling up tractable probabilistic models, our approach not only achieves state-of-the-art generative modeling performance on challenging benchmarks like Text8, LM1B and ImageNet, but also demonstrates superior scaling behavior, achieving the same performance with substantially less compute as measured by the number of floating-point operations (FLOPs) during training.
Paperid:362
Authors:Matthew Faw · Rajat Sen · Yichen Zhou · Abhimanyu Das
Abstract:
Motivated by the recent success of time-series foundation models for zero-shot forecasting, we present a methodology for in-context fine-tuning of a time-series foundation model. In particular, we design a pretrained foundation model that can be prompted (at inference time) with multiple time-series examples, in order to forecast a target time-series into the future. Our foundation model is specifically trained to utilize examples from multiple related time-series in its context window (in addition to the history of the target time-series) to help it adapt to the specific distribution of the target domain at inference time. We show that such a foundation model that uses in-context examples at inference time can obtain much better performance on popular forecasting benchmarks compared to supervised deep learning methods, statistical models, and other time-series foundation models. Interestingly, our in-context fine-tuning approach even matches the performance of a foundation model that is explicitly fine-tuned on the target domain.
Paperid:363
Authors:Suchith Prabhu · Bhavyajeet Singh · Anshul Mittal · Siddarth Asokan · Shikhar Mohan · Deepak Saini · Yashoteja Prabhu · Lakshya Kumar · Jian Jiao · Amit Singh · Niket Tandon · Manish Gupta · Sumeet Agarwal · Manik Varma
Abstract:
Retrieval-augmented classification and generation models significantly benefit from the early-stage fusion of high-quality text-based auxiliary metadata, often called memory, but they suffer from high inference latency and poor robustness to noise. In classification tasks, particularly the extreme classification (XC) setting, where low latency is critical, existing methods incorporate metadata for context enrichment via an XC-based retriever and obtain the representations of the relevant memory items to perform late-stage fusion, achieving low latency and improved resilience to noisy metadata. Aiming for higher accuracy while meeting low latency constraints, in this paper we propose MOGIC, an approach for metadata-infused Oracle guidance for XC tasks. In particular, we train an early-fusion Oracle classifier with access to both query-side and label-side ground-truth metadata in textual form. The Oracle is subsequently used to guide the training of any existing memory-based XC disciple model via regularization. The MOGIC algorithm, when applied to memory-based XC disciple models such as OAK, improves precision@1 and propensity-scored precision@1 by ∼2% on six standard datasets, at no additional inference-time cost to the disciple model. We also show the feasibility of applying the MOGIC algorithm to improve the performance of state-of-the-art memory-free XC approaches such as NGAME or DEXA, demonstrating that the MOGIC algorithm can be used atop any existing XC-based approach in a plug-and-play manner. Finally, we also show the robustness of the MOGIC method to missing and noisy metadata settings. We will release code on acceptance.
Paperid:364
Authors:Shiyu Wu · Jing Liu · Jing Li · Yequan Wang
Abstract:
Current fake image detectors trained on large synthetic image datasets perform satisfactorily on limited studied generative models. However, these detectors suffer a notable performance decline over unseen models. Besides, collecting adequate training data from online generative models is often expensive or infeasible. To overcome these issues, we propose Few-Shot Detector (FSD), a novel AI-generated image detector which learns a specialized metric space to effectively distinguish unseen fake images by utilizing very few samples. Experiments show that FSD achieves state-of-the-art performance by $+7.4\%$ average ACC on the GenImage dataset. More importantly, our method is better capable of capturing the intra-category common features in unseen images without further training.
Paperid:365
Authors:Yingying Feng · Jie Li · Chi Xie · Lei Tan · Jiayi Ji
Abstract:
We present MFRNet, a novel network for multimodal object re-identification that integrates multi-spectral data features to effectively retrieve specific people across different modalities. Current methods suffer from two principal limitations: (1) insufficient interaction between pixel-level semantic features across modalities, and (2) difficulty in balancing modality-shared and modality-specific features within a unified architecture. To address these challenges, our network introduces two core components. First, the Feature Fusion Module (FFM) enables fine-grained pixel-level feature generation and flexible cross-modal interaction. Second, the Feature Representation Module (FRM) efficiently extracts and combines modality-specific and modality-shared features, achieving strong discriminative ability with minimal parameter overhead. Extensive experiments on three challenging public datasets (RGBNT201, RGBNT100, and MSVR310) demonstrate the superiority of our approach in terms of both accuracy and efficiency, with improvements of 8.4% mAP and 6.9% accuracy on RGBNT201 at nearly no extra computation.
Paperid:366
Authors:Arhit Chakrabarti · Yang Ni · Debdeep Pati · Bani Mallick
Abstract:
We consider the problem of clustering grouped data for which the observations may include group-specific variables in addition to the variables that are shared across groups. This type of data is quite common; for example, in cancer genomic studies, molecular information is available for all cancers whereas cancer-specific clinical information may only be available for certain cancers. Existing grouped clustering methods only consider the shared variables but ignore valuable information from the group-specific variables. To allow for these group-specific variables to aid in the clustering, we propose a novel Bayesian nonparametric approach, termed global-local (GLocal) Dirichlet process, that models the ``global-local" structure of the observations across groups. We characterize the GLocal Dirichlet process using the stick-breaking representation and the representation as a limit of a finite mixture model. We theoretically quantify the approximation errors of the truncated prior, the corresponding finite mixture model, and the associated posterior distribution. We develop a fast variational Bayes algorithm for scalable posterior inference, which we illustrate with extensive simulations and a TCGA pan-gastrointestinal cancer dataset.
Paperid:367
Authors:Jason Chan · Robert Gaizauskas · Zhixue Zhao
Abstract:
Formal logic enables computers to reason in natural language by representing sentences in symbolic forms and applying rules to derive conclusions. However, in what our study characterizes as "rulebreaker" scenarios, this method can lead to conclusions that are typically not inferred or accepted by humans given their common sense and factual knowledge. Inspired by works in cognitive science, we create RULEBREAKERS, the first dataset for rigorously evaluating the ability of large language models (LLMs) to recognize and respond to rulebreakers (versus non-rulebreakers) in a human-like manner. Evaluating seven LLMs, we find that most models, including GPT-4o, achieve mediocre accuracy on RULEBREAKERS and exhibit some tendency to over-rigidly apply logical rules, unlike what is expected from typical human reasoners. Further analysis suggests that this apparent failure is potentially associated with the models' poor utilization of their world knowledge and their attention distribution patterns. Whilst revealing a limitation of current LLMs, our study also provides a timely counterbalance to a growing body of recent works that propose methods relying on formal logic to improve LLMs' general reasoning capabilities, highlighting their risk of further increasing divergence between LLMs and human-like reasoning.
Paperid:368
Authors:Kaizhen Zhu · Mokai Pan · Yuexin Ma · Yanwei Fu · Jingyi Yu · Jingya Wang · Ye Shi
Abstract:
Recent advances in diffusion bridge models leverage Doob’s $h$-transform to establish fixed endpoints between distributions, demonstrating promising results in image translation and restoration tasks. However, these approaches frequently produce blurred or excessively smoothed image details and lack a comprehensive theoretical foundation to explain these shortcomings. To address these limitations, we propose UniDB, a unified framework for diffusion bridges based on Stochastic Optimal Control (SOC). UniDB formulates the problem through an SOC-based optimization and derives a closed-form solution for the optimal controller, thereby unifying and generalizing existing diffusion bridge models. We demonstrate that existing diffusion bridges employing Doob’s $h$-transform constitute a special case of our framework, emerging when the terminal penalty coefficient in the SOC cost function tends to infinity. By incorporating a tunable terminal penalty coefficient, UniDB achieves an optimal balance between control costs and terminal penalties, substantially improving detail preservation and output quality. Notably, UniDB seamlessly integrates with existing diffusion bridge models, requiring only minimal code modifications. Extensive experiments across diverse image restoration tasks validate the superiority and adaptability of the proposed framework. Our code is available at https://github.com/UniDB-SOC/UniDB/.
Paperid:369
Authors:Jiayu Liu · Zhenya Huang · Chaokun Wang · Xunpeng Huang · Chengxiang Zhai · Enhong Chen
Abstract:
Owing to the capability of in-context learning, large language models (LLMs) have shown impressive performance across diverse mathematical reasoning benchmarks. However, we find that few-shot demonstrations can sometimes bring negative performance and their effectiveness on LLMs' reasoning abilities remains unreliable. To this end, in this paper, we aim to theoretically analyze the impact of in-context demonstrations on LLMs' reasoning performance. We prove that the reasoning efficacy (measured by empirical prediction loss) can be bounded by a \emph{LLM-oriented semantic similarity} and an \emph{inference stability of demonstrations}, which is general for both one-shot and few-shot scenarios. Based on this finding, we propose a straightforward, generalizable, and low-complexity demonstration selection method named LMS3. It facilitates selecting the most pertinent samples for different LLMs and includes a novel demonstration rejection mechanism to automatically filter out samples that are unsuitable for few-shot learning. Through experiments on three representative benchmarks, two LLM backbones, and multiple few-shot settings, we verify that our LMS3 has superiority and achieves consistent improvements on all datasets, which existing methods have been unable to accomplish.
Paperid:370
Authors:Gramoz Goranci · Peter Kiss · Neel Patel · Martin Seybold · Eva Szilagyi · Da Wei Zheng
Abstract:
We consider the Euclidean bichromatic matching problem in the dynamic setting, where the goal is to efficiently process point insertions and deletions while maintaining a high-quality solution. Computing the minimum-cost bichromatic matching is one of the core problems in geometric optimization and has found many applications, most notably in estimating the Wasserstein distance between two distributions. In this work, we present the first fully dynamic algorithm for Euclidean bichromatic matching with sublinear update time. For any fixed $\varepsilon > 0$, our algorithm achieves $O(1/\varepsilon)$-approximation and handles updates in $O(n^{\varepsilon})$ time. Our experiments show that our algorithm enables effective monitoring of the distributional drift in the Wasserstein distance on real and synthetic data sets, while outperforming the runtime of baseline approximations by orders of magnitude.
Paperid:371
Authors:Wenxu Wang · Rui Zhou · Jing Wang · Bo Han · Yun Zhou · Cheng Zhu · Ruichun Tang · Nevin Zhang
Abstract:
Open-Set Domain Adaptation (OSDA) aims to transfer knowledge from the labeled source domain to the unlabeled target domain that contains unknown categories, thus facing the challenges of domain shift and unknown category recognition. While recent works have demonstrated the potential of causality for domain alignment, little exploration has been conducted on causal-inspired theoretical frameworks for OSDA. To fill this gap, we introduce the concept of Susceptibility and propose a novel Counterfactual-based susceptibility risk framework for OSDA, termed COSDA. Specifically, COSDA consists of three novel components: (i) a Susceptibility Risk Estimator (SRE) for capturing causal information, along with comprehensive derivations of the computable theoretical upper bound, forming a risk minimization framework under the OSDA paradigm; (ii) a Contrastive Feature Alignment (CFA) module, which is theoretically proven based on mutual information to satisfy the Exogeneity assumption and facilitate cross-domain feature alignment; (iii) a Virtual Multi-unknown-categories Prototype (VMP) pseudo-labeling strategy, providing label information by measuring how similar samples are to known and multiple virtual unknown category prototypes, thereby assisting in open-set recognition and intra-class discriminative feature learning. Extensive experiments on both benchmark and synthetic datasets demonstrate that our approach achieves state-of-the-art performance.
Paperid:372
Authors:Runyu Lu · Yuanheng Zhu · Dongbin Zhao
Abstract:
This paper proposes Constrained Exploitability Descent (CED), a model-free offline reinforcement learning (RL) algorithm for solving adversarial Markov games. CED combines the game-theoretical approach of Exploitability Descent (ED) with policy constraint methods from offline RL. While policy constraints can perturb the optimal pure-strategy solutions in single-agent scenarios, we find the side effect less detrimental in adversarial games, where the optimal policy can be a mixed-strategy Nash equilibrium. We theoretically prove that, under the uniform coverage assumption on the dataset, CED converges to a stationary point in deterministic two-player zero-sum Markov games (MGs). We further prove that the min-player policy at the stationary point follows the property of mixed-strategy Nash equilibrium in MGs. Compared to the model-based ED method that optimizes the max-player policy, our CED method no longer relies on a generalized gradient. Experiments in matrix games, a tree-form game, and an infinite-horizon soccer game verify that CED can find an equilibrium policy for the min-player as long as the offline dataset guarantees uniform coverage. Besides, CED achieves a significantly lower NashConv compared to an existing pessimism-based method and can gradually improve the behavior policy even under non-uniform data coverages.
Paperid:373
Authors:Xiaojie Wang · Bin Yang
Abstract:
Generating samples from a high-dimensional probability distribution is a fundamental task with wide-ranging applications in scientific computing, statistics and machine learning. This article revisits the popular Langevin Monte Carlo (LMC) sampling algorithms and provides an optimal non-asymptotic error analysis in $\mathcal{W}_2$-distance in a non-convex setting. In particular, we prove an optimal error bound $O(\sqrt{d} h)$, which guarantees a mixing time $\tilde{O}(\sqrt{d} \epsilon^{-1})$ to achieve the accuracy tolerance $\epsilon$, under certain log-smooth conditions and the assumption that the target distribution satisfies a log-Sobolev inequality, as opposed to the strongly log-concave condition used in (Li et al., 2019; 2022). This bound matches the best one in the strongly log-concave case and improves upon the best-known convergence rates in non-convex settings. To prove it, we establish a new framework of uniform-in-time convergence for discretizations of SDEs. Distinct from (Li et al., 2019; 2022), we start from the finite-time mean-square fundamental convergence theorem, which, combined with uniform-in-time moment bounds of LMC and the exponential ergodicity of SDEs in the non-convex setting, gives the desired uniform-in-time convergence. Our framework also applies to the case when the gradient of the potential $U$ is non-globally Lipschitz with superlinear growth, for which modified LMC samplers are proposed and analyzed, with a non-asymptotic error bound in $\mathcal{W}_2$-distance obtained. Numerical experiments corroborate the theoretical analysis.
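As background for the algorithm analyzed in this abstract, the basic unadjusted LMC iteration $x_{k+1} = x_k - h \nabla U(x_k) + \sqrt{2h}\,\xi_k$ can be sketched as follows. This is a generic textbook implementation, not the modified samplers proposed in the paper; the standard-Gaussian target and step size are illustrative assumptions.

```python
import numpy as np

def langevin_monte_carlo(grad_U, x0, h, n_steps, rng):
    """Unadjusted Langevin iterates: x_{k+1} = x_k - h*grad_U(x_k) + sqrt(2h)*xi_k."""
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        x = x - h * grad_U(x) + np.sqrt(2.0 * h) * rng.standard_normal(x.shape)
        samples.append(x.copy())
    return np.array(samples)

# Illustrative target: standard Gaussian in d=2, where U(x) = ||x||^2/2 and grad U(x) = x.
rng = np.random.default_rng(0)
samples = langevin_monte_carlo(lambda x: x, np.zeros(2), h=0.05, n_steps=20000, rng=rng)
tail = samples[5000:]   # discard burn-in; mean should be near 0, marginal variance near 1
```

For this log-concave toy target the chain mixes quickly; the non-convex and superlinear-growth regimes studied in the paper are precisely where this plain scheme needs the modifications the authors analyze.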
Paperid:374
Authors:Tobias Braun · Mark Rothermel · Marcus Rohrbach · Anna Rohrbach
Abstract:
The proliferation of disinformation demands reliable and scalable fact-checking solutions. We present Dynamic Evidence-based FAct-checking with Multimodal Experts (DEFAME), a modular, zero-shot MLLM pipeline for open-domain, text-image claim verification. DEFAME operates in a six-stage process, dynamically selecting the tools and search depth to extract and evaluate textual and visual evidence. Unlike prior approaches that are text-only, lack explainability, or rely solely on parametric knowledge, DEFAME performs end-to-end verification, accounting for images in claims and evidence while generating structured, multimodal reports. Evaluation on the popular benchmarks VERITE, AVerITeC, and MOCHEG shows that DEFAME surpasses all previous methods, establishing itself as the new state-of-the-art system for uni- and multimodal fact-checking. Moreover, we introduce a new benchmark, CLAIMREVIEW24+, featuring claims after the knowledge cutoff of GPT-4o to avoid data leakage. Here, DEFAME drastically outperforms the GPT Chain-of-Thought baseline, demonstrating temporal generalizability and the potential for real-time fact-checking.
Paperid:375
Authors:Amber Yijia Zheng · Cedar Site Bai · Brian Bullins · Raymond A. Yeh
Abstract:
Model immunization aims to pretrain models that are difficult to fine-tune on harmful tasks while retaining their utility on other non-harmful tasks. Though prior work has shown empirical evidence for immunizing text-to-image models, the key understanding of when immunization is possible and a precise definition of an immunized model remains unclear. In this work, we propose a framework, based on the condition number of a Hessian matrix, to analyze model immunization for linear models. Building on this framework, we design an algorithm with regularization terms to control the resulting condition numbers after pre-training. Empirical results on linear models and non-linear deep-nets demonstrate the effectiveness of the proposed algorithm on model immunization.
Paperid:376
Authors:Wenbin Wang · Yu Shi · Ziping Zhao
Abstract:
The emergence of multi-dimensional data presents significant challenges for traditional regression models based on matrices or vectors, particularly in capturing multi-directional correlations. In response, tensor regression has been proposed as a powerful framework for modeling linear relationships among multi-dimensional variables. In this paper, we introduce a high-dimensional tensor-response tensor regression model under low-dimensional structural assumptions, such as sparsity and low-rankness. Assuming the underlying tensor lies within an unknown low-dimensional subspace, we consider a least squares estimation framework with non-convex penalties. Theoretically, we derive general risk bounds for the resulting estimators and demonstrate that they achieve the oracle statistical rates under mild technical conditions. To compute the proposed estimators efficiently, we introduce an accelerated proximal gradient algorithm demonstrating rapid convergence in practice. Extensive experiments on synthetic and real-world datasets validate the effectiveness of the proposed regression model and showcase the practical utility of the theoretical findings.
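The proximal gradient machinery this abstract relies on can be illustrated on the simplest penalized least squares problem. The sketch below is a minimal non-accelerated proximal gradient (ISTA) loop for a vectorized sparse regression, not the paper's accelerated tensor-response estimator; the problem sizes and $\ell_1$ penalty are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1: shrink each entry toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_grad_lasso(X, y, lam, step, n_iters):
    """Minimize 0.5*||X b - y||^2 + lam*||b||_1 by proximal gradient:
    a gradient step on the smooth loss, then the soft-thresholding prox."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ b - y)
        b = soft_threshold(b - step * grad, step * lam)
    return b

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 10))
b_true = np.zeros(10)
b_true[:3] = [2.0, -1.5, 1.0]                  # sparse ground truth
y = X @ b_true + 0.01 * rng.standard_normal(200)
step = 1.0 / np.linalg.norm(X, 2) ** 2          # 1/L, L = largest squared singular value
b_hat = prox_grad_lasso(X, y, lam=0.5, step=step, n_iters=500)
# b_hat recovers the sparse support, with the remaining coefficients driven to zero
```

An accelerated variant, as in the paper, would add a momentum/extrapolation step between iterations; the prox step itself is unchanged.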
Paperid:377
Authors:Tianjian Li · Daniel Khashabi
Abstract:
Aligning language models with human preferences relies on pairwise preference datasets. While some studies suggest that on-policy data consistently outperforms off-policy data for preference learning, others indicate that the advantages of on-policy data are task-dependent, highlighting the need for a systematic exploration of their interplay. In this work, we show that on-policy and off-policy data offer complementary strengths: on-policy data is particularly effective for reasoning tasks like math and coding, while off-policy data performs better on subjective tasks such as creative writing and making personal recommendations. Guided by these findings, we introduce SimpleMix, an approach that combines the complementary strengths of on-policy and off-policy preference learning by simply mixing the two data sources. Our empirical results across diverse tasks and benchmarks demonstrate that SimpleMix substantially improves language model alignment. Specifically, SimpleMix improves upon on-policy DPO and off-policy DPO by an average of 6.03 on Alpaca Eval 2.0. Moreover, it surpasses prior, much more complex approaches to combining on- and off-policy data, such as HyPO and DPO-Mix-P, by an average of 3.05. These findings validate the effectiveness and efficiency of SimpleMix for enhancing preference-based alignment.
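The mixing step itself is simple; a minimal sketch (the 50/50 ratio and the fixed total size are our assumptions for illustration, not values from the paper) might look like:

```python
import random

def simple_mix(on_policy, off_policy, on_ratio=0.5, seed=0):
    # draw a fraction of preference pairs from each source and shuffle
    # them into one training set; `on_ratio` is a hypothetical knob
    rng = random.Random(seed)
    n_on = int(len(on_policy) * on_ratio)
    n_off = min(len(off_policy), len(on_policy) - n_on)  # keep total size comparable
    mixed = rng.sample(on_policy, n_on) + rng.sample(off_policy, n_off)
    rng.shuffle(mixed)
    return mixed
```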
Paperid:378
Authors:Dong Li · Yidi Liu · Xueyang Fu · Jie Huang · Senyan Xu · Qi Zhu · Zheng-Jun Zha
Abstract:
Image deraining aims to remove rain streaks from rainy images and restore clear backgrounds. Recent research employing the Fourier transform has proved effective for image deraining, since the transform acts as an effective frequency prior for capturing rain streaks. However, although low and high frequencies in images are interdependent, these Fourier-based methods rarely exploit the correlation between different frequencies to couple their learning procedures, limiting the full utilization of frequency information for image deraining. Meanwhile, the recently emerged Mamba technique has demonstrated its effectiveness and efficiency for modeling correlations in various domains (e.g., spatial, temporal), and we argue that introducing Mamba into the as-yet unexplored Fourier space to correlate different frequencies would help improve image deraining. This motivates us to propose a new framework, termed FourierMamba, which performs image deraining with Mamba in the Fourier space. Owing to the unique arrangement of frequency orders in Fourier space, the core of FourierMamba lies in the scanning encoding of different frequencies, where the low-to-high frequency ordering differs between the spatial dimension (unarranged along the axis) and the channel dimension (arranged along the axis). Therefore, we design FourierMamba to correlate Fourier-space information in the spatial and channel dimensions with distinct designs. Specifically, in the spatial-dimension Fourier space, we introduce zigzag coding to scan the frequencies and rearrange them from low to high, thereby correlating the connections between frequencies in order; in the channel-dimension Fourier space, where frequencies are already arranged along the axis, we can directly use Mamba to perform frequency correlation and improve the channel information representation. Extensive experiments reveal that our method outperforms state-of-the-art methods both qualitatively and quantitatively.
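A zigzag scan over a 2D frequency grid, as used in the spatial dimension, can be sketched as follows (a generic zigzag for illustration; we assume the spectrum has been shifted so low frequencies sit near index (0, 0)):

```python
def zigzag_indices(h, w):
    # walk the anti-diagonals of an (h, w) grid, alternating direction,
    # so that coefficients near the origin (low frequencies) come first
    order = []
    for s in range(h + w - 1):
        diag = [(i, s - i) for i in range(h) if 0 <= s - i < w]
        order.extend(diag if s % 2 == 0 else diag[::-1])
    return order
```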
Paperid:379
Authors:Rohan Shenoy · Evan Coleman · Hans Gaensbauer · Elsa Olivetti
Abstract:
Quantifying the elemental composition of a material is a general scientific challenge with broad relevance to environmental sustainability. Existing techniques for the measurement of atomic abundances generally require laboratory conditions and expensive equipment. As a result, they cannot be deployed in situ without significant capital investment, limiting their proliferation. Measurement techniques based on nuclear magnetic resonance (NMR) hold promise in this setting due to their applicability across the periodic table, their nondestructive manipulation of samples, and their amenability to in silico optimization. In this work, we learn policies to modulate NMR pulses for rapid atomic abundance quantification. Our approach involves three inter-operating agents which (1) rapidly align nuclear spins for measurement, (2) quickly force relaxation to equilibrium, and (3) toggle control between agents (1) and (2) to minimize overall measurement time. To demonstrate this technique, we consider a specific use case of low-magnetic-field carbon-13 quantification for low-cost, portable analysis of foodstuffs and soils. We find significant performance improvements relative to traditional NMR pulse sequencing, and discuss limitations on the applicability of this approach.
Paperid:380
Authors:Jin Zhu · Jingyi Li · Hongyi Zhou · Yinan Lin · Chengchun Shi · Zhenhua Lin
Abstract:
This paper focuses on the design of spatial experiments to optimize the amount of information derived from the experimental data and enhance the accuracy of the resulting causal effect estimator. We propose a surrogate function for the mean squared error (MSE) of the estimator, which facilitates the use of classical graph cut algorithms to learn the optimal design. Our proposal offers three key advances: (1) it accommodates moderate to large spatial interference effects; (2) it adapts to different spatial covariance functions; (3) it is computationally efficient. Theoretical results and numerical experiments, based on synthetic environments and a dispatch simulator that models a city-scale ridesharing market, further validate the effectiveness of our design.
Paperid:381
Authors:Yuxuan Sun · Ruikang Liu · Haoli Bai · Han Bao · Kang Zhao · Yuening Li · JiaxinHu · Xianzhi Yu · Lu Hou · Chun Yuan · Xin Jiang · Wulong Liu · Jun Yao
Abstract:
Recently, quantization has been widely used for the compression and acceleration of large language models (LLMs). Due to the outliers in LLMs, it is crucial to flatten weights and activations to minimize quantization error with equally spaced quantization points. Prior research explores various pre-quantization transformations to suppress outliers, such as per-channel scaling and Hadamard transformation. However, we observe that these transformed weights and activations can still exhibit steep and dispersed distributions. In this paper, we propose FlatQuant (Fast and Learnable Affine Transformation), a new post-training quantization approach that enhances the flatness of weights and activations. Our approach identifies optimal affine transformations tailored to each linear layer, calibrated in hours via a lightweight objective. To reduce runtime overhead, we apply Kronecker decomposition to the transformation matrices, and fuse all operations in FlatQuant into a single kernel. Extensive experiments demonstrate that FlatQuant establishes a new state-of-the-art benchmark for quantization. For example, the proposed method achieves less than 1\% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5\%. Additionally, it provides up to 2.3x prefill speedup and 1.7x decoding speedup compared to the FP16 model.
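The runtime benefit of a Kronecker-factored transform comes from never materializing the full matrix. A sketch of the standard identity (A ⊗ B) vec(X) = vec(A X B^T), in row-major convention (our illustration of the general trick, not FlatQuant's fused kernel):

```python
import numpy as np

def kron_apply(A, B, x):
    # apply (A kron B) @ x without materializing the Kronecker product;
    # cost drops from O((mp)(nq)) to two small matrix multiplies
    m, n = A.shape
    p, q = B.shape
    X = x.reshape(n, q)            # inverse of row-major vec
    return (A @ X @ B.T).reshape(m * p)
```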
Paperid:382
Authors:Bilgehan Sel · Lifu Huang · Naren Ramakrishnan · Ruoxi Jia · Ming Jin
Abstract:
Large language models (LLMs) are making inroads into classical AI problems such as automated planning, yet key shortcomings continue to hamper their integration. Chain-of-Thought (CoT) struggles in complex multi-step reasoning, and Tree-of-Thoughts requires multiple queries that increase computational overhead. Recently, Algorithm-of-Thoughts (AoT) has shown promise using in-context examples, at the cost of significantly longer solutions compared to CoT. Aimed at bridging the solution length gap between CoT and AoT, this paper introduces AoT-O3, which combines supervised fine-tuning on AoT-style plans with a reinforcement learning (RL) framework designed to reduce solution length. The RL component uses a reward model that favors concise, valid solutions while maintaining planning accuracy. Empirical evaluations indicate that AoT-O3 shortens solution length by up to 80\% compared to baseline AoT while maintaining or surpassing prior performance. These findings suggest a promising pathway for more efficient, scalable LLM-based planning.
Paperid:383
Authors:Hao Li · Yu-Hao Huang · Chang Xu · Viktor Schlegel · Renhe Jiang · Riza Batista-Navarro · Goran Nenadic · Jiang Bian
Abstract:
Time-series Generation (TSG) is a prominent research area with broad applications in simulations, data augmentation, and counterfactual analysis. While existing methods have shown promise in unconditional single-domain TSG, real-world applications demand cross-domain approaches capable of controlled generation tailored to domain-specific constraints and instance-level requirements. In this paper, we argue that text can provide semantic insights, domain information and instance-specific temporal patterns to guide and improve TSG. We introduce ``Text-Controlled TSG'', a task focused on generating realistic time series by incorporating textual descriptions. To address data scarcity in this setting, we propose a novel LLM-based Multi-Agent framework that synthesizes diverse, realistic text-to-TS datasets. Furthermore, we introduce Bridge, a hybrid text-controlled TSG framework that integrates semantic prototypes with textual descriptions to support domain-level guidance. This approach achieves state-of-the-art generation fidelity on 11 of 12 datasets, and improves controllability by 12.52% in MSE and 6.34% in MAE compared to generation without text input, highlighting its potential for generating tailored time-series data.
Paperid:384
Authors:Georgy Noarov · Ramya Ramalingam · Aaron Roth · Stephan Xie
Abstract:
We give an efficient algorithm for producing multidimensional forecasts in an online adversarial environment that have low bias subject to any polynomial number of conditioning events, which can depend both on external context and on our predictions themselves. We demonstrate the use of this algorithm with several applications. We show how to make predictions that can be transparently consumed by any polynomial number of downstream decision makers with different utility functions, guaranteeing them diminishing swap regret at optimal rates. We also give the first efficient algorithms for guaranteeing diminishing conditional regret in online combinatorial optimization problems for an arbitrary polynomial number of conditioning events --- i.e., on an arbitrary number of intersecting subsequences determined both by context and our own predictions. Finally, we give the first efficient algorithm guaranteeing online multicalibration with $T^{2/3}$ rates in the ECE metric.
Paperid:385
Authors:Michael Sander · Vincent Roulet · Tianlin Liu · Mathieu Blondel
Abstract:
Energy-based models (EBMs) offer a flexible framework for parameterizing probability distributions using neural networks. However, learning EBMs by exact maximum likelihood estimation (MLE) is generally intractable, due to the need to compute the partition function. In this paper, we propose a novel min-min formulation for approximately learning probabilistic EBMs in combinatorially-large discrete spaces, such as sets or permutations. Our key idea is to jointly learn both an energy model and its log-partition, parameterized as a neural network. Our approach not only provides a novel tractable objective criterion to learn EBMs by stochastic gradient descent (without relying on MCMC), but also a novel means to estimate the log-partition function on unseen data points. On the theoretical side, we show that our approach recovers the optimal MLE solution when optimizing in the space of continuous functions. Furthermore, we show that our approach naturally extends to the broader family of Fenchel-Young losses, allowing us to obtain the first tractable method for optimizing the sparsemax loss in combinatorially-large spaces. We demonstrate our approach on multi-label classification and label ranking.
Paperid:386
Authors:Zhigaoyuan Wang · Ying Sun · Hengshu Zhu
Abstract:
Spatiotemporal forecasting offers the potential to discover evolutionary patterns in geographical scientific data. However, geographical scientific datasets are often manually collected across studies, resulting in limited time spans and data scales. This hinders existing methods that rely on rich historical data for individual entities. In this paper, we argue that heterogeneous datasets from different studies can provide complementary insights into the same underlying system, helping improve predictions for geographical entities with limited historical data. To this end, we propose a Segment Quadtree Geographical Embedding Framework (SQGEF). SQGEF integrates knowledge from datasets with varied target entities, time spans, and observation variables to learn unified representations for multi-granularity entities—including those absent during training. Specifically, we propose a novel data structure, Segment Quadtree, that flexibly accommodates entities of varying granularities. SQGEF not only captures multi-level interactions from grid data but also extracts nested relationships and human-defined boundaries from diverse entities, enabling a comprehensive understanding of complex geographical structures. Experiments on real-world datasets demonstrate that SQGEF effectively represents unseen geographical entities and enhances performance for various models.
Paperid:387
Authors:Zhengqi Gao · Kaiwen Zha · Tianyuan Zhang · Zihui Xue · Duane Boning
Abstract:
Guidance techniques are simple yet effective for improving conditional generation in diffusion models. Despite their empirical success, the practical implementation of guidance diverges significantly from its theoretical motivation. In this paper, we reconcile this discrepancy by replacing the scaled marginal distribution target, which we prove theoretically invalid, with a valid scaled joint distribution objective. Additionally, we show that the established guidance implementations are approximations to the intractable optimal solution under a no-future-foresight constraint. Building on these theoretical insights, we propose rectified gradient guidance (REG), a versatile enhancement designed to boost the performance of existing guidance methods. Experiments on 1D and 2D examples demonstrate that REG provides a better approximation to the optimal solution than prior guidance techniques, validating the proposed theoretical framework. Extensive experiments on class-conditional ImageNet and text-to-image generation tasks show that incorporating REG consistently improves FID and Inception/CLIP scores across various settings compared to its absence.
Paperid:388
Authors:Ruizhe Wang · Yeyun Gong · Xiao Liu · Guoshuai Zhao · Ziyue Yang · Baining Guo · Zheng-Jun Zha · Peng CHENG
Abstract:
The growing computational demands of training large language models (LLMs) necessitate more efficient methods. Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce these costs. While FP8 precision has demonstrated feasibility, leveraging FP4 remains a challenge due to significant quantization errors and limited representational capacity. This work introduces the first FP4 training framework for LLMs, addressing these challenges with two key innovations: a differentiable quantization estimator for precise weight updates and an outlier clamping and compensation strategy to prevent activation collapse. To ensure stability, the framework integrates a mixed-precision training scheme and vector-wise quantization. Experimental results demonstrate that our FP4 framework achieves accuracy comparable to BF16 and FP8, with minimal degradation, scaling effectively to 13B-parameter LLMs trained on up to 100B tokens. With the emergence of next-generation hardware supporting FP4, our framework sets a foundation for efficient ultra-low precision training.
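For contrast with the paper's FP4 scheme, here is plain symmetric round-to-nearest 4-bit integer quantization, the kind of baseline whose error ultra-low-precision training must contend with (our illustrative baseline, not the paper's format or estimator):

```python
import numpy as np

def quantize_int4(w):
    # symmetric round-to-nearest quantization to 16 integer levels in
    # [-8, 7], scaled to the tensor's absolute maximum (assumes w != 0)
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7)
    return q * scale
```

Values on the 16-level grid round-trip exactly; everything else incurs an error of at most half a quantization step, which is why flattening outliers (as in the methods above) matters so much at 4 bits.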
Paperid:389
Authors:Jueqing Lu · Wray Buntine · Yuanyuan Qi · Joanna Dipnall · Belinda Gabbe · Lan Du
Abstract:
Resolving conflicts is essential to make multi-view classification decisions more reliable. Much research has been conducted on learning consistent and informative representations among different views, often assuming that all views are equally important and perfectly aligned. However, real-world multi-view data may not always conform to these assumptions, as some views may express distinct information. To address this issue, we develop a computational trust-based discounting method to enhance the existing Evidential Multi-view framework in scenarios where conflicts between different views may arise. Its belief fusion process considers the reliability of predictions made by individual views via an instance-wise probability-sensitive trust discounting mechanism. We evaluate our method on six real-world datasets, using Top-1 Accuracy, Fleiss’ Kappa, and a new metric called Multi-View Agreement with Ground Truth that takes the ground truth labels into consideration, to measure the reliability of the predictions. We also evaluate whether uncertainty measures can effectively indicate prediction correctness by calculating the AUROC. The experimental results show that computational trust can effectively resolve conflicts, paving the way for more reliable multi-view classification models in real-world applications.
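Trust discounting in the subjective-logic tradition can be sketched as follows (a generic rule for illustration; the paper's instance-wise probability-sensitive mechanism differs in how the trust score is computed):

```python
def trust_discount(beliefs, uncertainty, trust):
    # scale each belief mass by the trust score and move the discounted
    # mass to uncertainty, so the opinion still sums to one; a view we
    # trust less contributes weaker evidence to the fusion step
    new_beliefs = [trust * b for b in beliefs]
    new_uncertainty = 1.0 - trust * (1.0 - uncertainty)
    return new_beliefs, new_uncertainty
```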
Paperid:390
Authors:Feng Wang · Yaodong Yu · Wei Shao · Yuyin Zhou · Alan Yuille · Cihang Xie
Abstract:
Since the introduction of the Vision Transformer (ViT), patchification has long been regarded as a standard image preprocessing step for plain visual architectures. By compressing the spatial size of images, this approach effectively shortens the token sequence and reduces the computational cost of ViT-like plain architectures. In this work, we aim to thoroughly examine the information loss caused by this patchification-based compressive encoding paradigm and how it affects visual understanding. We conduct extensive patch size scaling experiments and observe an intriguing scaling law in patchification: models consistently benefit from decreased patch sizes and attain improved predictive performance until reaching the minimum patch size of 1×1, i.e., pixel tokenization. This conclusion is broadly applicable across different vision tasks, various input scales, and diverse architectures such as ViT and the recent Mamba models. Moreover, as a by-product, we discover that with smaller patches, task-specific decoder heads become less critical for dense prediction. In the experiments, we successfully scale up the visual sequence to an exceptional length of 50,176 tokens, achieving a competitive test accuracy of 84.6% with a base-sized model on the ImageNet-1k benchmark. We hope this study can provide insights and theoretical foundations for future work on building non-compressive vision models.
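The 50,176-token figure follows directly from patchification arithmetic (assuming, as is standard for ImageNet-scale ViTs, a 224×224 input):

```python
def num_tokens(image_size=224, patch_size=16):
    # ViT-style patchification: sequence length grows quadratically as the
    # patch shrinks; patch_size=1 is the pixel tokenization discussed above
    return (image_size // patch_size) ** 2
```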
Paperid:391
Authors:Martin Van Waerebeke · Kevin Scaman · Marco Lorenzi · Giovanni Neglia
Abstract:
Machine Unlearning (MU) aims at removing the influence of specific data points from a trained model, striving to achieve this at a fraction of the cost of full model retraining. In this paper, we analyze the efficiency of unlearning methods and establish the first upper and lower bounds on minimax computation times for this problem, characterizing the performance of the most efficient algorithm against the most difficult objective function. Specifically, for strongly convex objective functions and under the assumption that the forget data is inaccessible to the unlearning method, we provide a phase diagram for the unlearning complexity ratio, a novel metric that compares the computational cost of the best unlearning method to full model retraining. The phase diagram reveals three distinct regimes: one where unlearning at a reduced cost is infeasible, another where unlearning is trivial because adding noise suffices, and a third where unlearning achieves significant computational advantages over retraining. These findings highlight the critical role of factors such as data dimensionality, the number of samples to forget, and privacy constraints in determining the practical feasibility of unlearning.
Paperid:392
Authors:Jiangshan Wang · Junfu Pu · Zhongang Qi · Jiayi Guo · Yue Ma · Nisha Huang · Yuxin Chen · Xiu Li · Ying Shan
Abstract:
Rectified-flow-based diffusion transformers like FLUX and OpenSora have demonstrated outstanding performance in the field of image and video generation. Despite their robust generative capabilities, these models often struggle with inversion inaccuracies, which could further limit their effectiveness in downstream tasks such as image and video editing. To address this issue, we propose RF-Solver, a novel training-free sampler that effectively enhances inversion precision by mitigating the errors in the ODE-solving process of rectified flow. Specifically, we derive the exact formulation of the rectified flow ODE and apply the high-order Taylor expansion to estimate its nonlinear components, significantly enhancing the precision of ODE solutions at each timestep. Building upon RF-Solver, we further propose RF-Edit, a general feature-sharing-based framework for image and video editing. By incorporating self-attention features from the inversion process into the editing process, RF-Edit effectively preserves the structural information of the source image or video while achieving high-quality editing results. Our approach is compatible with any pre-trained rectified-flow-based models for image and video tasks, requiring no additional training or optimization. Extensive experiments across generation, inversion, and editing tasks in both image and video modalities demonstrate the superiority and versatility of our method. The source code will be released.
Paperid:393
Authors:Minting Pan · Yitao Zheng · Jiajian Li · Yunbo Wang · Xiaokang Yang
Abstract:
Offline reinforcement learning (RL) optimizes control policies using static datasets, reducing the need for costly and potentially unsafe exploration in real-world scenarios. However, it faces key challenges, particularly value overestimation due to the lack of an interactive environment. In this paper, we introduce Video-Enhanced Offline RL (VeoRL), a novel model-based RL approach that builds an interactive world model using a diverse collection of unlabeled, heterogeneous video data—freely accessible and potentially unlimited from the Internet. Through model-based policy optimization, VeoRL effectively transfers commonsense knowledge of control behaviors and physical dynamics, extracted from natural videos, to the RL agent. Our approach demonstrates substantial improvements over existing offline RL methods across a wide range of visuomotor control tasks, including robotic manipulation, autonomous driving, and open-world video games, delivering over a $100\%$ performance boost on multiple tasks.
Paperid:394
Authors:Yifei Zhou · Qianlan Yang · Kaixiang Lin · Min Bai · Xiong Zhou · Yu-Xiong Wang · Sergey Levine · Li Li
Abstract:
A generalist foundation model agent needs a large and diverse skill repertoire, such as finding directions between two travel locations and buying specific items from the Internet. If each skill must be specified manually through a fixed set of human-annotated instructions, the agent’s skill repertoire will necessarily be limited by the scalability of human annotation. In this work, we address this challenge by proposing Proposer-Agent-Evaluator (PAE), an effective learning system that enables foundation model agents to autonomously discover and practice skills in the wild. After a context-aware task proposer generates instructions based on website information, the agent policy attempts those tasks in the real world, with the resulting trajectories evaluated by an autonomous VLM-based success evaluator. The success evaluation serves as the reward signal for the agent to refine its policies through RL. We validate PAE on challenging vision-based web navigation, using both real-world and self-hosted websites from WebVoyager and WebArena. Our results show that PAE significantly improves the zero-shot generalization capability of VLM Internet agents (around 50% relative improvement) to both unseen tasks and websites.
Paperid:395
Authors:Severi Rissanen · RuiKang OuYang · Jiajun He · Wenlin Chen · Markus Heinonen · Arno Solin · Jose Miguel Hernandez-Lobato
Abstract:
Recent research has focused on designing neural samplers that amortize the process of sampling from unnormalized densities. However, despite significant advancements, they still fall short of the state-of-the-art MCMC approach, Parallel Tempering (PT), in terms of the number of target evaluations. On the other hand, we also identify several inefficiencies in PT: (1) information across different temperature levels is only exchanged through swapping samples; (2) the varying mixing rates across temperatures lead to wasted computation, particularly at low temperatures early on; and (3) PT’s reliance on MCMC results in dependent samples, slowing transitions between modes. To address the weaknesses in both approaches, we propose Progressive Tempering Sampler with Diffusion (PTSD), which trains diffusion models sequentially across temperatures. We introduce a novel method to combine high-temperature diffusion models to generate approximate lower-temperature samples, which are minimally refined using MCMC and used to train the next diffusion model. This approach enables efficient reuse of sample information across temperature levels while generating well-mixed, uncorrelated samples. Our method significantly improves target evaluation efficiency, outperforming diffusion-based neural samplers and demonstrating a promising direction to improve over PT.
Paperid:396
Authors:Boyuan Li · Yicheng Luo · Zhen Liu · Junhao Zheng · Jianming Lv · Qianli Ma
Abstract:
Irregular multivariate time series (IMTS) are characterized by irregular time intervals within variables and unaligned observations across variables, posing challenges in learning temporal and variable dependencies. Many existing IMTS models either require padded samples to learn separately from temporal and variable dimensions, or represent original samples via bipartite graphs or sets. However, the former approaches often need to handle extra padding values, affecting efficiency and disrupting original sampling patterns, while the latter have limitations in capturing dependencies among unaligned observations. To represent and learn both dependencies from original observations in a unified form, we propose HyperIMTS, a Hypergraph neural network for Irregular Multivariate Time Series forecasting. Observed values are converted to nodes in the hypergraph, interconnected by temporal and variable hyperedges to enable message passing among all observations. Through irregularity-aware message passing, HyperIMTS captures variable dependencies in a time-adaptive way to achieve accurate forecasting. Experiments demonstrate HyperIMTS's competitive performance among state-of-the-art models in IMTS forecasting with low computational cost.
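The node-hyperedge construction can be illustrated on a toy example (our sketch; names and shapes are illustrative): each observed (time, variable) value becomes a node, with one hyperedge per timestamp and one per variable, so no padding is ever introduced for missing observations.

```python
import numpy as np

# four irregular observations of variables "a" and "b" at timestamps 0, 1, 2
obs = [(0, "a"), (0, "b"), (1, "a"), (2, "b")]
edges = [("t", t) for t in sorted({t for t, _ in obs})] + \
        [("v", v) for v in sorted({v for _, v in obs})]

# node-by-hyperedge incidence matrix: each observation joins exactly
# one temporal hyperedge and one variable hyperedge
H = np.zeros((len(obs), len(edges)))
for i, (t, v) in enumerate(obs):
    H[i, edges.index(("t", t))] = 1
    H[i, edges.index(("v", v))] = 1
```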
Paperid:397
Authors:Hyeonah Kim · Sanghyeok Choi · Jiwoo Son · Jinkyoo Park · Changhyun Kwon
Abstract:
Effective search methods are crucial for improving the performance of deep generative models at test time. In this paper, we introduce a novel test-time search method, Neural Genetic Search (NGS), which incorporates the evolutionary mechanism of genetic algorithms into the generation procedure of deep models. The core idea behind NGS is its crossover, which is defined as parent-conditioned generation using trained generative models. This approach offers a versatile and easy-to-implement search algorithm for deep generative models. We demonstrate the effectiveness and flexibility of NGS through experiments across three distinct domains: routing problems, adversarial prompt generation for language models, and molecular design.
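The NGS loop can be sketched generically as follows, where generate(parents) stands in for parent-conditioned sampling from a trained generative model and generate(None) for unconditional sampling (a hypothetical API, not the paper's exact interface):

```python
import random

def neural_genetic_search(generate, fitness, pop_size=8, generations=5, seed=0):
    # genetic-algorithm-style test-time search: keep the fittest half as
    # parents, then fill the population with parent-conditioned samples
    rng = random.Random(seed)
    population = [generate(None) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = [generate(rng.sample(parents, 2))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)
```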
Paperid:398
Authors:Haocheng Xi · Shuo Yang · Yilong Zhao · Chenfeng Xu · Muyang Li · Xiuyu Li · Yujun Lin · Han Cai · Jintao Zhang · Dacheng Li · Jianfei Chen · Ion Stoica · Kurt Keutzer · Song Han
Abstract:
Diffusion Transformers (DiTs) dominate video generation, but their high computational cost severely limits real-world applicability, usually requiring tens of minutes to generate a few seconds of video even on high-performance GPUs. This inefficiency primarily arises from the quadratic computational complexity of 3D full attention with respect to the context length. In this paper, we propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D full attention to boost inference efficiency. We reveal that the attention heads can be dynamically classified into two groups depending on distinct sparse patterns: (1) Spatial Head, where only spatially-related tokens within each frame dominate the attention output, and (2) Temporal Head, where only temporally-related tokens across different frames dominate. Based on this insight, SVG proposes an online profiling strategy to capture the dynamic sparse patterns and predicts the type of attention head. Combined with a novel hardware-efficient tensor layout transformation and customized kernel implementations, SVG achieves up to 2.28$\times$ and 2.33$\times$ end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality. Our code will be open-sourced upon publication.
Paperid:399
Authors:Haoyun Jiang · Haolin li · jianwei zhang · Fei Huang · Qiang Hu · Minmin Sun · Shuai Xiao · Yong Li · Junyang Lin · Jiangchao Yao
Abstract:
Large language models (LLMs) have demonstrated strong capabilities in handling long-context tasks, but processing such long contexts remains challenging due to the substantial memory requirements and inference latency. In this work, we discover that certain attention heads exhibit sequential consistency in their attention patterns, which can be persistently identified using a coefficient-of-variation-based algorithm. Inspired by this observation, we propose CateKV, a hybrid KV cache method that retains only critical token information for consistent heads, thereby reducing KV cache size and computational overhead, while preserving the majority of KV pairs in adaptive heads to ensure high accuracy. We show the unique characteristics of our algorithm and its extension with existing acceleration methods. Comprehensive evaluations on long-context benchmarks show that, while maintaining accuracy comparable to full attention, CateKV reduces memory usage by up to $2.72\times$ and accelerates decoding by $2.18\times$ in single-sample inputs, and boosts throughput by $3.96\times$ in batch scenarios.
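A coefficient-of-variation test for head consistency might be sketched as follows (our illustration of the general idea; the paper's exact per-head statistic may differ):

```python
import numpy as np

def consistent_heads(stats, threshold=0.5):
    # `stats` holds one summary value per head per decoding step
    # (shape: heads x steps); heads whose pattern barely changes across
    # steps have a low coefficient of variation and count as "consistent"
    cv = stats.std(axis=1) / (stats.mean(axis=1) + 1e-8)
    return cv < threshold
```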
Paperid:400
Authors:Shentong Mo · Sukmin Yun
Abstract:
Generative models have made it possible to synthesize highly realistic images, potentially providing an abundant data source for training machine learning models. Despite the advantages of these synthetic data sources, the indiscriminate use of generated images as real images for training can even cause mode collapse due to modality discrepancies between real and synthetic domains. In this paper, we propose a novel framework for the discriminative use of generated images, coined \textit{GMAIL}, that explicitly treats generated images as a separate modality from real images. Instead of indiscriminately replacing real images with generated ones in the pixel space, our approach bridges the two distinct modalities in the same latent space through a multimodal learning approach. To be specific, we first fine-tune a model exclusively on generated images using a cross-modality alignment loss and then employ this aligned model to further train various vision-language models with generated images. By aligning the two modalities, our approach effectively leverages the benefits of recent advances in generative models, thereby boosting the effectiveness of generated image learning across a range of vision-language tasks. Our framework can be easily incorporated with various vision-language models, and we demonstrate its efficacy through extensive experiments. For example, our framework significantly improves performance on image captioning, zero-shot image retrieval, zero-shot image classification, and long caption retrieval tasks. It also shows positive generated data scaling trends and notable enhancements in the captioning performance of the large multimodal model, LLaVA.
Paperid:401
Authors:Jinghang Yue · Jing Wang · Lu Zhang · Shuo Zhang · Da Li · Zhaoyang Ma · Youfang Lin
Abstract:
Explaining deep models for time-series data is crucial for identifying key patterns in sensitive domains, such as healthcare and finance. However, due to the lack of a unified optimization criterion, existing explanation methods often suffer from redundancy and incompleteness, where irrelevant patterns are included or key patterns are missed in explanations. To address this challenge, we propose the Optimal Information Retention Principle, in which conditional mutual information defines minimizing redundancy and maximizing completeness as optimization objectives. We then derive the corresponding objective function theoretically. As a practical realization, we introduce ORTE, an explanation framework that learns a binary mask to eliminate redundant information while mining temporal patterns of explanations. We decouple the discrete mapping process to ensure the stability of gradient propagation, while employing contrastive learning to achieve precise filtering of explanatory patterns through the mask, thereby realizing a trade-off between low redundancy and high completeness. Extensive quantitative and qualitative experiments on synthetic and real-world datasets demonstrate that the proposed principle significantly improves the accuracy and completeness of explanations compared to baseline methods. The code is available at*.
Paperid:402
Authors:Meng Chen · Hongwei Jia · Zechen Li · Weiming Huang · Wenzhen Jia · Kai Zhao · Hongjun Dai
Abstract:
Urban region embeddings have recently seen significant progress. However, existing methods remain largely restricted to a single city, which limits cross-city knowledge transfer and the reuse of learned embeddings for downstream tasks (e.g., population estimation) in other cities. To overcome this limitation, we propose Consistency Region Embedding (CoRE), a unified framework that integrates region embedding learning with cross-city latent space alignment. CoRE embeds regions from different cities into distinct latent spaces, then aligns their latent space manifolds while ensuring fine-grained compatibility. This alignment enables region representations from different cities to be directly applied within a shared latent space, facilitating reuse of the model in other cities. Extensive experiments demonstrate that CoRE outperforms state-of-the-art baselines, underscoring its effectiveness in enabling knowledge transfer through aligned latent spaces.
Paperid:403
Authors:Fan Xu · Cheng Xin · Xin Ding · Jie Gao · Jiaxin Ding
Abstract:
Graph Neural Networks (GNNs) have shown remarkable performance in various scientific domains, but their lack of interpretability limits their applicability in critical decision-making processes. Recently, intrinsically interpretable GNNs have been studied to provide insights into model predictions by identifying rationale substructures in graphs. However, existing methods face challenges when the underlying rationale subgraphs are complicated and variable. To address this challenge, we propose TOPING, a novel topological framework for interpretable GNNs that leverages persistent homology to identify persistent rationale subgraphs. Our method introduces a rationale filtration learning technique that models the generating procedure of rationale subgraphs, and enforces the persistence of the topological gap between rationale subgraphs and complement random graphs through a novel self-adjusted topological constraint, the topological discrepancy. We show that our topological discrepancy is a lower bound of a Wasserstein distance on graph distributions under the Gromov-Hausdorff metric. We provide theoretical guarantees showing that our loss is uniquely optimized by the ground truth under certain conditions. Through extensive experiments on various synthetic and real datasets, we demonstrate that TOPING effectively addresses key challenges in interpretable GNNs, including handling variform rationale subgraphs, balancing performance with interpretability, and avoiding spurious correlations. Experimental results show that our approach improves the state of the art in both predictive accuracy and interpretation quality.
Paperid:404
Authors:Harikrishna Metta · Venkatesh Babu Radhakrishnan
Abstract:
In machine learning, separating data into classes is a fundamental problem. This work presents a mathematical framework around classes to deepen the understanding of them. Classes are defined as vectors in a vector space, where addition corresponds to the union of classes and scalar multiplication resembles the set complement of classes. The zero vector in this space corresponds to a class called the Zero-Vector Class. This discovery enables numerous applications. One such application, termed 'clear learning' in this work, focuses on learning the true nature (manifold) of the data instead of merely learning a boundary sufficient for classification. Another application, called 'unary class learning', involves learning a single class in isolation rather than learning by comparing two or more classes. Additionally, 'set operations on classes' is another application highlighted in this work. Furthermore, continual learning of classes is facilitated by smaller networks. The Zero-Vector Class enables neural networks to learn only the data manifold; therefore, it can also be used to generate new data. Results for the key applications are shown using the MNIST dataset. To further strengthen the claims, some results are also produced on the CIFAR-10 dataset.
Paperid:405
Authors:Ahmad Beirami · Alekh Agarwal · Jonathan Berant · Alexander D'Amour · Jacob Eisenstein · Chirag Nagpal · Ananda Suresh
Abstract:
A simple and effective method for the inference-time alignment of generative models is the best-of-$n$ policy, where $n$ samples are drawn from a reference policy, ranked based on a reward function, and the highest-ranking one is selected. A commonly used analytical expression in the literature claims that the KL divergence between the best-of-$n$ policy and the reference policy is equal to $\log (n) - (n-1)/n.$ We disprove the validity of this claim and show that it is instead an upper bound on the actual KL divergence. We also explore the tightness of this upper bound in different regimes, propose a new estimator for the KL divergence, and empirically show that it provides a tight approximation. We also show that the win rate of the best-of-$n$ policy against the reference policy is upper bounded by $n/(n+1)$ and derive bounds on the tightness of this characterization. We conclude by analyzing the tradeoffs between the win rate and KL divergence of the best-of-$n$ alignment policy, which demonstrate that very good tradeoffs are achievable with $n < 1000$.
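The claim can be checked numerically. The sketch below (not the paper's estimator) computes the exact best-of-$n$ distribution for a small discrete reference policy, under the simplifying assumption that outcomes have distinct rewards and are sorted by increasing reward, and verifies that the actual KL divergence falls below $\log(n) - (n-1)/n$:

```python
import numpy as np

def best_of_n_exact(p, n):
    """Exact best-of-n distribution over discrete outcomes.

    p: reference probabilities, assumed sorted by increasing reward
    with distinct reward values. The best of n i.i.d. draws is the one
    with the highest reward, so pi_n(y) = F(y)^n - F(y^-)^n, where F is
    the reference CDF in reward order.
    """
    F = np.cumsum(p)
    Fm = np.concatenate(([0.0], F[:-1]))
    return F**n - Fm**n

def kl(q, p):
    """KL divergence sum_y q(y) log(q(y)/p(y))."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

p = np.array([0.5, 0.3, 0.2])        # reference policy, rewards increasing
n = 4
pi_n = best_of_n_exact(p, n)
actual = kl(pi_n, p)                 # true KL of best-of-n vs reference
claimed = np.log(n) - (n - 1) / n    # the commonly cited expression
# 'claimed' upper-bounds 'actual', consistent with the paper's result.
```

For this example the bound is not tight, illustrating why an equality claim fails in general.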
Paperid:406
Authors:Mao-Lin Luo · Zi-Hao Zhou · Tong Wei · Min-Ling Zhang
Abstract:
Continual learning with vision-language models like CLIP offers a pathway toward scalable machine learning systems by leveraging their transferable representations. Existing CLIP-based methods adapt the pre-trained image encoder by adding multiple sets of learnable parameters, with each task using a partial set of parameters. This requires selecting the expected parameters for input images during inference, which is prone to errors that degrade performance. To address this problem, we introduce LADA (Label-specific ADApter). Instead of partitioning parameters across tasks, LADA appends lightweight, label-specific memory units to the frozen CLIP image encoder, enabling discriminative feature generation by aggregating task-agnostic knowledge. To prevent catastrophic forgetting, LADA employs feature distillation for seen classes, preventing their features from being interfered with by new classes. Positioned after the image encoder, LADA prevents gradient flow to the frozen CLIP parameters, ensuring efficient training. Extensive results show that LADA achieves state-of-the-art performance in continual learning settings. The supplementary material includes source code for reproducibility.
Paperid:407
Authors:Jiaqi Zhu · Shaofeng Cai · Shen · Gang Chen · Fang Deng · Beng Chin Ooi
Abstract:
Machine learning has demonstrated transformative potential for database operations, such as query optimization and in-database data analytics. However, dynamic database environments, characterized by frequent updates and evolving data distributions, introduce concept drift, which leads to performance degradation for learned models and limits their practical applicability. Addressing this challenge requires efficient frameworks capable of adapting to shifting concepts while minimizing the overhead of retraining or fine-tuning. In this paper, we propose FLAIR, an online adaptation framework that introduces a new paradigm called \textit{in-context adaptation} for learned database operations. FLAIR leverages an inherent property of data systems, i.e., the immediate availability of execution results for predictions, to enable dynamic context construction. By formalizing adaptation as $f:(\mathbf{x} | \mathcal{C}_t) \to \mathbf{y}$, with $\mathcal{C}_t$ representing a dynamic context memory, FLAIR delivers predictions aligned with the current concept, eliminating the need for runtime parameter optimization. To achieve this, FLAIR integrates two key modules: a Task Featurization Module for encoding task-specific features into standardized representations, and a Dynamic Decision Engine, pre-trained via Bayesian meta-training, that adapts seamlessly using contextual information at runtime. Extensive experiments across key database tasks demonstrate that FLAIR outperforms state-of-the-art baselines, achieving up to $5.2\times$ faster adaptation and reducing error by 22.5\% for cardinality estimation.
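The adaptation paradigm $f:(\mathbf{x} | \mathcal{C}_t) \to \mathbf{y}$ can be illustrated with a deliberately minimal sketch. A database observes the true result of each prediction right after execution, so a bounded memory of recent (features, outcome) pairs serves as the context $\mathcal{C}_t$, and a new input is predicted from its nearest stored neighbors without any parameter updates. The class below is an illustrative stand-in; FLAIR's actual modules (task featurization, a Bayesian meta-trained decision engine) are far richer.

```python
from collections import deque

class ContextMemory:
    """Toy in-context adaptation: predict f(x | C_t) from a bounded
    memory of recent execution results, with no runtime training."""

    def __init__(self, capacity=64, k=3):
        self.mem = deque(maxlen=capacity)  # C_t: evicts stale concepts
        self.k = k

    def observe(self, x, y):
        # Execution result immediately becomes part of the context.
        self.mem.append((x, y))

    def predict(self, x):
        # Average the outcomes of the k nearest stored examples.
        nearest = sorted(self.mem, key=lambda p: abs(p[0] - x))[: self.k]
        return sum(y for _, y in nearest) / len(nearest)

m = ContextMemory(capacity=2, k=1)
for x in [1.0, 2.0, 3.0]:
    m.observe(x, 10 * x)       # old concept: y = 10x
m.observe(2.0, 50 * 2.0)       # concept drift: y = 50x enters C_t
```

Because the memory is bounded, predictions for x = 2.0 now follow the drifted concept rather than the stale one.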
Paperid:408
Authors:Christian Lange · Max Hamilton · Elijah Cole · Alexander Shepard · Samuel Heinrich · Angela Zhu · Subhransu Maji · Grant Horn · Oisin Mac Aodha
Abstract:
Knowing where a particular species can or cannot be found on Earth is crucial for ecological research and conservation efforts. By mapping the spatial ranges of all species, we would obtain deeper insights into how global biodiversity is affected by climate change and habitat loss. However, accurate range estimates are only available for a relatively small proportion of all known species. For the majority of the remaining species, we often only have a small number of records denoting the spatial locations where they have previously been observed. We outline a new approach for few-shot species range estimation to address the challenge of accurately estimating the range of a species from limited data. During inference, our model takes a set of spatial locations as input, along with optional metadata such as text or an image, and outputs a species encoding that can be used to predict the range of a previously unseen species in a feed-forward manner. We validate our method on two challenging benchmarks, where we obtain state-of-the-art range estimation performance, in a fraction of the compute time, compared to recent alternative approaches.
Paperid:409
Authors:Daniel Zilberg · Ron Levie
Abstract:
We propose PieClam (Prior Inclusive Exclusive Cluster Affiliation Model): a graph autoencoder in which nodes are embedded into a code space by an algorithm that maximizes the log-likelihood of the decoded graph. PieClam is a community affiliation model that extends well-known methods like BigClam in two main ways. First, instead of the decoder being defined solely via pairwise interactions between the nodes in the code space, we also incorporate a learned prior on the distribution of nodes in the code space, turning our method into a graph generative model. Second, we generalize the notion of communities by allowing not only sets of nodes with strong connectivity, which we call inclusive communities, but also sets of nodes with strong disconnection, which we call exclusive communities. By introducing a new graph similarity measure, called the log cut distance, we show that PieClam is a universal autoencoder, able to uniformly approximately reconstruct any graph. Our method is shown to obtain competitive performance in graph anomaly detection and link prediction benchmarks.
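The base decoder that PieClam extends is easy to sketch. BigClam models the probability of an edge from nonnegative community affiliations as $P(u \sim v) = 1 - \exp(-F_u \cdot F_v)$; the snippet below shows only this pairwise decoder (PieClam additionally learns a prior over the code space and allows exclusive communities, which are omitted here):

```python
import numpy as np

def bigclam_edge_prob(F):
    """Pairwise edge probabilities of a BigClam-style decoder.

    F: (num_nodes, num_communities) nonnegative affiliation matrix.
    P(u ~ v) = 1 - exp(-F_u . F_v): nodes sharing strong affiliations
    are likely connected.
    """
    S = F @ F.T                  # pairwise affiliation inner products
    P = 1.0 - np.exp(-S)
    np.fill_diagonal(P, 0.0)     # no self-loops
    return P

# Nodes 0 and 1 share a community; node 2 belongs to a different one.
F = np.array([[2.0, 0.0],
              [1.5, 0.0],
              [0.0, 2.0]])
P = bigclam_edge_prob(F)
```

Here P[0, 1] is close to 1 while P[0, 2] is exactly 0, since nodes 0 and 2 share no affiliation mass.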
Paperid:410
Authors:Sarthak Mittal · Eric Elmoznino · Léo Gagnon · Sangnie Bhardwaj · Guillaume Lajoie · Dhanya Sridhar
Abstract:
Large autoregressive models like Transformers can solve tasks through in-context learning (ICL) without learning new weights, suggesting avenues for efficiently solving new tasks. For many tasks, e.g., linear regression, the data factorizes: examples are independent given a task latent that generates the data, e.g., linear coefficients. While an optimal predictor leverages this factorization by inferring task latents, it is unclear if Transformers implicitly do so or instead exploit heuristics and statistical shortcuts through attention layers. In this paper, we systematically investigate the effect of explicitly inferring task latents by minimally modifying the Transformer architecture with a bottleneck to prevent shortcuts and incentivize structured solutions. We compare it against standard Transformers across various ICL tasks and find that contrary to intuition and recent works, there is little discernible difference between the two; biasing towards task-relevant latent variables does not lead to better out-of-distribution performance, in general. Curiously, we find that while the bottleneck effectively learns to extract latent task variables from context, downstream processing struggles to utilize them for robust prediction. Our study highlights the intrinsic limitations of Transformers in achieving structured ICL solutions that generalize, and shows that while inferring the right latents aids interpretability, it is not sufficient to alleviate this problem.
Paperid:411
Authors:Fan Nie · Xiaotian Hou · Shuhang Lin · James Zou · Huaxiu Yao · Linjun Zhang
Abstract:
The propensity of large language models (LLMs) to generate hallucinations and non-factual content undermines their reliability in high-stakes domains, where rigorous control over Type I errors (the conditional probability of incorrectly classifying hallucinations as truthful content) is essential. Despite its importance, formal verification of LLM factuality with such guarantees remains largely unexplored. In this paper, we introduce FactTest, a novel framework that statistically assesses whether an LLM can provide correct answers to given questions with high-probability correctness guarantees. We formulate hallucination detection as a hypothesis testing problem to enforce an upper bound on Type I errors at user-specified significance levels. Notably, we prove that FactTest also ensures strong Type II error control under mild conditions and can be extended to maintain its effectiveness when covariate shifts exist. Our approach is distribution-free and works for any number of human-annotated samples. It is model-agnostic and applies to any black-box or white-box language model. Extensive experiments on question-answering (QA) benchmarks demonstrate that FactTest effectively detects hallucinations and enables LLMs to abstain from answering unknown questions, leading to an accuracy improvement of over 40%.
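Distribution-free Type I error control of this flavor is often built from order statistics of calibration scores. The sketch below is a generic conformal-style construction, not FactTest's actual procedure: given model confidence scores on calibration questions the model answered incorrectly (hallucinations), answering only when a new score exceeds the $\lceil(m+1)(1-\alpha)\rceil$-th smallest calibration score bounds the probability of answering on a hallucination by $\alpha$, assuming exchangeable scores. All names and the score semantics are assumptions for illustration.

```python
import math

def answer_threshold(halluc_scores, alpha=0.05):
    """Distribution-free confidence threshold (conformal-style sketch).

    halluc_scores: confidence scores on calibration hallucinations.
    Returns a threshold t such that answering only when score > t keeps
    the Type I error (answering on a hallucination) at most alpha, for
    exchangeable scores. Works for any number m of labeled samples;
    with too few samples for the requested alpha, always abstain.
    """
    m = len(halluc_scores)
    k = math.ceil((m + 1) * (1 - alpha))
    if k > m:
        return float("inf")  # cannot certify level alpha: abstain
    return sorted(halluc_scores)[k - 1]

scores = [i / 100 for i in range(1, 100)]  # 99 calibration scores
t = answer_threshold(scores, alpha=0.05)
```

Note the graceful degradation: with a single calibration sample and alpha = 0.05, the threshold is infinite, so the model abstains rather than overclaim a guarantee.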
Paperid:412
Authors:Xiuyuan Wang · Chaochao Chen · Weiming Liu · Xinting Liao · Fan Wang · Xiaolin Zheng
Abstract:
With growing privacy concerns and the enforcement of data protection regulations, machine unlearning has emerged as a promising approach for removing the influence of forget data while maintaining model performance on retain data. However, most existing unlearning methods require access to the original training data, which is often impractical due to privacy policies, storage constraints, and other limitations. This gives rise to the challenging task of source-free unlearning, where unlearning must be accomplished without access to the original training data. The few existing source-free unlearning methods rely on knowledge distillation and model retraining, which impose substantial computational costs. In this work, we propose the Data Synthesis-based Discrimination-Aware (DSDA) unlearning framework, which enables efficient source-free unlearning through two stages: (1) Accelerated Energy-Guided Data Synthesis (AEGDS), which employs Langevin dynamics to model the training data distribution while integrating Runge–Kutta methods and momentum to enhance efficiency. (2) Discrimination-Aware Multitask Optimization (DAMO), which refines the feature distribution of retain data and mitigates gradient conflicts among multiple unlearning objectives. Extensive experiments on three benchmark datasets demonstrate that DSDA outperforms existing unlearning methods, validating its effectiveness and efficiency in source-free unlearning.
Paperid:413
Authors:Minjong Yoo · Jinwoo Jang · Sihyung Yoon · Honguk Woo
Abstract:
In embodied AI, a persistent challenge is enabling agents to robustly adapt to novel domains without requiring extensive data collection or retraining. To address this, we present a world model implanting framework (WorMI) that combines the reasoning capabilities of large language models (LLMs) with independently learned, domain-specific world models through test-time composition. By allowing seamless implantation and removal of the world models, the embodied agent's policy achieves and maintains cross-domain adaptability. In the WorMI framework, we employ a prototype-based world model retrieval approach, utilizing efficient trajectory-based abstract representation matching, to incorporate relevant models into test-time composition. We also develop a world-wise compound attention method that not only integrates the knowledge from the retrieved world models but also aligns their intermediate representations with the reasoning model's representation within the agent's policy. This framework design effectively fuses domain-specific knowledge from multiple world models, ensuring robust adaptation to unseen domains. We evaluate our WorMI on the VirtualHome and ALFWorld benchmarks, demonstrating superior zero-shot and few-shot performance compared to several LLM-based approaches across a range of unseen domains. These results highlight the framework’s potential for scalable, real-world deployment in embodied agent scenarios where adaptability and data efficiency are essential.
Paperid:414
Authors:Xiaoyu Yang · Lijian Xu · Hongsheng Li · Shaoting Zhang
Abstract:
This paper proposes a scalable and straightforward pre-training paradigm for efficient visual conceptual representation called occluded image contrastive learning (OCL). Our OCL approach is simple: we randomly mask patches to generate different views within an image and contrast them among a mini-batch of images. The core idea behind OCL consists of two designs. First, masked tokens have the potential to significantly diminish the conceptual redundancy inherent in images, and create distinct views with substantial fine-grained differences at the semantic concept level instead of the instance level. Second, contrastive learning is adept at extracting high-level semantic conceptual features during pre-training, circumventing the high-frequency interference and additional costs associated with image reconstruction. Importantly, OCL learns highly semantic conceptual representations efficiently without relying on hand-crafted data augmentations or additional auxiliary modules. Empirically, OCL demonstrates high scalability with Vision Transformers: a ViT-L/16 can complete pre-training in 133 hours using only 4 A100 GPUs, achieving 85.8\% accuracy in downstream fine-tuning tasks. Code is available at https://anonymous.4open.science/r/OLRS/.
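The view-creation step can be sketched compactly. The snippet below only illustrates occlusion-based view generation for one image (the encoder and the InfoNCE loss over the mini-batch are omitted); the `keep` ratio, shapes, and function name are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_views(patch_tokens, keep=0.25):
    """Create two occluded views of one image by random patch masking.

    patch_tokens: (num_patches, dim) patch embeddings of a single
    image. Each view keeps a different random subset of patches, so
    the two views differ at the semantic-concept level rather than
    being two augmentations of the same instance.
    """
    n = patch_tokens.shape[0]
    k = max(1, int(n * keep))
    idx1 = rng.choice(n, size=k, replace=False)
    idx2 = rng.choice(n, size=k, replace=False)
    return patch_tokens[idx1], patch_tokens[idx2]

tokens = rng.normal(size=(196, 768))  # 14x14 patches, ViT-style dim
v1, v2 = masked_views(tokens)
```

The two views would then be encoded and contrasted against views of the other images in the mini-batch.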
Paperid:415
Authors:Bin Luo · Yuwen Huang · Jonathan Allcock · Xiaojun Lin · Shengyu Zhang · John C. S. Lui
Abstract:
In this work, we design quantum algorithms that are more efficient than classical algorithms to solve time-dependent and finite-horizon Markov Decision Processes (MDPs) in two distinct settings: (1) In the exact dynamics setting, where the agent has full knowledge of the environment's dynamics (i.e., transition probabilities), we prove that our algorithm **QVI-1** achieves a quadratic speedup in the size of the action space $(A)$ compared with the classical value iteration algorithm for computing the optimal policy ($\pi^{\*}$) and the optimal V-value function ($V_{0}^{\*}$). Furthermore, our algorithm **QVI-2** provides an additional speedup in the size of the state space $(S)$ when obtaining near-optimal policies and V-value functions. (2) In the generative model setting, where samples from the environment are accessible in quantum superposition, we prove that our algorithms **QVI-3** and **QVI-4** achieve improvements in sample complexity over the state-of-the-art (SOTA) classical algorithm in terms of $A$, estimation error $(\epsilon)$, and time horizon $(H)$. More importantly, we prove quantum lower bounds to show that **QVI-3** and **QVI-4** are nearly minimax optimal.
Paperid:416
Authors:Rina Bao · Shilong Dong · Zhenfang Chen · Sheng He · Patricia Ellen Grant · Yangming Ou
Abstract:
Medical Visual Question Answering (MVQA) requires AI models to answer questions related to medical images, offering significant potential to assist medical professionals in evaluating and diagnosing diseases, thereby improving early interventions. However, existing MVQA datasets primarily focus on basic questions regarding visual perception and pattern recognition, without addressing the more complex questions that are critical in clinical diagnosis and decision-making. This paper introduces a new dataset designed for professional-level medical reasoning, simulating the decision-making process. We achieve this by collecting MRI and clinical data related to Hypoxic-Ischemic Encephalopathy (HIE, a neonatal brain injury that occurs in 1-5/1000 neonates) over a decade of effort, enriched with expert annotations and insights. Building on this data, we generate clinical question-answer pairs and MRI interpretations to enable comprehensive diagnosis, interpretation, and prediction of neurocognitive outcomes. Our evaluation of current large vision-language models (LVLMs) shows limited performance on this benchmark, highlighting both the challenges of the task and the importance of this benchmark for advancing medical AI. Furthermore, we propose a novel "Clinical Graph of Thoughts" (CGoT) model, which integrates domain-specific medical knowledge and clinical reasoning processes with the interpretive abilities of LVLMs. The model demonstrates promising results, achieving around 15% absolute gain on the most important neurocognitive outcome task, while the dataset still reveals substantial opportunities for further research innovation.
Paperid:417
Authors:Zhenyu Hou · Xin Lv · Rui Lu · Jiajie Zhang · Yujiang Li · Zijun Yao · Juanzi Li · Jie Tang · Yuxiao Dong
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. While reinforcement learning (RL) holds promise for enabling self-exploration, recent attempts yield modest improvements in complex reasoning. In this paper, we present T1, which scales RL by encouraging exploration and enables an understanding of inference scaling. We first initialize the LLM using synthesized chain-of-thought data that integrates trial-and-error and self-verification. To scale RL training, we promote increased sampling diversity through oversampling. We demonstrate that T1 with open LLMs as its base exhibits inference scaling behavior and achieves superior performance on challenging math reasoning benchmarks. More importantly, we present a simple strategy to examine inference scaling, where increased inference budgets directly lead to T1's better performance without any additional verification.
Paperid:418
Authors:Herman Chau · Helen Jenne · Davis Brown · Jesse He · Mark Raugas · Sara Billey · Henry Kvinge
Abstract:
With recent dramatic increases in AI system capabilities, there has been growing interest in utilizing machine learning for reasoning-heavy, quantitative tasks, particularly mathematics. While there are many resources capturing mathematics at the high-school, undergraduate, and graduate level, there are far fewer resources available that align with the level of difficulty and open-endedness encountered by professional mathematicians working on open problems. To address this, we introduce a new collection of datasets, the Algebraic Combinatorics Dataset Repository (ACD Repo), representing either foundational results or open problems in algebraic combinatorics, a subfield of mathematics that studies discrete structures arising from abstract algebra. Further differentiating our dataset collection is the fact that it targets the conjecturing process. Each dataset includes a research-level question and a large collection of examples (up to 10M in some cases) from which conjectures should be generated. We describe all nine datasets, the different ways machine learning models can be applied to them (e.g., training with narrow models followed by mechanistic interpretability or program synthesis with LLMs), and document some examples where AI systems were able to re-discover non-trivial known results.
Paperid:419
Authors:Andrew Wagenmaker · Zhiyuan Zhou · Sergey Levine
Abstract:
Developing autonomous agents that quickly explore an environment and adapt their behavior online is a canonical challenge in robotics and machine learning. While humans are able to achieve such fast online exploration and adaptation, often acquiring new information and skills in only a handful of interactions, existing algorithmic approaches tend to rely on random exploration and slow, gradient-based behavior updates. How can we endow autonomous agents with capabilities on par with humans? Taking inspiration from recent progress on both in-context learning and large-scale behavioral cloning, in this work we propose training agents to internalize what it means to explore and adapt in-context over the space of ''expert'' behaviors. To achieve this, given access to a dataset of expert demonstrations, we train a long-context generative model to predict expert actions conditioned on a context of past observations and a measure of how ''exploratory'' the expert's behaviors are relative to this context. This enables the model to not only mimic the behavior of an expert, but also, by feeding its past history of interactions into its context, to select different expert behaviors than what have been previously selected, thereby allowing for fast online adaptation and targeted, ''expert-like'' exploration. We demonstrate the effectiveness of our method in both simulated locomotion and manipulation settings, as well as on real-world robotic manipulation tasks, illustrating its ability to learn adaptive, exploratory behavior.
Paperid:420
Authors:Xinzhi Zhang · Hohei Chan · Deheng Ye · Yi Cai · Mengchen Zhao
Abstract:
The ability of agents to collaborate with previously unknown teammates on the fly, known as ad hoc teamwork (AHT), is crucial in many real-world applications. Existing approaches to AHT require online interactions with the environment and some carefully designed teammates. However, these prerequisites can be infeasible in practice. In this work, we extend the AHT problem to the offline setting, where the policy of the ego agent is directly learned from a multi-agent interaction dataset. We propose a hierarchical sequence modeling framework called TAGET that addresses critical challenges in the offline setting, including limited data, partial observability and online adaptation. The core idea of TAGET is to dynamically predict teammate-aware rewards-to-go and sub-goals, so that the ego agent can adapt to the changes of teammates' behaviors in real time. Extensive experimental results show that TAGET significantly outperforms existing solutions to AHT in the offline setting.
Paperid:421
Authors:Guoqing Zhang · Shichao Kan · Fanghui Zhang · Wanru Xu · Yue Zhang · Yigang Cen
Abstract:
Scene Graph Generation (SGG) is a fundamental task in visual understanding, aimed at providing more precise local detail comprehension for downstream applications. Existing SGG methods often overlook the diversity of predicate representations and the consistency among similar predicates when dealing with long-tail distributions. As a result, the model's decision layer fails to effectively capture details from the tail end, leading to biased predictions. To address this, we propose a Noise-Guided Predicate Representation Extraction and Diffusion-Enhanced Discretization (NoDIS) method. On the one hand, expanding the predicate representation space enhances the model's ability to learn both common and rare predicates, thus reducing prediction bias caused by data scarcity. We propose a conditional diffusion model that reconstructs features and increases the diversity of representations for same-category predicates. On the other hand, independent predicate representations in the decision phase increase the learning complexity of the decision layer, making accurate predictions more challenging. To address this issue, we introduce a discretization mapper that learns consistent representations among similar predicates, reducing the learning difficulty and decision ambiguity in the decision layer. To validate the effectiveness of our method, we integrate NoDIS with various SGG baseline models and conduct experiments on multiple datasets. The results consistently demonstrate superior performance.
Paperid:422
Authors:Kajetan Schweighofer · Adrián Arnaiz-Rodríguez · Sepp Hochreiter · Nuria Oliver
Abstract:
Ensembles of Deep Neural Networks, Deep Ensembles, are widely used as a simple way to boost predictive performance. However, their impact on algorithmic fairness is not well understood yet. Algorithmic fairness examines how a model's performance varies across socially relevant groups defined by protected attributes such as age, gender, or race. In this work, we explore the interplay between the performance gains from Deep Ensembles and fairness. Our analysis reveals that they unevenly favor different groups, a phenomenon we term the disparate benefits effect. We empirically investigate this effect using popular facial analysis and medical imaging datasets with protected group attributes and find that it affects multiple established group fairness metrics, including statistical parity and equal opportunity. Furthermore, we identify that the per-group differences in predictive diversity of ensemble members can explain this effect. Finally, we demonstrate that the classical Hardt post-processing method is particularly effective at mitigating the disparate benefits effect of Deep Ensembles by leveraging their better-calibrated predictive distributions.
Paperid:423
Authors:Chunming He · Rihan Zhang · Fengyang Xiao · Chengyu Fang · Longxiang Tang · Yulun Zhang · Linghe Kong · Deng-Ping Fan · Kai Li · Sina Farsiu
Abstract:
Concealed object segmentation (COS) is a challenging problem that focuses on identifying objects that are visually blended into their background. Existing methods often employ reversible strategies to concentrate on uncertain regions but focus only on the mask level, overlooking the value of the RGB domain. To address this, we propose a Reversible Unfolding Network (RUN) in this paper. RUN formulates the COS task as a foreground-background separation process and incorporates an extra residual sparsity constraint to minimize segmentation uncertainties. The optimization solution of the proposed model is unfolded into a multistage network, allowing the original fixed parameters to become learnable. Each stage of RUN consists of two reversible modules: the Segmentation-Oriented Foreground Separation (SOFS) module and the Reconstruction-Oriented Background Extraction (ROBE) module. SOFS applies the reversible strategy at the mask level and introduces Reversible State Space to capture non-local information. ROBE extends this to the RGB domain, employing a reconstruction network to address conflicting foreground and background regions identified as distortion-prone areas, which arise from their separate estimation by independent modules. As the stages progress, RUN gradually facilitates reversible modeling of foreground and background in both the mask and RGB domains, reducing false-positive and false-negative regions. Extensive experiments demonstrate the superior performance of RUN and underscore the promise of unfolding-based frameworks for COS and other high-level vision tasks. The source code will be made publicly available.
Paperid:424
Authors:Richeng Jin · Huaiyu Dai
Abstract:
The prevalent distributed machine learning paradigm faces two critical challenges: communication efficiency and data privacy. SIGNSGD provides a simple-to-implement approach with improved communication efficiency by requiring workers to share only the signs of the gradients. However, it fails to converge in the presence of data heterogeneity, and a simple fix is to add Gaussian noise before taking the signs, which leads to the Noisy SIGNSGD algorithm that enjoys competitive performance while significantly reducing the communication overhead. Existing results suggest that Noisy SIGNSGD with additive Gaussian noise has the same privacy guarantee as classic DP-SGD due to the post-processing property of differential privacy, and logistic noise may be a good alternative to Gaussian noise when combined with the sign-based compressor. Nonetheless, discarding the magnitudes in Noisy SIGNSGD leads to information loss, which may intuitively amplify privacy. In this paper, we make this intuition rigorous and quantify the privacy amplification of the sign-based compressor. Particularly, we analytically show that Gaussian noise leads to a smaller estimation error than logistic noise when combined with the sign-based compressor and may be more suitable for distributed learning with heterogeneous data. Then, we further establish the convergence of Noisy SIGNSGD. Finally, extensive experiments are conducted to validate the theoretical results.
Paperid:425
Authors:Huigen Ye · Hua Xu · An Yan · Yaoyang Cheng
Abstract:
Large Neighborhood Search (LNS) is a widely used method for solving largescale Mixed Integer Linear Programming (MILP) problems. The effectiveness of LNS crucially depends on the choice of the search neighborhood. However, existing strategies either rely on expert knowledge or computationally expensive Machine Learning (ML) approaches, both of which struggle to scale effectively for large problems. To address this, we propose LLM-LNS, a novel Large Language Model (LLM)-driven LNS framework for large-scale MILP problems. Our approach introduces a dual-layer self-evolutionary LLM agent to automate neighborhood selection, discovering effective strategies with scant small-scale training data that generalize well to large-scale MILPs. The inner layer evolves heuristic strategies to ensure convergence, while the outer layer evolves evolutionary prompt strategies to maintain diversity. Experimental results demonstrate that the proposed dual-layer agent outperforms state-of-the-art agents such as FunSearch and EOH. Furthermore, the full LLM-LNS framework surpasses manually designed LNS algorithms like ACP, ML-based LNS methods like CL-LNS, and large-scale solvers such as Gurobi and SCIP. It also achieves superior performance compared to advanced ML-based MILP optimization frameworks like GNN&GBDT and Light-MILPopt, further validating the effectiveness of our approach.
Paperid:426
Authors:Jinpeng Chen · Runmin Cong · Yuzhi Zhao · Hongzheng Yang · Guangneng Hu · Horace Ip · Sam Kwong
Abstract:
Multimodal Continual Instruction Tuning (MCIT) aims to enable Multimodal Large Language Models (MLLMs) to incrementally learn new tasks without catastrophic forgetting, thus adapting to evolving requirements. In this paper, we explore the forgetting caused by such incremental training, categorizing it into superficial forgetting and essential forgetting. Superficial forgetting refers to cases where the model’s knowledge may not be genuinely lost, but its responses to previous tasks deviate from expected formats due to the influence of subsequent tasks’ answer styles, making the results unusable. On the other hand, essential forgetting refers to situations where the model provides correctly formatted but factually inaccurate answers, indicating a true loss of knowledge. Assessing essential forgetting necessitates addressing superficial forgetting first, as severe superficial forgetting can conceal the model’s knowledge state. Hence, we first introduce the Answer Style Diversification (ASD) paradigm, which defines a standardized process for data style transformations across different tasks, unifying their training sets into similarly diversified styles to prevent superficial forgetting caused by style shifts. Building on this, we propose RegLoRA to mitigate essential forgetting. RegLoRA stabilizes key parameters where prior knowledge is primarily stored by applying regularization to LoRA’s weight update matrices, enabling the model to retain existing competencies while remaining adaptable to new tasks. Experimental results demonstrate that our overall method, SEFE, achieves state-of-the-art performance.
Paperid:427
Authors:Xiaomin Li · Mingye Gao · Zhiwei Zhang · Jingxuan Fan · Weiyu Li
Abstract:
Reinforcement Learning from Human Feedback (RLHF) is widely used to align models with human preferences, particularly to enhance the safety of responses generated by LLMs. This method traditionally relies on choosing preferred responses from response pairs. However, due to variations in human opinions and the difficulty of making an overall comparison of two responses, there is a growing shift towards a fine-grained annotation approach, assessing responses based on multiple specific metrics or rules. Selecting and applying these rules efficiently while accommodating the diversity of preference data remains a significant challenge. In this paper, we introduce a dynamic approach that adaptively selects the most critical rules for each pair of responses. We develop a mathematical framework that leverages the maximum discrepancy between each pair of responses and theoretically show that this strategy optimizes the mutual information between the rule-based labeling and the hidden ground-truth preferences. We then train an 8B reward model using the adaptively labeled preference dataset and evaluate its performance on RewardBench. As of January 25, 2025, our model achieved the highest safety performance on the leaderboard, outperforming various larger models.
Paperid:428
Authors:Abudukelimu Wuerkaixi · Qizhou Wang · Sen Cui · Wutong Xu · Bo Han · Gang Niu · Masashi Sugiyama · Changshui Zhang
Abstract:
With the growing deployment of large language models (LLMs) across diverse domains, concerns regarding their safety have grown substantially. LLM unlearning has emerged as a pivotal approach to removing harmful or unlawful contents while maintaining utility. Despite increasing interest, the challenges of continual unlearning, which is common in real-world scenarios, remain underexplored. Successive unlearning tasks often lead to intensified utility degradation. To effectively unlearn targeted knowledge while preserving LLM utility, it is essential to minimize changes in model parameters by selectively updating those linked to the target knowledge, thereby ensuring other knowledge remains unaffected. Building on the task vector framework, we propose a new method named ALKN (Adaptive Localization of Knowledge Negation), which uses dynamic masking to sparsify training gradients and adaptively adjusts unlearning intensity based on inter-task relationships. Comprehensive experiments across three well-established LLM unlearning datasets demonstrate that our approach consistently outperforms baseline methods in both unlearning effectiveness and utility retention under continual unlearning settings.
Paperid:429
Authors:Ari Karchmer · Eran Malach
Abstract:
We study the relationship between gradient-based optimization of parametric models (e.g., neural networks) and optimization of linear combinations of random features. Our main result shows that if a parametric model can be learned using mini-batch stochastic gradient descent (bSGD) without making assumptions about the data distribution, then with high probability, the target function can also be approximated using a polynomial-sized combination of random features. The size of this combination depends on the number of gradient steps and numerical precision used in the bSGD process. This finding reveals fundamental limitations of distribution-free learning in neural networks trained by gradient descent, highlighting why making assumptions about data distributions is often crucial in practice. Along the way, we also introduce a new theoretical framework called average probabilistic dimension complexity (adc), which extends the probabilistic dimension complexity developed by Kamath et al. (2020). We prove that adc has a polynomial relationship with statistical query dimension, and use this relationship to demonstrate an infinite separation between adc and standard dimension complexity.
Paperid:430
Authors:Hoigi Seo · Wongi Jeong · Jae-sun Seo · Se Young Chun
Abstract:
Large-scale text encoders in text-to-image (T2I) diffusion models have demonstrated exceptional performance in generating high-quality images from textual prompts. Unlike denoising modules that rely on multiple iterative steps, text encoders require only a single forward pass to produce text embeddings. However, despite their minimal contribution to total inference time and floating-point operations (FLOPs), text encoders demand significantly higher memory usage, up to eight times more than denoising modules. To address this inefficiency, we propose Skip and Re-use layers (Skrr), a simple yet effective pruning strategy specifically designed for text encoders in T2I diffusion models. Skrr exploits the inherent redundancy in transformer blocks by selectively skipping or reusing certain layers in a manner tailored for T2I tasks, thereby reducing memory consumption without compromising performance. Extensive experiments demonstrate that Skrr maintains image quality comparable to the original model even under high sparsity levels, outperforming existing blockwise pruning methods. Furthermore, Skrr achieves state-of-the-art memory efficiency while preserving performance across multiple evaluation metrics, including the FID, CLIP, DreamSim, and GenEval scores.
Paperid:431
Authors:Hongyang Lei · Xiaolong Cheng · Dan Wang · Qi Qin · Huazhen Huang · Yetao Wu · Qingqing Gu · Luo Ji
Abstract:
Current multimodal alignment strategies primarily use single or unified modality encoders while optimizing the alignment in the original token space. Such a framework is easy to implement and readily incorporates pretrained knowledge, but might result in information bias. To deal with such issues, the joint encoding predictive architecture (JEPA) learns the alignment loss in the latent space, with a predictor that converts the input encoding to the output latent space. However, the application of JEPA in multimodal scenarios has so far been limited. In this paper, we introduce M3-Jepa, a scalable multimodal alignment framework, with the predictor implemented by a multi-directional mixture of experts (MoE). We demonstrate with information-theoretic derivations that the framework can maximize the mutual information by alternating the optimization between different uni-directional tasks. Through thoroughly designed experiments, we show that M3-Jepa can obtain state-of-the-art performance on different modalities and tasks, generalize to unseen datasets and domains, and is computationally efficient in training and inference. Our study indicates that M3-Jepa might provide a new paradigm for self-supervised learning and open-world modeling.
Paperid:432
Authors:Jinlong Pang · Na Di · Zhaowei Zhu · Jiaheng Wei · Hao Cheng · Chen Qian · Yang Liu
Abstract:
Recent studies show that in supervised fine-tuning (SFT) of large language models (LLMs), data quality matters more than quantity. While most data cleaning methods concentrate on filtering entire samples, the quality of individual tokens within a sample can vary significantly. After pre-training, even in high-quality samples, patterns or phrases that are not task-related can be redundant or uninformative. Continuing to fine-tune on these patterns may offer limited benefit and even degrade downstream task performance. In this paper, we investigate token quality from a noisy-label perspective and propose a generic token cleaning pipeline for SFT tasks. Our method filters out uninformative tokens while preserving those carrying key task-specific information. Specifically, we first evaluate token quality by examining the influence of model updates on each token, then apply a threshold-based separation. The token influence can be measured in a single pass with a fixed reference model or iteratively with self-evolving reference models. The benefits and limitations of both methods are analyzed theoretically by error upper bounds. Extensive experiments show that our framework consistently improves performance across multiple downstream tasks.
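The threshold-based separation described above can be sketched in a few lines. The per-token losses and the influence proxy below are hypothetical stand-ins, not the paper's exact scoring:

```python
import numpy as np

def clean_tokens(ref_losses, base_losses, keep_ratio=0.7):
    """Threshold-based token separation (a sketch of the idea, not the paper's
    exact procedure): tokens whose loss drops most under a reference model are
    treated as informative; the rest are masked out of the SFT loss."""
    influence = base_losses - ref_losses          # proxy for model-update influence
    cutoff = np.quantile(influence, 1 - keep_ratio)
    return influence >= cutoff                    # True = keep token in the loss

# Hypothetical per-token losses for one six-token sample.
base = np.array([2.1, 0.3, 1.8, 0.4, 2.5, 0.2])
ref  = np.array([0.9, 0.3, 0.6, 0.4, 1.0, 0.2])
mask = clean_tokens(ref, base, keep_ratio=0.5)
print(mask)  # [ True False  True False  True False]
```

Only the three tokens whose losses move under the reference model survive; the SFT loss would then be averaged over the masked positions only.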
Paperid:433
Authors:Shuoyuan Wang · Sharon Li · Hongxin Wei
Abstract:
Confidence calibration is critical for the safe deployment of machine learning models in the real world. However, this issue has not been fully addressed in vision-language models like CLIP, particularly after fine-tuning. In this work, we demonstrate that existing prompt tuning methods usually lead to a trade-off of calibration between base and new classes: the cross-entropy loss used in standard fine-tuning (e.g., CoOp) causes overconfidence in new classes by increasing textual label divergence, whereas regularization-based tuning (e.g., KgCoOp) maintains the confidence level but results in underconfidence in base classes due to the improved accuracy. Inspired by the observations, we introduce Dynamic Outlier Regularization (DOR) to ensure the confidence calibration on both base and new classes after fine-tuning. In particular, we propose to minimize the feature deviation of novel textual labels (instead of base classes) sampled from a large vocabulary. In effect, DOR prevents the increase in textual divergence for new labels while easing restrictions on base classes. Extensive experiments demonstrate that DOR can enhance the calibration performance of current fine-tuning methods on base and new classes.
Paperid:434
Authors:Shion Takeno · Yoshito Okura · Yu Inatsu · Tatsuya Aoyama · Tomonari Tanaka · Satoshi Akahane · Hiroyuki Hanada · Noriaki Hashimoto · Taro Murayama · Hanju Lee · Shinya Kojima · Ichiro Takeuchi
Abstract:
Gaussian process regression (GPR) or kernel ridge regression is a widely used and powerful tool for nonlinear prediction. Therefore, active learning (AL) for GPR, which actively collects data labels to achieve an accurate prediction with fewer data labels, is an important problem. However, existing AL methods do not theoretically guarantee prediction accuracy for the target distribution. Furthermore, as discussed in the distributionally robust learning literature, specifying the target distribution is often difficult. Thus, this paper proposes two AL methods that effectively reduce the worst-case expected error for GPR, which is the worst-case expectation over target distribution candidates. We show an upper bound on the worst-case expected squared error, which suggests that the error can be made arbitrarily small with a finite number of data labels under mild conditions. Finally, we demonstrate the effectiveness of the proposed methods through synthetic and real-world datasets.
Paperid:435
Authors:Pranjal Aggarwal · Bryan Parno · Sean Welleck
Abstract:
Automated code generation with large language models has gained significant traction, but there remains no guarantee of the correctness of generated code. We aim to use formal verification to provide mathematical guarantees that the generated code is correct. However, generating formally verified code with LLMs is hindered by the scarcity of training data and the complexity of formal proofs. To tackle this challenge, we introduce AlphaVerus, a self-improving framework that bootstraps formally verified code generation by iteratively translating programs from a higher-resource language and leveraging feedback from a verifier. AlphaVerus operates in three phases: exploration of candidate translations, Treefinement -- a novel tree search algorithm for program refinement using verifier feedback, and filtering misaligned specifications and programs to prevent reward hacking. Through this iterative process, AlphaVerus enables the LLaMA-3.1-70B model to generate verified code without human intervention or model fine-tuning. AlphaVerus shows an ability to generate formally verified solutions for HumanEval and MBPP, laying the groundwork for truly trustworthy code-generation agents.
Paperid:436
Authors:KaShun SHUM · Yuzhen Huang · Hongjian Zou · dingqi · YiXuan Liao · Xiaoxin Chen · Qian Liu · Junxian He
Abstract:
Language model pretraining involves training on extensive corpora, where data quality plays a pivotal role. In this work, we aim to directly estimate the contribution of data during pretraining and select pretraining data in an efficient manner. Specifically, we draw inspiration from recent findings showing that compression efficiency (i.e., the normalized loss) of diverse models on certain text correlates strongly with their downstream performance, when the text domain aligns with the downstream benchmark (Huang et al., 2024). Building on this observation, we hypothesize that data on which model losses are predictive of downstream abilities also contribute effectively to learning. To leverage this insight, we introduce data selection based on data's Predictive strength (PreSelect), a lightweight and efficient data selection method that requires training and deploying only a fastText-based scorer. Through comprehensive experiments with 1B and 3B parameter models, we demonstrate that PreSelect trained on 30B tokens surpasses the performance of a vanilla baseline trained on 300B tokens, achieving a 10x reduction in compute requirements. Furthermore, PreSelect significantly outperforms other competitive data selection baselines, such as DCLM and FineWeb-Edu. To facilitate further research, we will open-source our trained data selection scorer along with the curated datasets for the community.
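Operationally, scorer-based selection of this kind reduces to ranking documents by a cheap learned score and keeping a top fraction. A sketch of that selection step; the scorer here is a toy callable, standing in for the paper's trained fastText classifier:

```python
def preselect(docs, score, keep_frac=0.1):
    """Top-fraction data selection by a cheap per-document score (a sketch of
    the selection step only; the interesting part of the paper is how the
    scorer is trained)."""
    ranked = sorted(docs, key=score, reverse=True)
    return ranked[: max(1, int(len(docs) * keep_frac))]

# Toy corpus with document length as a stand-in quality score.
docs = ["aaaa", "ab", "abcdef", "abcd", "a", "abcde", "xyz", "qrstu", "mn", "p"]
kept = preselect(docs, score=len, keep_frac=0.2)
print(kept)  # ['abcdef', 'abcde']
```

At pretraining scale the same two lines run over billions of documents, which is why the scorer must be as light as fastText.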
Paperid:437
Authors:Yaoqin He · Junchen Fu · Kaiwen Zheng · Songpei Xu · Fuhai Chen · Jie Li · Joemon Jose · Xuri Ge
Abstract:
In this paper, we present a novel approach, termed Double-Filter, to ``slim down'' the fine-tuning process of vision-language pre-trained (VLP) models via filtering redundancies in feature inputs and architectural components. We enhance the fine-tuning process using two approaches. First, we develop a new patch selection method incorporating image patch filtering through background and foreground separation, followed by a refined patch selection process. Second, we design a genetic algorithm to eliminate redundant fine-grained architecture layers, improving the efficiency and effectiveness of the model. The former makes patch selection semantics more comprehensive, improving inference efficiency while ensuring semantic representation. The latter's fine-grained layer filter removes architectural redundancy to the extent possible and mitigates the impact on performance. Experimental results demonstrate that the proposed Double-Filter achieves superior efficiency of model fine-tuning and maintains competitive performance compared with advanced efficient fine-tuning methods on two downstream tasks, VQA and NLVR. In addition, it has been proven effective with the METER and ViLT VLP models.
Paperid:438
Authors:Xilin Wei · Xiaoran Liu · Yuhang Zang · Xiaoyi Dong · Pan Zhang · Yuhang Cao · Jian Tong · Haodong Duan · Qipeng Guo · Jiaqi Wang · Xipeng Qiu · Dahua Lin
Abstract:
While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors into V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce VideoRoPE, with a 3D structure designed to preserve spatio-temporal relationships. VideoRoPE features low-frequency temporal allocation to mitigate periodic oscillations, a diagonal layout to maintain spatial symmetry, and adjustable temporal spacing to decouple temporal and spatial indexing. VideoRoPE consistently surpasses previous RoPE variants across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. Our code and model weights will be publicly released.
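The low-frequency temporal allocation above can be sketched as a split of the standard RoPE frequency ladder. The split ratio and function interface below are illustrative, not the paper's exact recipe:

```python
import numpy as np

def videorope_freqs(dim=64, t_frac=0.25, base=10000.0):
    """Sketch of the frequency-allocation idea: standard RoPE channels rotate
    at rate base**(-2i/dim), ordered fast to slow. Assigning the slowest
    (lowest-frequency) channels to the temporal axis damps the periodic
    oscillations that distractors exploit; the rest serve the spatial axes."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # ordered high -> low frequency
    n_t = int(len(freqs) * t_frac)
    t_freqs = freqs[-n_t:]    # slowest-rotating channels -> time
    xy_freqs = freqs[:-n_t]   # remaining channels -> height/width
    return t_freqs, xy_freqs

t_freqs, xy_freqs = videorope_freqs()
print(len(t_freqs), len(xy_freqs))  # 8 24
```

Every temporal frequency ends up no larger than every spatial one, so temporal rotation is monotone-slow over long clips.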
Paperid:439
Authors:Nikolaus Howe · Ian McKenzie · Oskar Hollinsworth · Michał Zając · Tom Tseng · Aaron Tucker · Pierre-Luc Bacon · Adam Gleave
Abstract:
Language models exhibit scaling laws, whereby increasing model and dataset size predictably decreases negative log likelihood, unlocking a dazzling array of capabilities. At the same time, even the most capable systems are currently vulnerable to adversarial inputs such as jailbreaks and prompt injections, despite concerted efforts to make them robust. As compute becomes more accessible to both attackers and defenders, which side will benefit more from scale? We attempt to answer this question with a detailed study of robustness on language models spanning three orders of magnitude in parameter count. From the defender's perspective, we find that in the absence of other interventions, increasing model size alone does not consistently improve robustness. In adversarial training, we find that larger models are more sample-efficient and less compute-efficient than smaller models, and often better generalize their defense to new threat models. From the attacker’s perspective, we find that increasing attack compute smoothly and reliably increases attack success rate against both fine-tuned and adversarially trained models. Finally, we show that across model sizes studied, doubling compute on adversarial training only forces an attacker to less than double attack compute to maintain the same attack success rate. However, adversarial training becomes more and more effective on larger models, suggesting that defenders could eventually have the advantage with increasing model size. These results underscore the value of adopting a scaling lens when discussing robustness of frontier models.
Paperid:440
Authors:Mostafa Elhoushi · Jeff Johnson
Abstract:
We present any4, a learned 4-bit weight quantization solution for large language models (LLMs) providing arbitrary numeric representations without requiring pre-processing of weights or activations. any4 yields higher accuracy compared to other related 4-bit numeric representation types: int4, fp4 and nf4, as evaluated on a range of model sizes, generations and families (Llama 2, Llama 3, Mistral and Mixtral). While any4 does not require preprocessing of weights or activations, it is also competitive with orthogonal techniques that require such preprocessing (e.g., AWQ and GPTQ). We also experiment with any3 and any2 and show competitiveness at lower bits. Additionally, we show that we can calibrate using a single curated diverse sample rather than hundreds of samples from a dataset as done in most quantization approaches. We also open source tinygemm, a latency-optimized GPU matrix multiplication library for LLMs, that implements any4 using a GPU-efficient lookup table strategy along with other common quantization methods.
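The lookup-table idea behind "arbitrary numeric representations" can be sketched in numpy. The table values below are illustrative, not a trained any4 codebook:

```python
import numpy as np

def dequantize_any4(codes, lut):
    """Sketch of lookup-table dequantization: each stored weight is a 4-bit
    index into a per-row table of 16 learned, arbitrarily placed values --
    unlike int4/fp4/nf4, whose representable values form fixed grids."""
    return np.take_along_axis(lut, codes, axis=1)

# One output channel with an illustrative non-uniform 16-entry table.
lut = np.sort(np.random.default_rng(0).normal(size=(1, 16)), axis=1)
codes = np.array([[0, 15, 7, 3]])   # 4-bit indices, i.e. the stored weights
weights = dequantize_any4(codes, lut)
print(weights.shape)  # (1, 4)
```

Because the 16 values are learned rather than fixed, the representable set can adapt to each row's weight distribution; tinygemm implements this same gather as a GPU-efficient lookup.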
Paperid:441
Authors:Hongyuan Su · Yu Zheng · Yuan Yuan · Yuming Lin · Depeng Jin · Yong Li
Abstract:
Training reinforcement learning (RL) agents requires extensive trials and errors, which becomes prohibitively time-consuming in systems with costly reward evaluations. To address this challenge, we propose adaptive reward modeling (AdaReMo), which accelerates RL training by decomposing the complicated reward function into multiple localized fast reward models approximating direct reward evaluation with neural networks. These models dynamically adapt to the agent’s evolving policy by fitting the currently explored subspace with the latest trajectories, ensuring accurate reward estimation throughout the entire training process while significantly reducing computational overhead. We empirically show that AdaReMo not only achieves over 1,000 times speedup but also improves the performance by 14.6% over state-of-the-art approaches across three expensive-to-evaluate systems---molecular generation, epidemic control, and spatial planning. Code and data for the project are provided at https://anonymous.4open.science/r/RLS-CFCE
Paperid:442
Authors:Run He · Di Fang · Yicheng Xu · Yawen Cui · Ming Li · Cen Chen · Ziqian Zeng · HUIPING ZHUANG
Abstract:
Exemplar-Free Class-Incremental Learning (EFCIL) aims to sequentially learn from distinct categories without retaining exemplars but easily suffers from catastrophic forgetting of learned knowledge. While existing EFCIL methods leverage knowledge distillation to alleviate forgetting, they still face two critical challenges: semantic shift and decision bias. Specifically, the embeddings of old tasks shift in the embedding space after learning new tasks, and the classifier becomes biased towards new tasks due to training solely with new data, thereby hindering the balance between old and new knowledge. To address these issues, we propose the Dual-Projection Shift Estimation and Classifier Reconstruction (DPCR) approach for EFCIL. DPCR effectively estimates semantic shift through a dual projection, which combines a learnable transformation with a row-space projection to capture both task-wise and category-wise shifts. Furthermore, to mitigate decision bias, DPCR employs ridge regression to reformulate classifier training as a reconstruction process. This reconstruction exploits previous information encoded in the covariance and prototype of each class after calibration with the estimated shift, thereby reducing decision bias. Extensive experiments demonstrate that, across various datasets, DPCR effectively balances old and new tasks, outperforming state-of-the-art EFCIL methods.
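Reconstructing a classifier by ridge regression from accumulated statistics, rather than by gradient training on new data, can be sketched as follows. The statistics and regularizer here are illustrative; DPCR additionally calibrates them with its estimated semantic shift:

```python
import numpy as np

def reconstruct_classifier(gram, cross, lam=1.0):
    """Closed-form ridge-regression classifier from accumulated statistics
    (a sketch of the reconstruction idea, not DPCR's exact procedure):
    gram = sum_x x x^T and cross = sum_x x y^T over all seen features with
    one-hot labels. Neither requires storing exemplars, which is what makes
    this usable in EFCIL."""
    d = gram.shape[0]
    return np.linalg.solve(gram + lam * np.eye(d), cross)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))            # toy features
y = (X[:, 0] > 0).astype(int)            # toy two-class labels
Y = np.eye(2)[y]                         # one-hot targets
W = reconstruct_classifier(X.T @ X, X.T @ Y)
acc = ((X @ W).argmax(axis=1) == y).mean()
print(f"accuracy of reconstructed classifier: {acc:.2f}")
```

Since `gram` and `cross` are simple sums, they can be updated task by task and the classifier re-solved at any point, which is how reconstruction sidesteps the new-task bias of incremental gradient training.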
Paperid:443
Authors:Dujian Ding · Ankur Mallick · Shaokun Zhang · Chi Wang · Daniel Madrigal · Mirian Hipolito Garcia · Menglin Xia · Laks V.S Lakshmanan · Qingyun Wu · Victor Ruehle
Abstract:
Large language models (LLMs) are powerful tools but are often expensive to deploy at scale. LLM query routing mitigates this by dynamically assigning queries to models of varying cost and quality to obtain a desired tradeoff. Prior query routing approaches generate only one response from the selected model, and a single response from a small (inexpensive) model is often not good enough to beat a response from a large (expensive) model; as a result, they end up overusing the large model and missing out on potential cost savings. However, it is well known that for small models, generating multiple responses and selecting the best can enhance quality while remaining cheaper than a single large-model response. We leverage this idea to propose BEST-Route, a novel routing framework that chooses a model and the number of responses to sample from it based on query difficulty and the quality thresholds. Experiments on real-world datasets demonstrate that our method reduces costs by up to 60\% with less than 1\% performance drop.
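The routing-plus-best-of-n idea can be sketched in a few lines. The threshold, sample count, and model/scorer interfaces below are illustrative placeholders, not BEST-Route's actual API (which learns these choices from data):

```python
import random

def route(difficulty, small_model, large_model, score,
          difficulty_threshold=0.7, n_samples=4):
    """Sketch of the routing idea: easy queries get best-of-n sampling from the
    cheap model; hard queries go straight to the expensive model."""
    if difficulty >= difficulty_threshold:
        return large_model()
    candidates = [small_model() for _ in range(n_samples)]
    return max(candidates, key=score)   # best-of-n selection

random.seed(0)
small = lambda: f"small-answer-{random.randint(0, 9)}"
large = lambda: "large-answer"
score = lambda s: int(s[-1])            # toy quality score over small answers

print(route(0.3, small, large, score))  # best of 4 cheap samples
print(route(0.9, small, large, score))  # routed to the large model
```

The cost saving comes from the fact that n cheap samples plus a reranker is often still cheaper than one large-model call.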
Paperid:444
Authors:Grant Forbes · Leonardo Villalobos-Arias · Jianxun Wang · Arnav Jhala · David Roberts
Abstract:
Recent RL research has utilized reward shaping--particularly complex shaping rewards such as intrinsic motivation (IM)--to encourage agent exploration in sparse-reward environments. While often effective, ``reward hacking'' can lead to the shaping reward being optimized at the expense of the extrinsic reward, resulting in a suboptimal policy. Potential-Based Reward Shaping (PBRS) techniques such as Generalized Reward Matching (GRM) and Policy-Invariant Explicit Shaping (PIES) have mitigated this. These methods allow for implementing IM without altering optimal policies. In this work we show that they are effectively unsuitable for complex, exploration-heavy environments with long-duration episodes. To remedy this, we introduce Action-Dependent Optimality Preserving Shaping (ADOPS), a method of converting intrinsic rewards to an optimality-preserving form that allows agents to utilize IM more effectively in the extremely sparse environment of Montezuma's Revenge. We also prove ADOPS accommodates reward shaping functions that cannot be written in a potential-based form: while PBRS-based methods require the cumulative discounted intrinsic return to be independent of actions, ADOPS allows intrinsic cumulative returns to depend on agents' actions while still preserving the optimal policy set. We show how action-dependence enables ADOPS to preserve optimality while learning in complex, sparse-reward environments where other methods struggle.
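For context, classical PBRS (Ng et al.) adds F(s, s') = gamma*Phi(s') - Phi(s) to the extrinsic reward; because F telescopes over any trajectory, the optimal policy is unchanged -- and this action-independence is exactly the restriction ADOPS relaxes. A minimal numeric check of the telescoping property (potentials and gamma are illustrative):

```python
def shaped_reward(r_ext, phi_s, phi_s_next, gamma=0.99, done=False):
    """Potential-based shaping: F = gamma*Phi(s') - Phi(s). The potential of a
    terminal state is conventionally taken to be 0."""
    phi_next = 0.0 if done else phi_s_next
    return r_ext + gamma * phi_next - phi_s

# Over an episode the shaping terms telescope to gamma^T*Phi(s_T) - Phi(s_0),
# so no policy can gain shaping return by detouring -- hence policy invariance.
potentials = [1.0, 2.0, 0.5, 0.0]
gamma = 0.5
total_shaping = sum(
    (gamma ** t) * (gamma * potentials[t + 1] - potentials[t])
    for t in range(len(potentials) - 1)
)
print(round(total_shaping, 6))  # -1.0 == gamma^3 * Phi(s_3) - Phi(s_0)
```

ADOPS's contribution is preserving this invariance guarantee even when the shaping term depends on the action taken, which F above cannot express.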
Paperid:445
Authors:Qihong Song · XitingLiu · Hongyuan Zhu · Joey Tianyi Zhou · Xi Peng · Peng Hu
Abstract:
Recently, deep unsupervised hashing has gained considerable attention in image retrieval due to its advantages in cost-free data labeling, computational efficiency, and storage savings. Although existing methods achieve promising performance by leveraging inherent visual structures within data, they primarily focus on learning discriminative features from unlabeled images through the limited internal knowledge, resulting in an intrinsic upper bound on performance. To break through this intrinsic limitation, we propose a novel method called Deep Unsupervised Hashing with External Guidance (DUH-EG), which incorporates external textual knowledge as semantic guidance to enhance discrete representation learning. Specifically, our DUH-EG: i) first selects representative semantic nouns from an external textual database by minimizing their redundancy, then matches images with them to extract more discriminative external features; and ii) presents a novel bidirectional contrastive learning mechanism to maximize agreement between hash codes in internal and external spaces, thereby capturing discrimination from both external and intrinsic structures in Hamming space. Extensive experiments on four benchmark datasets demonstrate that our DUH-EG remarkably outperforms existing state-of-the-art hashing methods.
Paperid:446
Authors:Lei Yan · Xin Zhang · Qing Mai
Abstract:
Scientific and engineering applications are often heterogeneous, making it beneficial to account for latent clusters or subpopulations when learning low-dimensional subspaces in supervised learning, and vice versa. In this paper, we combine the concept of subspace clustering with model-based sufficient dimension reduction and thus generalize the sufficient dimension reduction framework from the homogeneous regression setting to heterogeneous data applications. In particular, we propose the mixture of principal fitted components (mixPFC) model, a novel framework that simultaneously achieves clustering, subspace estimation, and variable selection, providing a unified solution for high-dimensional heterogeneous data analysis. We develop a group Lasso penalized expectation-maximization (EM) algorithm and obtain its non-asymptotic convergence rate. Through extensive simulation studies, mixPFC demonstrates superior performance compared to existing methods across various settings. Applications to real-world datasets further highlight its effectiveness and practical advantages.
Paperid:447
Authors:Yuzhong Hong · Hanshan Zhang · Junwei Bao · Hongfei Jiang · yang song
Abstract:
Since the debut of DPO, it has been shown that aligning a target LLM with human preferences via the KL-constrained RLHF loss is mathematically equivalent to a special kind of reward modeling task. Concretely, the task requires: 1) using the target LLM to parameterize the reward model, and 2) tuning the reward model so that it has a 1:1 linear relationship with the true reward. However, we identify a significant issue: the DPO loss might have multiple minimizers, of which only one satisfies the required linearity condition. The problem arises from a well-known issue of the underlying Bradley-Terry preference model: it does not always have a unique maximum likelihood estimator (MLE). Consequently, the minimizer of the RLHF loss might be unattainable because it is merely one among many minimizers of the DPO loss. As a better alternative, we propose an energy-based preference model (EBM) that always has a unique MLE, inherently satisfying the linearity requirement. To showcase the practical utility of replacing BTM with our EBM in the context of offline alignment, we adapt a simple yet scalable objective function from the recent literature on fitting EBMs and name it Energy Preference Alignment (EPA). Empirically, we demonstrate that EPA consistently delivers better performance on open benchmarks compared to DPO, thereby validating the theoretical superiority of our EBM.
Paperid:448
Authors:Suning Huang · Zheyu Zhang · Tianhai Liang · Yihan Xu · Zhehao Kou · Chenhao Lu · Guowei Xu · Zhengrong Xue · Huazhe Xu
Abstract:
Visual deep reinforcement learning (RL) enables robots to acquire skills from visual input for unstructured tasks. However, current algorithms suffer from low sample efficiency, limiting their practical applicability. In this work, we present MENTOR, a method that improves both the architecture and optimization of RL agents. Specifically, MENTOR replaces the standard multilayer perceptron (MLP) with a mixture-of-experts (MoE) backbone and introduces a task-oriented perturbation mechanism. MENTOR outperforms state-of-the-art methods across three simulation benchmarks and achieves an average success rate of 83% on three challenging real-world robotic manipulation tasks, significantly surpassing the 32% success rate of the strongest existing model-free visual RL algorithm. These results underscore the importance of sample efficiency in advancing visual RL for real-world robotics. Experimental videos are available at https://mentor-vrl.github.io/.
Paperid:449
Authors:Louis Airale · Antonio Longa · Mattia Rigon · Andrea Passerini · Roberto Passerone
Abstract:
Graph transformers extend global self-attention to graph-structured data, achieving notable success in graph learning. Recently, random walk structural encoding (RWSE) has been found to further enhance their predictive power by encoding both structural and positional information into the edge representation. However, RWSE cannot always distinguish between edges that belong to different local graph patterns, which reduces its ability to capture the full structural complexity of graphs. This work introduces Simple Path Structural Encoding (SPSE), a novel method that utilizes simple path counts for edge encoding. We show theoretically and experimentally that SPSE overcomes the limitations of RWSE, providing a richer representation of graph structures, particularly in capturing local cyclic patterns. To make SPSE computationally tractable, we propose an efficient approximate algorithm for simple path counting. SPSE demonstrates significant performance improvements over RWSE on various benchmarks, including molecular and long-range graph datasets, achieving statistically significant gains in discriminative tasks. These results establish SPSE as a powerful edge encoding alternative for enhancing the expressivity of graph transformers.
Paperid:450
Authors:Yinfeng Chen · Jin Liu · Rui Qiu
Abstract:
The normal vectors obtained from the support vector machine (SVM) method offer the potential to achieve sufficient dimension reduction in both classification and regression scenarios. Motivated by this, we introduce a unified framework for nonlinear sufficient dimension reduction based on classification ensembles. Kernel principal SVM, a nonlinear sufficient dimension reduction method using the reproducing kernel Hilbert space and SVM, can almost be regarded as a special case of this framework. In particular, the neural network function class is considered as a replacement for the reproducing kernel Hilbert space to implement more flexible deep nonlinear sufficient dimension reduction. Its advantages are particularly prominent in cases of severe data contamination, large samples, and complex inputs. Theoretically, we demonstrate that the optimal solution of the population objective function is unbiased in the sense of the central $\sigma$-field. A non-asymptotic upper bound for the estimation error is also provided. Through simulations and real data analysis, our proposed deep nonlinear sufficient dimension reduction method exhibits considerable competitiveness.
Paperid:451
Authors:Jonathan Geuter · Gregor Kornhardt · Ingimar Tomasson · Vaios Laschos
Abstract:
Optimal Transport (OT) problems are a cornerstone of many applications, but solving them is computationally expensive. To address this, we propose UNOT (Universal Neural Optimal Transport), a novel framework capable of accurately predicting (entropic) OT distances and plans between discrete measures of variable resolution for a given cost function. UNOT builds on Fourier Neural Operators (FNOs), a universal class of neural networks that map between function spaces and are discretization-invariant, which enables our network to process measures of varying sizes. The network is trained adversarially using a second, generating network and a self-supervised bootstrapping loss. We theoretically justify the use of FNOs, prove that our generator is universal, and that minimizing the bootstrapping loss provably minimizes the ground truth loss. Through extensive experiments, we show that our network not only accurately predicts optimal transport distances and plans across a wide range of datasets, but also captures the geometry of the Wasserstein space correctly. Furthermore, we show that our network can be used as a state-of-the-art initialization for the Sinkhorn algorithm, significantly outperforming existing approaches.
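For context, the Sinkhorn algorithm that UNOT initializes alternates marginal rescalings of the Gibbs kernel derived from the cost matrix. A minimal dense sketch of the standard entropic-OT iterations (not the paper's network; the plain-list representation and function name are illustrative):

```python
import math

def sinkhorn(cost, a, b, eps=0.1, iters=200):
    """Entropic OT via Sinkhorn iterations: alternately rescale the
    Gibbs kernel K = exp(-C / eps) so that the transport plan's row
    sums match marginal a and its column sums match marginal b."""
    n, m = len(a), len(b)
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Entropic transport plan: P[i][j] = u[i] * K[i][j] * v[j]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

A learned initialization replaces the all-ones starting scalings `u`, `v` with a prediction close to the fixed point, cutting the number of iterations needed.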
Paperid:452
Authors:Emanuele Palumbo · Moritz Vandenhirtz · Alain Ryser · Imant Daunhawer · Julia Vogt
Abstract:
The hierarchical structure inherent in many real-world datasets makes the modeling of such hierarchies a crucial objective in both unsupervised and supervised machine learning. While recent advancements have introduced deep architectures specifically designed for hierarchical clustering, we adopt a critical perspective on this line of research. Our findings reveal that these methods face significant limitations in scalability and performance when applied to realistic datasets. Given these findings, we present an alternative approach and introduce a lightweight method that builds on pre-trained non-hierarchical clustering models. Remarkably, our approach outperforms specialized deep models for hierarchical clustering, and it is broadly applicable to any pre-trained clustering model that outputs logits, without requiring any fine-tuning. To highlight the generality of our approach, we extend its application to a supervised setting, demonstrating its ability to recover meaningful hierarchies from a pre-trained ImageNet classifier. Our results establish a practical and effective alternative to existing deep hierarchical clustering methods, with significant advantages in efficiency, scalability and performance.
Paperid:453
Authors:Eric Elmoznino · Tom Marty · Tejas Kasetty · Léo Gagnon · Sarthak Mittal · Mahan Fathi · Dhanya Sridhar · Guillaume Lajoie
Abstract:
A central goal of machine learning is generalization. While the No Free Lunch Theorem states that we cannot obtain theoretical guarantees for generalization without further assumptions, in practice we observe that simple models which explain the training data generalize best—a principle called Occam's razor. Despite the need for simple models, most current approaches in machine learning only minimize the training error, and at best indirectly promote simplicity through regularization or architecture design. Here, we draw a connection between Occam's razor and in-context learning—an emergent ability of certain sequence models like Transformers to learn at inference time from past observations in a sequence. In particular, we show that the next-token prediction loss used to train in-context learners is directly equivalent to a data compression technique called prequential coding, and that minimizing this loss amounts to jointly minimizing both the training error and the complexity of the model that was implicitly learned from context. Our theory and the empirical experiments we use to support it not only provide a normative account of in-context learning, but also elucidate the shortcomings of current in-context learning methods, suggesting ways in which they can be improved.
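The prequential-coding idea can be illustrated with a simple frequency model: each symbol is coded under the model fit to the prefix seen so far, and the total code length is exactly a sum of next-token log losses. A toy sketch with Laplace smoothing standing in for the learner (an assumed, illustrative setup, not the paper's):

```python
import math

def prequential_code_length(sequence, alphabet_size):
    """Prequential (predict-then-update) code length in nats: each
    symbol is coded under the Laplace-smoothed frequency model fit to
    the prefix seen so far, then the model is updated with that symbol.
    This mirrors summing the next-token prediction loss over a sequence."""
    counts = {}
    total = 0.0
    for t, symbol in enumerate(sequence):
        p = (counts.get(symbol, 0) + 1) / (t + alphabet_size)  # Laplace estimate
        total += -math.log(p)
        counts[symbol] = counts.get(symbol, 0) + 1
    return total
```

A learner that fits the data with a simpler implicit model assigns higher probability to later symbols, yielding a shorter prequential code; that is the compression view of Occam's razor.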
Paperid:454
Authors:Aaron Wenteler · Martina Occhetta · Nikhil Branson · Victor Curean · Magdalena Huebner · William Dee · William Connell · Siu Chung · Alex Hawkins-Hooker · Yasha Ektefaie · César Miguel Valdez Córdova · Amaya Gallagher-Syed
Abstract:
In silico modeling of transcriptional responses to perturbations is crucial for advancing our understanding of cellular processes and disease mechanisms. We present PertEval-scFM, a standardized framework designed to evaluate models for perturbation effect prediction. We apply PertEval-scFM to benchmark zero-shot single-cell foundation model (scFM) embeddings against simpler baseline models to assess whether these contextualized representations enhance perturbation effect prediction. Our results show that scFM embeddings do not provide consistent improvements over baseline models, especially under distribution shift. Additionally, all models struggle with predicting strong or atypical perturbation effects. Overall, this study provides a systematic evaluation of zero-shot scFM embeddings for perturbation effect prediction, highlighting the challenges of this task and revealing the limitations of current-generation scFMs. Our findings underscore the need for specialized models and high-quality datasets that capture a broader range of cellular states. Source code and documentation can be found at: https://anonymous.4open.science/r/PertEval-C674/README.md.
Paperid:455
Authors:Suyuan Liu · Hao Yu · Hao Tan · KE LIANG · Siwei Wang · Shengju Yu · En Zhu · Xinwang Liu
Abstract:
Multi-view clustering (MVC) leverages complementary information from diverse data sources to enhance clustering performance. However, its practical deployment in distributed and privacy-sensitive scenarios remains challenging. Federated multi-view clustering (FMVC) has emerged as a potential solution, but existing approaches suffer from substantial limitations, including excessive communication overhead, insufficient privacy protection, and inadequate handling of missing views. To address these issues, we propose Efficient Federated Incomplete Multi-View Clustering (EFIMVC), a novel framework that introduces a localized optimization strategy to significantly reduce communication costs while ensuring theoretical convergence. EFIMVC employs both view-specific and shared anchor graphs as communication variables, thereby enhancing privacy by avoiding the transmission of sensitive embeddings. To further improve robustness, we design a dual-anchor alignment mechanism that mitigates anchor misalignment during graph fusion. Moreover, EFIMVC seamlessly extends to scenarios with missing views, making it a practical and scalable solution for real-world applications. Extensive experiments on benchmark datasets demonstrate the superiority of EFIMVC in clustering accuracy, communication efficiency, and privacy preservation.
Paperid:456
Authors:Tianqi Du · Haotian Huang · Yifei Wang · Yisen Wang
Abstract:
Large language models (LLMs) have exhibited impressive performance and surprising emergent properties. However, their effectiveness is still limited by the fixed context window of the transformer architecture, posing challenges for length generalization. In this work, we propose a new perspective on length generalization by focusing on the output distribution, introducing a shift from the conventional emphasis on input features such as positional encodings or data structures. Specifically, through case studies on synthetic tasks, we highlight the critical role of \textbf{output alignment}: the consistency of output distributions across sequences of varying lengths. Extending this insight to natural language tasks, we propose a metric called Long-Short Misalignment to quantify output alignment, uncovering a strong correlation between this metric and length generalization performance. Building on these findings, we develop a regularization loss that enhances output alignment during training. Extensive experiments validate the effectiveness of our approach, offering a fresh perspective for understanding and improving length generalization in LLMs.
Paperid:457
Authors:Guangting Zheng · Yehao Li · Yingwei Pan · Jiajun Deng · Ting Yao · Yanyong Zhang · Tao Mei
Abstract:
Autoregressive models have emerged as a powerful generative paradigm for visual generation. The current de facto standard of next token prediction commonly operates over a single-scale sequence of dense image tokens, and is incapable of utilizing global context, especially for early token prediction. In this paper, we introduce a new autoregressive design to model a hierarchy from a few low-resolution image tokens to the typical dense image tokens, and delve into a thorough hierarchical dependency across multi-scale image tokens. Technically, we present a Hierarchical Masked Autoregressive model (Hi-MAR) that pivots on low-resolution image tokens to trigger hierarchical autoregressive modeling in a multi-phase manner. Hi-MAR learns to predict a few image tokens in low resolution, functioning as intermediary pivots to reflect global structure, in the first phase. Such pivots act as additional guidance to strengthen the next autoregressive modeling phase by shaping global structural awareness of typical dense image tokens. A new Diffusion Transformer head is further devised to amplify the global context among all tokens for mask token prediction. Extensive evaluations on both class-conditional and text-to-image generation tasks demonstrate that Hi-MAR outperforms typical AR baselines, while requiring fewer computational costs.
Paperid:458
Authors:Tianze Yang · Yucheng Shi · Mengnan Du · Xuansheng Wu · Qiaoyu Tan · Jin Sun · Ninghao Liu
Abstract:
Vector-Quantized Generative Models (VQGMs) have emerged as powerful tools for image generation. However, the key component of VQGMs---the codebook of discrete tokens---is still not well understood, e.g., which tokens are critical to generate an image of a certain concept? This paper introduces Concept-Oriented Token Explanation (CORTEX), a novel approach for interpreting VQGMs by identifying concept-specific token combinations. Our framework employs two methods: (1) a sample-level explanation method that analyzes token importance scores in individual images, and (2) a codebook-level explanation method that explores the entire codebook to find globally relevant tokens. Experimental results demonstrate CORTEX's efficacy in providing clear explanations of token usage in the generative process, outperforming baselines across multiple pretrained VQGMs. Besides enhancing VQGM transparency, CORTEX is useful in applications such as targeted image editing and shortcut feature detection.
Paperid:459
Authors:Markelle Kelly · Alex Boyd · Samuel Showalter · Padhraic Smyth · Mark Steyvers
Abstract:
Applications of machine learning often involve making predictions based on both model outputs and the opinions of human experts. In this context, we investigate the prediction of human expert consensus for $K$-ary classification problems. Given a budget for querying experts, the objective is to achieve the maximum accuracy (in terms of predicting expert consensus) while leveraging the class probability estimates of pre-trained classifiers. We develop a general Bayesian framework for this problem, modeling expert correlation via a joint latent representation, enabling simulation-based inference about the utility of additional expert queries, as well as inference of posterior distributions over unobserved expert labels. We apply our approach to two real-world medical classification problems, as well as to CIFAR-10H and ImageNet-16H, demonstrating substantial reductions in the cost of querying human experts while maintaining high prediction accuracy.
Paperid:460
Authors:Rui Zhang · Yun Shen · Hongwei Li · Wenbo Jiang · Hanxiao Chen · Yuan Zhang · Guowen Xu · Yang Zhang
Abstract:
Recent research highlights concerns about the trustworthiness of third-party Pre-Trained Language Models (PTLMs) due to potential backdoor attacks. These backdoored PTLMs, however, are effective only for specific pre-defined downstream tasks. In reality, these PTLMs can be adapted to many other unrelated downstream tasks. Such adaptation may lead to unforeseen consequences in downstream model outputs, consequently raising user suspicion and compromising attack stealthiness. We refer to this phenomenon as backdoor complications. In this paper, we undertake the first comprehensive quantification of backdoor complications. Through extensive experiments using 4 prominent PTLMs and 16 text classification benchmark datasets, we demonstrate the widespread presence of backdoor complications in downstream models fine-tuned from backdoored PTLMs. The output distribution of triggered samples significantly deviates from that of clean samples. Consequently, we propose a backdoor complication reduction method leveraging multi-task learning to mitigate complications without prior knowledge of downstream tasks. The experimental results demonstrate that our proposed method can effectively reduce complications while maintaining the efficacy and consistency of backdoor attacks.
Paperid:461
Authors:Jingbang Chen · Xinyuan Cao · Alicia Stepin · Li Chen
Abstract:
We study learning-augmented binary search trees (BSTs) via Treaps with carefully designed priorities. The result is a simple search tree in which the depth of each item $x$ is determined by its predicted weight $w_x$. Specifically, each item $x$ is assigned a composite priority of $-\lfloor\log\log(1/w_x)\rfloor + U(0, 1)$, where $U(0, 1)$ is a uniform random variable. By choosing $w_x$ as the relative frequency of $x$, the resulting search trees achieve static optimality. This approach generalizes the recent learning-augmented BSTs [Lin-Luo-Woodruff ICML'22], which only work for Zipfian distributions, by extending them to arbitrary input distributions. Furthermore, we demonstrate that our method can be generalized to a B-Tree data structure using the B-Treap approach [Golovin ICALP'09]. Our search trees are also capable of leveraging localities in the access sequence through online self-reorganization, thereby achieving the working-set property. Additionally, they are robust to prediction errors and support dynamic operations, such as insertions, deletions, and prediction updates. We complement our analysis with an empirical study, demonstrating that our method outperforms prior work and classic data structures.
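The composite priority above can be sketched directly; base-2 logarithms and the max-priority-at-root Treap convention are assumed here, since the abstract fixes neither:

```python
import math
import random

def composite_priority(w, rng):
    """Composite priority -floor(log log(1/w)) + U(0, 1) for an item
    with predicted relative frequency w in (0, 1). Base-2 logarithms
    are assumed; more frequent items receive a larger integer base
    term, so under the max-priority-at-root convention they end up
    nearer the root, i.e., at smaller depth."""
    base = -math.floor(math.log2(math.log2(1.0 / w)))
    return base + rng.random()  # uniform tie-breaker within a frequency class
```

For example, an item with frequency 1/4 gets base term -1 while one with frequency 2^-16 gets -4, so the frequent item always receives the larger priority regardless of the uniform noise.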
Paperid:462
Authors:Xubin Wang · Jianfei Wu · Yuan Yichen · Mingzhe Li · Deyu Cai · Weijia Jia
Abstract:
Diversity in demonstration selection is crucial for enhancing model generalization, as it enables a broader coverage of structures and concepts. However, constructing an appropriate set of demonstrations has remained a focal point of research. This paper presents Relevance-Diversity Enhanced Selection (RDES), an innovative approach that leverages reinforcement learning to optimize the selection of diverse reference demonstrations for text classification tasks using Large Language Models (LLMs), especially in few-shot prompting scenarios. RDES employs a Q-learning framework to dynamically identify demonstrations that maximize both diversity and relevance to the classification objective by calculating a diversity score based on label distribution among selected demonstrations. This method ensures a balanced representation of reference data, leading to improved classification accuracy. Through extensive experiments on four benchmark datasets and involving 12 closed-source and open-source LLMs, we demonstrate that RDES significantly enhances classification accuracy compared to ten established baselines. Furthermore, we investigate incorporating Chain-of-Thought (CoT) reasoning, which further enhances the model's predictive performance. The results underscore the potential of reinforcement learning to facilitate adaptive demonstration selection and deepen the understanding of classification challenges.
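One plausible instantiation of a label-distribution diversity score is the entropy of the labels among the selected demonstrations; the paper's exact score may differ, so treat this as an illustrative assumption:

```python
import math
from collections import Counter

def diversity_score(labels):
    """Shannon entropy of the label distribution among the selected
    demonstrations: higher when labels are spread evenly across
    classes, zero when all selected demonstrations share one label."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())
```

A reinforcement-learning selector can fold such a score into its reward so that picking yet another demonstration of an already-represented class is penalized.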
Paperid:463
Authors:David Boetius · Stefan Leue · Tobias Sutter
Abstract:
Probabilistic verification of neural networks is concerned with formally analysing the output distribution of a neural network under a probability distribution of the inputs. Examples of probabilistic verification include verifying the demographic parity fairness notion or quantifying the safety of a neural network. We present a new algorithm for the probabilistic verification of neural networks based on an algorithm for computing and iteratively refining lower and upper bounds on probabilities over the outputs of a neural network. By applying state-of-the-art bound propagation and branch and bound techniques from non-probabilistic neural network verification, our algorithm significantly outpaces existing probabilistic verification algorithms, reducing solving times for various benchmarks from the literature from tens of minutes to tens of seconds. Furthermore, our algorithm compares favourably even to dedicated algorithms for restricted subsets of probabilistic verification. We complement our empirical evaluation with a theoretical analysis, proving that our algorithm is sound and, under mildly restrictive conditions, also complete when using a suitable set of heuristics.
Paperid:464
Authors:Steinar Laenen · Peter Macgregor · He Sun
Abstract:
In the kernel density estimation (KDE) problem, we are given a set $X$ of data points in $\mathbb{R}^d$, a kernel function $k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$, and a query point $\mathbf{q} \in \mathbb{R}^d$, and the objective is to quickly output an estimate of $\sum_{\mathbf{x} \in X} k(\mathbf{q}, \mathbf{x})$. In this paper, we consider KDE in the dynamic setting, and introduce a data structure that efficiently maintains the _estimates_ for a set of query points as data points are added to $X$ over time. Based on this, we design a dynamic data structure that maintains a sparse approximation of the fully connected similarity graph on $X$, and develop a fast dynamic spectral clustering algorithm. We further evaluate the effectiveness of our algorithms on both synthetic and real-world datasets.
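The quantity being maintained is the plain KDE sum; a naive static baseline for a Gaussian kernel looks as follows (the paper's contribution is maintaining such estimates dynamically, far faster than recomputing this sum):

```python
import math

def kde(query, points, bandwidth=1.0):
    """Naive O(|X| d) evaluation of the KDE sum with a Gaussian kernel
    k(q, x) = exp(-||q - x||^2 / (2 h^2)), where h is the bandwidth.
    A dynamic data structure would keep this value up to date as
    points are inserted, instead of rescanning all of X per query."""
    total = 0.0
    for x in points:
        dist2 = sum((qi - xi) ** 2 for qi, xi in zip(query, x))
        total += math.exp(-dist2 / (2.0 * bandwidth ** 2))
    return total
```

Each insertion into $X$ changes every query's estimate by one kernel evaluation, which is what makes incremental maintenance natural.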
Paperid:465
Authors:Swetha Ganesh · Washim Mondal · Vaneet Aggarwal
Abstract:
This work examines average-reward reinforcement learning with general policy parametrization. Existing state-of-the-art (SOTA) guarantees for this problem are either suboptimal or hindered by several challenges, including poor scalability with respect to the size of the state-action space, high iteration complexity, and dependence on knowledge of mixing times and hitting times. To address these limitations, we propose a Multi-level Monte Carlo-based Natural Actor-Critic (MLMC-NAC) algorithm. Our work is the first to achieve a global convergence rate of $\widetilde{\mathcal{O}}(1/\sqrt{T})$ for average-reward Markov Decision Processes (MDPs) (where $T$ is the horizon length), without requiring knowledge of mixing and hitting times. Moreover, the convergence rate does not scale with the size of the state space, making the algorithm applicable even to infinite state spaces.
Paperid:466
Authors:Boyuan Wu · wang · Xianwei Lin · Jiachun Xu · Jikai Yu · Zhou Shicheng · Hongda Chen · Lianxin Hu
Abstract:
Whole Slide Image (WSI) analysis is framed as a Multiple Instance Learning (MIL) problem, but existing methods struggle with non-stackable data due to inconsistent instance lengths, which degrades performance and efficiency. We propose a Distributed Parallel Gradient Stacking (DPGS) framework with Deep Model-Gradient Compression (DMGC) to address this. DPGS enables lossless MIL data stacking for the first time, while DMGC accelerates distributed training via joint gradient-model compression. Experiments on Camelyon16 and TCGA-Lung datasets demonstrate up to 31× faster training, up to a 99.2% reduction in model communication size at convergence, and up to a 9.3% improvement in accuracy compared to the baseline. To our knowledge, this is the first work to solve non-stackable data in MIL while improving both speed and accuracy.
Paperid:467
Authors:Ankit Ghosh · Gargee Kashyap · Sarthak Mittal · Nupur Jain · Raghavan B Sunoj · Abir De
Abstract:
Yield of chemical reactions generally depends on the activation barrier, i.e., the energy difference between the reactant and the transition state. Computing the transition state from the reactant and product graphs requires prior knowledge of the correct node alignment (i.e., atom mapping), which is not available in yield prediction datasets. In this work, we propose YieldNet, a neural yield prediction model, which tackles these challenges. Here, we first approximate the atom mapping between the reactants and products using a differentiable node alignment network. We then use this approximate atom mapping to obtain a noisy realization of the condensed graph of reaction (CGR), which is a supergraph encompassing both the reactants and products. This CGR serves as a surrogate for the transition state graph structure. The CGR embeddings of different steps in a multi-step reaction are then passed into a transformer-guided reaction path encoder. Our experiments show that YieldNet can predict the yield more accurately than the baselines. Furthermore, the model is trained only under the distant supervision of yield values, without requiring fine-grained supervision of atom mapping.
Paperid:468
Authors:Filip Szatkowski · Yaoyue Zheng · Fei Yang · Tomasz Trzcinski · Bartłomiej Twardowski · Joost van de Weijer
Abstract:
Continual learning is crucial for applying machine learning in challenging, dynamic, and often resource-constrained environments. However, catastrophic forgetting — overwriting previously learned knowledge when new information is acquired — remains a major challenge. In this work, we examine the intermediate representations in neural network layers during continual learning and find that such representations are less prone to forgetting, highlighting their potential to accelerate computation. Motivated by these findings, we propose to use auxiliary classifiers (ACs) to enhance performance and demonstrate that integrating ACs into various continual learning methods consistently improves accuracy across diverse evaluation settings, yielding an average 10% relative gain. We also leverage the ACs to reduce the average inference cost by 10-60% without compromising accuracy, enabling the model to return predictions before computing all the layers. Our approach provides a scalable and efficient solution for continual learning.
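The early-exit idea behind the inference savings can be sketched as follows, with each auxiliary classifier modeled as a lazy callable returning a prediction and a confidence (an illustrative mechanism, not the paper's implementation):

```python
def early_exit(aux_heads, threshold=0.9):
    """Early-exit inference with auxiliary classifiers: aux_heads is a
    list of zero-argument callables, one per layer, each returning a
    (prediction, confidence) pair when invoked. We stop at the first
    head whose confidence clears the threshold, so deeper layers are
    never evaluated; otherwise we fall back to the final head's answer."""
    pred, conf = None, 0.0
    for depth, head in enumerate(aux_heads):
        pred, conf = head()
        if conf >= threshold:
            return pred, depth
    return pred, len(aux_heads) - 1
```

Because the heads are invoked lazily, an exit at depth $k$ skips the compute of all layers beyond $k$, which is where the 10-60% average cost reduction would come from.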
Paperid:469
Authors:Jerry Huang · Peng Lu · QIUHAO Zeng
Abstract:
Recent advances in natural language processing (NLP) have opened up greater opportunities to enable fine-tuned large language models (LLMs) to behave as more powerful interactive agents through improved instruction-following ability. However, how this impacts confidence calibration for reliable model output has not been fully explored. In this work, we examine various open-sourced LLMs, identifying significant calibration degradation after instruction tuning in each. Seeking a practical solution, we look towards label smoothing, which has been shown to be an effective method to regularize overconfident predictions but has yet to be widely adopted in the supervised fine-tuning (SFT) of LLMs. We first provide insight as to why label smoothing is sufficient to maintain calibration throughout the SFT process. However, settings remain where the effectiveness of smoothing is severely diminished, in particular the case of large-vocabulary LLMs (LV-LLMs). We posit that the cause stems from the model's capacity to become over-confident, which has a direct relationship with the hidden size and vocabulary size, and we justify this theoretically and experimentally. Finally, we address an outstanding issue regarding the memory footprint of the cross-entropy loss computation in the label-smoothed setting, designing a customized kernel that dramatically reduces memory consumption without sacrificing speed or performance in comparison to existing solutions for non-smoothed losses.
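For reference, label-smoothed cross-entropy moves a fraction eps of the target mass off the gold token and spreads it uniformly over the rest of the vocabulary, which caps how confident the model can usefully become. A minimal per-position sketch (standard label smoothing, not the paper's customized kernel):

```python
import math

def smoothed_cross_entropy(logits, target, eps=0.1):
    """Cross-entropy against a label-smoothed target distribution:
    probability mass 1 - eps on the gold token and eps / (V - 1)
    spread over the remaining V - 1 vocabulary entries."""
    V = len(logits)
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))  # stable log-sum-exp
    loss = 0.0
    for i, logit in enumerate(logits):
        q = (1.0 - eps) if i == target else eps / (V - 1)
        loss -= q * (logit - log_z)  # -sum_i q_i * log p_i
    return loss
```

With eps = 0 this reduces to ordinary negative log-likelihood; with eps > 0 a very confident gold logit is penalized, since the smoothed target keeps mass on non-gold tokens.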
Paperid:470
Authors:Francesco Emanuele Stradi · Anna Lunghi · Matteo Castiglioni · Alberto Marchesi · Nicola Gatti
Abstract:
We study online learning in constrained Markov decision processes (CMDPs) in which rewards and constraints may be either stochastic or adversarial. In such settings, Stradi et al. (2024) proposed the first best-of-both-worlds algorithm able to seamlessly handle stochastic and adversarial constraints, achieving optimal regret and constraint violation bounds in both cases. This algorithm suffers from two major drawbacks. First, it only works under full feedback, which severely limits its applicability in practice. Moreover, it relies on optimizing over the space of occupancy measures, which requires solving convex optimization problems, a highly inefficient task. In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with bandit feedback. Specifically, when the constraints are stochastic, the algorithm achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ regret and constraint violation, while, when they are adversarial, it attains $\widetilde{\mathcal{O}}(\sqrt{T})$ constraint violation and a tight fraction of the optimal reward. Moreover, our algorithm is based on a policy optimization approach, which is much more efficient than occupancy-measure-based methods.
Paperid:471
Authors:Zhiqiang Wang · Jianghao Wen · Jianqing Liang
Abstract:
Dynamic graph representation learning using Spiking Neural Networks (SNNs) exploits the temporal spiking behavior of neurons, offering advantages in capturing the temporal evolution and sparsity of dynamic graphs. However, existing SNN-based methods often fail to effectively capture the impact of latency in information propagation on node representations. To address this, we propose Delay-DSGN, a dynamic spiking graph neural network incorporating a learnable delay mechanism. By leveraging synaptic plasticity, the model dynamically adjusts connection weights and propagation speeds, enhancing temporal correlations and enabling historical data to influence future representations. Specifically, we introduce a Gaussian delay kernel into the neighborhood aggregation process at each time step, adaptively delaying historical information to future time steps and mitigating information forgetting. Experiments on three large-scale dynamic graph datasets demonstrate that Delay-DSGN outperforms eight state-of-the-art methods, achieving the best results in node classification tasks. We also theoretically derive the constraint conditions between the Gaussian kernel's standard deviation and size, ensuring stable training and preventing gradient explosion and vanishing issues.
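One simple way to realize a Gaussian delay kernel over past time steps is to weight historical messages by a normalized Gaussian centered at a learnable delay; a sketch under that assumption (illustrative, not the paper's exact formulation):

```python
import math

def gaussian_delay_weights(delay, sigma, size):
    """Normalized Gaussian kernel over the past `size` time steps,
    centered `delay` steps in the past. Aggregating neighbor messages
    with these weights spreads historical information over future
    steps instead of delivering it instantaneously; delay and sigma
    would be learnable in the actual model."""
    w = [math.exp(-((t - delay) ** 2) / (2.0 * sigma ** 2)) for t in range(size)]
    s = sum(w)
    return [x / s for x in w]
```

The abstract's constraint between the kernel's standard deviation and size would correspond to keeping these weights well-conditioned so gradients through them neither explode nor vanish.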
Paperid:472
Authors:Robert Busa-Fekete · Travis Dick · Claudio Gentile · Haim Kaplan · Tomer Koren · Uri Stemmer
Abstract:
We investigate Learning from Label Proportions (LLP), a partial information setting where examples in a training set are grouped into bags, and only aggregate label values in each bag are available. Despite the partial observability, the goal is still to achieve small regret at the level of individual examples. We give results on the sample complexity of LLP under square loss, showing that our sample complexity is essentially optimal. From an algorithmic viewpoint, we rely on carefully designed variants of Empirical Risk Minimization, and Stochastic Gradient Descent algorithms, combined with ad hoc variance reduction techniques. On one hand, our theoretical results improve in important ways on the existing literature on LLP, specifically in the way the sample complexity depends on the bag size. On the other hand, we validate our algorithmic solutions on several datasets, demonstrating improved empirical performance (better accuracy with fewer samples) against recent baselines.
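In the LLP setting under square loss, a natural bag-level objective compares the model's mean prediction over a bag with the observed label proportion; a minimal illustration of that setup (not necessarily the exact estimator analyzed in the paper):

```python
def bag_square_loss(predictions, bag_proportion):
    """Bag-level square loss for Learning from Label Proportions: only
    the aggregate label of each bag is observed, so the model's mean
    prediction over the bag's examples is compared to that observed
    label proportion rather than to per-example labels."""
    avg = sum(predictions) / len(predictions)
    return (avg - bag_proportion) ** 2
```

The difficulty the paper addresses is that minimizing such bag-level losses must still yield small regret at the level of individual examples, with sample complexity that degrades gracefully in the bag size.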
Paperid:473
Authors:Yusuke Yamasaki · Kenta Niwa · Takumi Fukami · Takayuki Miura · Daiki Chijiwa
Abstract:
We propose Plausible Token Amplification (PTA) to improve the accuracy of Differentially Private In-Context Learning (DP-ICL) using DP synthetic demonstrations. While Tang et al. empirically improved the accuracy of DP-ICL by limiting vocabulary space during DP synthetic demonstration generation, its theoretical basis remains unexplored. By interpreting ICL as implicit Bayesian inference on a concept underlying demonstrations, we not only provide theoretical evidence supporting Tang et al.'s empirical method but also introduce PTA, a refined method for modifying the next-token probability distribution. Through the modification, PTA highlights tokens that distinctly represent the ground-truth concept underlying the original demonstrations. As a result, generated DP synthetic demonstrations guide the Large Language Model to successfully infer the ground-truth concept, which improves the accuracy of DP-ICL. Experimental evaluations on both synthetic and real-world text-classification datasets validated the effectiveness of PTA.
Paperid:474
Authors:Chong Zhang · Peng Zhang · Mengke Li
Abstract:
Scattering characteristics of synthetic aperture radar (SAR) targets are typically related to the observed azimuth and depression angles. However, in practice, it is difficult to obtain adequate training samples at all observation angles, which probably leads to poor robustness of deep networks. In this paper, we first propose a Gamma-Distribution Principal Component Analysis ($\Gamma$PCA) model that fully accounts for the statistical characteristics of SAR data. The $\Gamma$PCA derives consistency convolution kernels to effectively capture the angle-invariant features of the same target at various attitude angles, thus alleviating deep models' sensitivity to angle changes in the SAR target recognition task. We validate the $\Gamma$PCA model on two different backbones, ResNet and ViT, and conduct multiple robustness experiments on the MSTAR benchmark dataset. The experimental results demonstrate that $\Gamma$PCA effectively enables the model to withstand the substantial distributional discrepancy caused by angle changes. Additionally, the $\Gamma$PCA convolution kernel is designed to require no parameter updates, thereby bringing no computational burden to a network.
Paperid:475
Authors:Ziyang Wu · Jingyuan Zhang · Druv Pai · XuDong Wang · Chandan Singh · Jianwei Yang · Jianfeng Gao · Yi Ma
Abstract:
DINO and DINOv2 are two model families being widely used to learn representations from unlabeled imagery data at large scales. Their learned representations often enable state-of-the-art performance for downstream tasks, such as image classification and segmentation. However, they employ many empirically motivated design choices and their training pipelines are highly complex and unstable --- many hyperparameters need to be carefully tuned to ensure that the representations do not collapse --- which poses considerable difficulty to improving them or adapting them to new domains. In this work, we posit that we can remove most of these empirically motivated idiosyncrasies in the pre-training pipelines, and only need to add an explicit coding rate term in the loss function to avoid collapse of the representations. As a result, we obtain highly simplified variants of DINO and DINOv2, which we call SimDINO and SimDINOv2, respectively. Remarkably, these simplified models are more robust to different design choices, such as network architecture and hyperparameters, and they learn even higher-quality representations, measured by performance on downstream tasks, offering a Pareto improvement over the corresponding DINO and DINOv2 models. This work highlights the potential of using simplifying design principles to improve the empirical practice of deep learning.
Paperid:476
Authors:Chenyou Fan · Chenjia Bai · Zhao Shan · Haoran He · Yang Zhang · Zhen Wang
Abstract:
Diffusion models have demonstrated their capabilities in modeling the trajectories of multiple tasks. However, existing multi-task planners or policies typically rely on task-specific demonstrations via multi-task imitation, or require task-specific reward labels to facilitate policy optimization via Reinforcement Learning (RL). They are costly due to the substantial human efforts required to collect expert data or design reward functions. To address these challenges, we aim to develop a versatile diffusion planner capable of leveraging large-scale inferior data that contains task-agnostic sub-optimal trajectories, with the ability to rapidly adapt to specific tasks. In this paper, we propose SODP, a two-stage framework that leverages Sub-Optimal data to learn a Diffusion Planner, which is generalizable to various downstream tasks. Specifically, in the pre-training stage, we train a foundation diffusion planner that extracts general planning capabilities by modeling the versatile distribution of multi-task trajectories, which can be sub-optimal but have wide data coverage. Then, for downstream tasks, we adopt RL-based fine-tuning with task-specific rewards to quickly refine the diffusion planner, which aims to generate action sequences with higher task-specific returns. Experimental results from multi-task domains including Meta-World and Adroit demonstrate that SODP outperforms state-of-the-art methods with only a small amount of data for reward-guided fine-tuning.
Paperid:477
Authors:Yuxuan Zhou · Xien Liu · Chenwei Yan · Chen Ning · Xiao Zhang · Boxun Li · Xiangling Fu · Shijin Wang · Guoping Hu · Yu Wang · Ji Wu
Abstract:
Large language models (LLMs) have demonstrated remarkable performance on various medical benchmarks, but their capabilities across different cognitive levels remain underexplored. Inspired by Bloom's Taxonomy, we propose a multi-cognitive-level evaluation framework for assessing LLMs in the medical domain in this study. The framework integrates existing medical datasets and introduces tasks targeting three cognitive levels: preliminary knowledge grasp, comprehensive knowledge application, and scenario-based problem solving. Using this framework, we systematically evaluate state-of-the-art general and medical LLMs from five prominent families: Llama, Qwen, Gemma, Phi, and GPT. Our findings reveal a significant performance decline as cognitive complexity increases across evaluated models, with model size playing a more critical role in performance at higher cognitive levels. Our study highlights the need to enhance LLMs' medical capabilities at higher cognitive levels and provides insights for developing LLMs suited to real-world medical applications.
Paperid:478
Authors:Mingjia Li · Hong Qian · Tian-Zuo Wang · Shujun Li · Min Zhang · Aimin Zhou
Abstract:
Causal discovery aims to identify causal relationships from observational data. It is crucial in various scientific fields as it helps to understand underlying mechanisms. Recently, optimization-based causal discovery methods have attracted extensive attention in the literature due to their efficiency in handling high-dimensional problems. However, we observe that optimization-based methods often perform well on certain problems but struggle with others. This paper identifies a specific characteristic of causal structural equations that determines the difficulty of identification in causal discovery and, in turn, the performance of optimization-based methods. We conduct an in-depth study of the additive noise model (ANM) and propose to further divide identifiable problems into strongly and weakly identifiable types based on the difficulty of identification. We also provide a sufficient condition to distinguish between these two categories. Inspired by these findings, this paper further proposes GENE, a generic method to address both strongly and weakly identifiable problems in a unified way under the ANM assumption. GENE adopts an order-based search framework that incorporates conditional independence tests into the order fitness evaluation, ensuring effectiveness on weakly identifiable problems. Besides, the proposed order-based optimization method restricts the dimensionality of effect variables to ensure scale invariance, a property crucial for practical applications. Experiments on both synthetic and real-world datasets demonstrate that GENE is uniquely effective in addressing weakly identifiable problems. For strongly identifiable problems, GENE remains competitive with state-of-the-art causal discovery algorithms.
Paperid:479
Authors:Zhenglin Wan · Xingrui Yu · David Bossens · Yueming LYU · Qing Guo · Flint Xiaofeng Fan · Yew Soon ONG · Ivor Tsang
Abstract:
Imitation learning (IL) has shown promise in robot locomotion but is often limited to learning a single expert policy, constraining behavior diversity and robustness in unpredictable real-world scenarios. To address this, we introduce Quality Diversity Inverse Reinforcement Learning (QD-IRL), a novel framework that integrates quality-diversity optimization with IRL methods, enabling agents to learn diverse behaviors from limited demonstrations. This work introduces Extrinsic Behavioral Curiosity (EBC), which allows agents to receive additional curiosity rewards from an external critic based on how novel the behaviors are with respect to a large behavioral archive. To validate the effectiveness of EBC in exploring diverse locomotion behaviors, we evaluate our method on multiple robot locomotion tasks. EBC improves the performance of QD-IRL instances with GAIL, VAIL, and DiffAIL by up to 185\%, 42\%, and 150\% in Humanoid environments, even surpassing expert performance by 20\%. Furthermore, we demonstrate that EBC is applicable to Quality Diversity Reinforcement Learning (QD-RL), where it substantially improves performance and provides a generic technique for diverse robot locomotion.
Paperid:480
Authors:Iulia Duta · Pietro Lió
Abstract:
The importance of higher-order relations is widely recognized in numerous real-world systems. However, annotating them is a tedious and sometimes even impossible task. Consequently, current approaches for data modelling either ignore the higher-order interactions altogether or simplify them into pairwise connections. To facilitate higher-order processing, even when a hypergraph structure is not available, we introduce SPHINX, a model that learns to infer a latent hypergraph structure in an unsupervised way, solely from the final task-dependent signal. To ensure broad applicability, we design the model to be end-to-end differentiable, capable of generating a discrete hypergraph structure compatible with any modern hypergraph networks, and easily optimizable without requiring additional regularization losses. Through extensive ablation studies and experiments conducted on four challenging datasets, we demonstrate that our model is capable of inferring suitable latent hypergraphs in both transductive and inductive tasks. Moreover, the inferred latent hypergraphs are interpretable and contribute to enhancing the final performance, outperforming existing methods for hypergraph prediction.
Paperid:481
Authors:Shi Yin · Xinyang Pan · Fengyan Wang · Lixin He
Abstract:
We propose a framework to combine strong non-linear expressiveness with strict SO(3)-equivariance in predicting the electronic-structure Hamiltonian, by exploring the mathematical relationships between SO(3)-invariant and SO(3)-equivariant quantities and their representations. The proposed framework, called TraceGrad, first constructs theoretical SO(3)-invariant trace quantities derived from the Hamiltonian targets, and uses these invariant quantities as supervisory labels to guide the learning of high-quality SO(3)-invariant features. Given that SO(3)-invariance is preserved under non-linear operations, the learning of invariant features can extensively utilize non-linear mappings, thereby fully capturing the non-linear patterns inherent in physical systems. Building on this, we propose a gradient-based mechanism to induce SO(3)-equivariant encodings of various degrees from the learned SO(3)-invariant features. This mechanism can incorporate powerful non-linear expressive capabilities into SO(3)-equivariant features whose physical dimensions are consistent with the regression targets, while theoretically preserving equivariant properties, establishing a strong foundation for predicting the Hamiltonian. Our method achieves state-of-the-art prediction accuracy across eight challenging benchmark databases for Hamiltonian prediction. Experimental results demonstrate that this approach not only improves the accuracy of Hamiltonian prediction but also significantly enhances the prediction of downstream physical quantities, and markedly accelerates traditional Density Functional Theory algorithms.
Paperid:482
Authors:Zhining Liu · Ze Yang · Xiao Lin · Ruizhong Qiu · Tianxin Wei · Yada Zhu · Hendrik Hamann · Jingrui He · Hanghang Tong
Abstract:
Time-series forecasting plays a critical role in many real-world applications. Although increasingly powerful models have been developed and achieved superior results on benchmark datasets, through a fine-grained sample-level inspection, we find that (i) no single model consistently outperforms others across different test samples, but instead (ii) each model excels in specific cases. These findings prompt us to explore how to adaptively leverage the distinct strengths of various forecasting models for different samples. We introduce TimeFuse, a framework for collective time-series forecasting with sample-level adaptive fusion of heterogeneous models. TimeFuse utilizes meta-features to characterize input time series and trains a learnable fusor to predict optimal model fusion weights for any given input. The fusor can leverage samples from diverse datasets for joint training, allowing it to adapt to a wide variety of temporal patterns and thus generalize to new inputs, even from unseen datasets. Extensive experiments demonstrate the effectiveness of TimeFuse in various long-/short-term forecasting tasks, achieving near-universal improvement over the state-of-the-art individual models.
Paperid:483
Authors:Junyan Li · Yang Zhang · Muhammad Yusuf Hassan · Talha Chafekar · Tianle Cai · Zhile Ren · Pengsheng Guo · Foroozan Karimzadeh · Colorado Reed · Chong Wang · Chuang Gan
Abstract:
Large Language Models (LLMs) are increasingly used in applications requiring long context lengths, but the key-value (KV) cache often becomes a memory bottleneck on GPUs as context lengths grow. To address this, we propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference. First, we leverage additive quantization by introducing a lightweight encoder and codebook to compress the KV cache, which can then be decoded with a simple matrix multiplication. Second, to tackle the high computational costs during decoding, we design the codebook to be commutative with Rotary Position Embedding (RoPE), and utilize an Expectation-Maximization (EM) algorithm to learn the codebook. This enables efficient integration of decoding into the self-attention mechanism, significantly reducing computational overhead. Our approach achieves superior accuracy through additive quantization while lowering computational costs with our RoPE-commutative codebook. Experiments on long-context benchmarks and GSM8K demonstrate that our method reduces FP16 KV cache size by 87.5% while maintaining higher accuracy than state-of-the-art KV cache quantization methods. Remarkably, it enables 1-bit quantization of the KV cache with minimal accuracy degradation, making it possible to run the LLaMA-3.1 8B model with a 500K context length on a single 4090 GPU.
Paperid:484
Authors:Qian Peng · Yajie Bao · Haojie Ren · Zhaojun Wang · Changliang Zou
Abstract:
Conformal prediction is a powerful tool for constructing prediction intervals for black-box models, providing a finite sample coverage guarantee for exchangeable data. However, this exchangeability is compromised when some entries of the test feature are contaminated, such as in the case of cellwise outliers. To address this issue, this paper introduces a novel framework called *detect-then-impute conformal prediction*. This framework first employs an outlier detection procedure on the test feature and then utilizes an imputation method to fill in those cells identified as outliers. To quantify the uncertainty in the processed test feature, we adaptively apply the detection and imputation procedures to the calibration set, thereby constructing exchangeable features for the conformal prediction interval of the test label. We develop two practical algorithms, $\texttt{PDI-CP}$ and $\texttt{JDI-CP}$, and provide a distribution-free coverage analysis under some commonly used detection and imputation procedures. Notably, $\texttt{JDI-CP}$ achieves a finite sample $1-2\alpha$ coverage guarantee. Numerical experiments on both synthetic and real datasets demonstrate that our proposed algorithms exhibit robust coverage properties and comparable efficiency to the oracle baseline.
Paperid:485
Authors:Firas Laakom · Haobo Chen · Yuheng Bu · Jürgen Schmidhuber
Abstract:
Despite substantial progress in promoting fairness in high-stakes applications using machine learning models, existing methods often modify the training process—such as through regularizers or other interventions—but lack formal guarantees that fairness achieved during training will generalize to unseen data. Although overfitting with respect to prediction performance has been extensively studied, overfitting in terms of fairness loss has received far less attention. In this paper, we propose a theoretical framework for analyzing fairness generalization error through an information-theoretic lens. Our novel bounding technique is based on the Efron–Stein inequality, which allows us to derive tight information-theoretic fairness generalization bounds with both Mutual Information (MI) and Conditional Mutual Information (CMI). Our empirical results validate the tightness and practical relevance of these bounds across diverse fairness-aware learning algorithms. Our framework offers valuable insights to guide the design of algorithms improving fairness generalization.
Paperid:486
Authors:Hongcheng Guo · Yue Wang · Shaosheng Cao · Fei zhao · Boyang Wang · Lei Li · Liang Chen · Xinze Lyu · Zhe Xu · Yao Hu · Zhoujun Li
Abstract:
With the rapid advancement of Social Networking Services (SNS), the need for intelligent and efficient interaction within diverse platforms has become more crucial. Large Language Models (LLMs) play an important role in SNS as they possess the potential to revolutionize user experience, content generation, and communication dynamics. However, recent studies focus on isolated SNS tasks rather than a comprehensive evaluation. In this paper, we introduce SNS-Bench, specially constructed for assessing the abilities of large language models on different \textbf{S}ocial \textbf{N}etworking \textbf{S}ervices, with a wide range of SNS-related information. SNS-Bench encompasses 8 different tasks, such as note classification, query-content relevance, and highlight word generation in comments. In total, 6,658 questions on social media text, including subjective, single-choice, and multiple-choice questions, are included in SNS-Bench. We then evaluate the performance of more than 25 diverse current LLMs on SNS-Bench. Models of different sizes exhibit performance variations, yet adhere to the scaling law. Moreover, we hope to provide insights that help revolutionize the techniques of social networking services with LLMs.
Paperid:487
Authors:Dooho Lee · Myeong Kong · Sagad Hamid · Cheonwoo Lee · Jaemin Yoo
Abstract:
We revisit DropEdge, a data augmentation technique for GNNs which randomly removes edges to expose diverse graph structures during training. While being a promising approach to effectively reduce overfitting on specific connections in the graph, we observe that its potential performance gain in supervised learning tasks is significantly limited. To understand why, we provide a theoretical analysis showing that the limited performance of DropEdge comes from the fundamental limitation that exists in many GNN architectures. Based on this analysis, we propose Aggregation Buffer, a parameter block specifically designed to improve the robustness of GNNs by addressing the limitation of DropEdge. Our method is compatible with any GNN model, and shows consistent performance improvements on multiple datasets. Moreover, our method effectively addresses well-known problems such as degree bias or structural disparity as a unifying solution. Code and datasets are available at https://anonymous.4open.science/r/AggregationBuffer-46B4.
Paperid:488
Authors:Guy Hacohen · Tinne Tuytelaars
Abstract:
Catastrophic forgetting - the tendency of neural networks to forget previously learned data when learning new information - remains a major challenge in continual learning. In this work, we take a behavioral perspective, observing a connection between learning speed and forgetting: examples learned more quickly tend to be more resistant to forgetting. We then study replay-based continual learning and show that the choice of which examples populate the replay buffer - specifically, whether they were learned quickly or slowly - significantly impacts forgetting. Building on these insights, we introduce Speed-Based Sampling (SBS), a simple yet general strategy that selects replay examples based on their learning speed. SBS integrates seamlessly into existing buffer-based methods and consistently improves performance across a wide range of competitive continual learning benchmarks, advancing state-of-the-art results. Our findings highlight the importance of considering example susceptibility to forgetting when designing continual learning algorithms.
Paperid:489
Authors:Yi Xie · Jaedong Hwang · Carlos Brody · David Tank · Ila R. Fiete
Abstract:
Brains excel at robust decision-making and data-efficient learning. Understanding the architectures and dynamics underlying these capabilities can inform inductive biases for deep learning. We present a multi-region brain model that explores the normative role of structured memory circuits in a spatially embedded binary decision-making task. We counterfactually compare the learning performance and neural representations of reinforcement learning (RL) agents with brain models of different interaction architectures between grid and place cells in the entorhinal cortex and hippocampus, coupled with an action-selection cortical recurrent neural network. We demonstrate that a specific architecture--where grid cells receive and jointly encode abstract velocity signals for self-movement and decision evidence increments--optimizes learning efficiency while best reproducing experimental observations relative to alternative architectures. Our findings therefore suggest brain-inspired structured architectures for efficient RL. Importantly, the models make novel, testable predictions about the organization and information flow within the entorhinal-hippocampal-neocortical circuitry: we predict that grid cells must conjunctively encode position and evidence for effective spatial decision-making, directly motivating new neurophysiological experiments.
Paperid:490
Authors:Yuchen Lin · Ronan Le Bras · Kyle Richardson · Ashish Sabharwal · Radha Poovendran · Peter Clark · Yejin Choi
Abstract:
We investigate the logical reasoning capabilities of Large Language Models (LLMs) and their scalability across complex deductive tasks. Using ZebraLogic, a newly developed benchmark dataset of logic grid puzzles derived from constraint satisfaction problems (CSPs), we systematically evaluate LLM performance. ZebraLogic spans a broad range of search space complexities and incorporates diverse logical constraints, providing a controlled environment to assess reasoning abilities. Our results reveal a significant decline in accuracy as problem complexity increases—a phenomenon we term the “curse of complexity.” Notably, this limitation persists even with scaling model size and inference-time computation, suggesting fundamental constraints in current LLM reasoning capabilities. Additionally, we explore strategies such as Best-of-N sampling, backtracking mechanisms, and self-verification prompts to enhance logical reasoning performance. Our findings provide critical insights into the scaling behavior of LLMs, highlight their limitations, and outline potential directions for advancing their reasoning capabilities.
Paperid:491
Authors:Aaron Mueller · Atticus Geiger · Sarah Wiegreffe · Dana Arad · Iván Arcuschin · Adam Belfki · Yik Siu Chan · Jaden Fiotto-Kaufman · Tal Haklay · Michael Hanna · Jing Huang · Rohan Gupta · Yaniv Nikankin · Hadas Orgad · Nikhil Prakash · Anja Reusch · Aruna Sankaranarayanan · Shun Shao · Alessandro Stolfo · Martin Tutek · Amir Zur · David Bau · Yonatan Belinkov
Abstract:
Progress in mechanistic interpretability has been rapid, but there exist few standards for comparing methods. How can we know whether new methods achieve real improvements? In pursuit of meaningful and lasting evaluation standards, we propose MIB, a Mechanistic Interpretability Benchmark with two tracks spanning four tasks and five models. MIB favors methods that precisely and concisely recover relevant causal pathways or specific causal variables in neural language models. The circuit localization track compares methods that locate the model components—and connections between them—most important for performing a task (e.g., attribution patching or information flow routes). The causal variable localization track compares methods that featurize a hidden vector (e.g., sparse autoencoders or distributed alignment search) and locate mediators of a causal variable relevant to the task. Using MIB, we find that attribution and mask optimization methods perform best on circuit localization. We also find that supervised methods for causal variable localization outperform unsupervised methods, and both outperform the simplest baseline. These findings illustrate that MIB enables meaningful comparisons of MI methods, and increases our confidence that there has been real progress in the field.
Paperid:492
Authors:Alireza Naderi · Thiziri Nait Saada · Jared Tanner
Abstract:
Attention layers are the core component of transformers, the current state-of-the-art neural network architecture. Alternatives to softmax-based attention are being explored due to its tendency to hinder effective information flow. Even at initialisation, it remains poorly understood why the propagation of signals and gradients through these random networks can be pathological, resulting in issues known as (i) vanishing/exploding gradients and (ii) rank collapse in depth, i.e. when all tokens converge to a single representation along layers. While rank collapse in depth naturally arises from repeated matrix multiplications---a common pattern across various architectures---we identify an additional and previously unknown challenge unique to softmax attention layers: (iii) rank collapse in width, which occurs as the context length increases. Using Random Matrix Theory, we conduct a rigorous analysis that uncovers a spectral gap between the two largest singular values of the attention matrix as the cause of (iii), which in turn exacerbates (i) and (ii). Building on this insight, we propose a novel yet simple practical solution to mitigate rank collapse in width by removing the outlier eigenvalue(s). Our theoretical framework offers a fresh perspective on recent practical studies, such as (Ye et al., 2024; Ali et al., 2023), whose ad hoc solutions can now be interpreted as implicit efforts to address the spectral gap issue. This work provides valuable theoretical support for ongoing large-scale empirical research, bringing theory and practice one step closer in the understanding of transformers.
Paperid:493
Authors:Gaurav Ghosal · Pratyush Maini · Aditi Raghunathan
Abstract:
Large language models are susceptible to memorizing repeated sequences, posing serious privacy concerns. Existing attempts to isolate memorized information to specific neurons have seen limited success. Through experiments in a controlled setting, we demonstrate that standard training simply does not lead to "memorization" neurons that can be easily isolated, unless the sequence is extremely atypical. Instead, we argue for a new training paradigm designed to simultaneously achieve two seemingly conflicting goals: isolation—confine memorization to specific neurons, and generalization—preserve learning across the dataset. Generalization requires a subset of shared neurons to be active across all sequences. Inspired by the dynamics of memorization, we argue that isolation can be encouraged even in the presence of shared neurons by simply activating a fixed prespecified set of memorization neurons for each sequence. Put together, we propose a scalable and practical method called sequence-tied dropout that enables perfect post-hoc isolation while preserving general model capabilities on the TinyStories dataset. To the best of our knowledge, this is the first proof-of-concept on small-scale real data showing that simultaneous generalization and isolation is in fact feasible.
Paperid:494
Authors:Tim Vieira · Benjamin LeBrun · Mario Giulianelli · Juan Luis Gastaldi · Brian DuSell · John Terilla · Timothy O'Donnell · Ryan Cotterell
Abstract:
Modern language models are internally—and mathematically—distributions over token strings rather than character strings, posing numerous challenges for programmers building user applications on top of them. For example, if a prompt is specified as a character string, it must be tokenized before passing it to the token-level language model. Thus, the tokenizer and consequent analyses are very sensitive to the specification of the prompt (e.g., if the prompt ends with a space or not). This paper presents algorithms for converting token-level language models to character-level ones. We present both exact and approximate algorithms. In the empirical portion of the paper, we benchmark the practical runtime and approximation quality. We find that—even with a small computation budget—our method is able to accurately approximate the character-level distribution (less than 0.00021 excess bits / character) at reasonably fast speeds (46.3 characters / second) on the Llama 3.1 8B language model.
Paperid:495
Authors:Emiliano Penaloza · Tianyue Zhang · Laurent Charlin · Mateo Espinosa Zarlenga
Abstract:
Concept Bottleneck Models (CBMs) propose to enhance the trustworthiness of AI systems by constraining their decisions on a set of human-understandable concepts. However, CBMs typically rely on datasets with assumedly accurate concept labels—an assumption often violated in practice, which we show can significantly degrade performance. To address this, we introduce the Concept Preference Optimization (CPO) objective, a new loss function based on Direct Preference Optimization, which effectively mitigates the negative impact of concept mislabeling on CBM performance. We provide an analysis of some key properties of the CPO objective, showing that it directly optimizes for the concept's posterior distribution, and contrast it against Binary Cross-Entropy (BCE), where we show CPO is inherently less sensitive to concept noise. We empirically confirm our analysis, finding that CPO consistently outperforms BCE on three real-world datasets with and without added label noise.
Paperid:496
Authors:Nikolaos Chaidos · Angeliki Dimitriou · Maria Lymperaiou · Giorgos Stamou
Abstract:
Despite the dominance of convolutional and transformer-based architectures in image-to-image retrieval, these models are prone to biases arising from low-level visual features, such as color. Recognizing the lack of semantic understanding as a key limitation, we propose a novel scene graph-based retrieval framework that emphasizes semantic content over superficial image characteristics. Prior approaches to scene graph retrieval predominantly rely on supervised Graph Neural Networks (GNNs), which require ground truth graph pairs derived from image captions. However, the inconsistency of caption-based supervision stemming from variable text encodings undermines retrieval reliability. To address this, we present SCENIR, a Graph Autoencoder-based unsupervised retrieval framework, which eliminates the dependence on labeled training data. Our model demonstrates superior performance across metrics and runtime efficiency, outperforming existing vision-based, multimodal, and supervised GNN approaches. We further advocate for Graph Edit Distance (GED) as a deterministic and robust ground truth measure for scene graph similarity, replacing the inconsistent caption-based alternatives for the first time in image-to-image retrieval evaluation. Finally, we validate the generalizability of our method by applying it to unannotated datasets via automated scene graph generation, while substantially contributing to advancing the state of the art in counterfactual image retrieval.
Paperid:497
Authors:Yao Xiao · Binbin Yang · Weiyan Chen · Jiahao Chen · Zijie Cao · ZiYi Dong · Xiangyang Ji · Liang Lin · Wei Ke · Pengxu Wei
Abstract:
The remarkable evolution of generative models enables the generation of high-quality and visually attractive images. Notably, these images can be perceptually indistinguishable from real photographs for human eyes, therefore attracting increasing attention to developing models for AI-generated image (AIGI) detection. Intuitively, as the visual quality of generated images improves, the difficulty of forgery detection increases accordingly. However, with a systematic study on four cutting-edge text-to-image generators, we intriguingly find that generated images with higher quality scores, as assessed by human preference models, tend to be more easily detected by existing AIGI detectors. To investigate this counterintuitive phenomenon, we study how the text condition provided to the generator and the characteristics of the generated images affect the quality scores and the accuracy of detectors. We observe that images generated from short prompts tend to achieve higher preference scores while being easier to detect. Moreover, we cluster the images based on their embeddings from preference models, and conduct regression analyses to verify that image characteristics, such as saturation, contrast, and texture richness, collectively influence both quality scores and AIGI detector accuracy. Finally, we showcase that the performance of the off-the-shelf detectors can be enhanced across diverse generators and datasets by selecting input patches based on the predicted scores of our regression models, thereby substantiating the broader applicability of our findings.
Paperid:498
Authors:Filippo Rinaldi · Giacomo Capitani · Lorenzo Bonicelli · Angelo Porrello · Donato Crisostomi · Federico Bolelli · Emanuele Rodola · ELISA FICARRA · Simone Calderara
Abstract:
Foundation models serve as the backbone for numerous specialized models developed through fine-tuning. However, when the underlying pretrained model is updated or retrained (e.g., on larger and more curated datasets), the fine-tuned model becomes obsolete, losing its utility and requiring retraining. This raises the question: is it possible to transfer fine-tuning to a new release of the model? In this work, we investigate how to transfer fine-tuning to a new checkpoint without having to re-train, in a data-free manner. To do so, we draw principles from model re-basin and provide a recipe based on weight permutations to re-base the modifications made to the original base model, often called the task vector. In particular, our approach tailors model re-basin for Transformer models, taking into account the challenges of residual connections and multi-head attention layers. Specifically, we propose a two-level method rooted in spectral theory, initially permuting the attention heads and subsequently adjusting parameters within select pairs of heads. Through extensive experiments on visual and textual tasks, we achieve the seamless transfer of fine-tuned knowledge to new pre-trained backbones without relying on a single training step or datapoint.
Paperid:499
Authors:Beepul Bharti · Mary Clemens-Sewall · Paul H. Yi · Jeremias Sulam
Abstract:
As the use of predictive machine learning algorithms increases in high-stakes decision-making, it is imperative that these algorithms are fair across sensitive groups. However, measuring and enforcing fairness in real-world applications can be challenging due to missing or incomplete sensitive group data. Proxy-sensitive attributes have been proposed as a practical and effective solution in these settings, but only for parity-based fairness notions. How to evaluate and control for fairness with missing sensitive group data under newer and more flexible frameworks, such as multiaccuracy and multicalibration, remains unexplored. In this work, we address this gap by demonstrating that in the absence of sensitive group data, proxy-sensitive attributes can provably be used to derive actionable upper bounds on the true multiaccuracy and multicalibration, providing insights into a model’s potential worst-case fairness violations. Additionally, we show that adjusting models to satisfy multiaccuracy and multicalibration across proxy-sensitive attributes can significantly mitigate these violations for the true, but unknown, demographic groups. Through several experiments on real-world datasets, we illustrate that approximate multiaccuracy and multicalibration can be achieved even when sensitive group data is incomplete or unavailable.
Paperid:500
Authors:Hong-You Chen · Zhengfeng Lai · Haotian Zhang · Xinze Wang · Marcin Eichner · Keen You · Meng Cao · Bowen Zhang · Yinfei Yang · Zhe Gan
Abstract:
CLIP has been a celebrated method for training vision encoders to generate image/text representations facilitating various applications. Recently, it has been widely adopted as the vision backbone of multimodal large language models (MLLMs). The success of CLIP relies on aligning web-crawled noisy text annotations at the image level. However, such criteria may be insufficient for downstream tasks in need of fine-grained vision representations, especially when region-level understanding is demanded of MLLMs. We improve the localization capability of CLIP with several advances. Our proposed pre-training method, Contrastive Localized Language-Image Pre-training (CLOC), complements CLIP with a region-text contrastive loss and modules. We formulate a new concept, promptable embeddings, for which the encoder produces image embeddings that are easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually-enriched and spatially-localized captioning framework to effectively generate region-text labels. By scaling up to billions of annotated images, CLOC enables high-quality regional embeddings for recognition and retrieval tasks, and can be a drop-in replacement of CLIP to enhance MLLMs, especially on referring and grounding tasks.
Paperid:501
Authors:Amirhossein Roknilamouki · Arnob Ghosh · Ming Shi · Fatemeh Nourzad · Eylem Ekici · Ness Shroff
Abstract:
In Reinforcement Learning (RL), tasks with instantaneous hard constraints present significant challenges, particularly when the decision space is non-convex or non-star-convex. This issue is especially relevant in domains like autonomous vehicles and robotics, where constraints such as collision avoidance often take a non-convex form. In this paper, we establish a regret bound of $\tilde{\mathcal{O}}\bigl(\bigl(1 + \tfrac{1}{\tau}\bigr) \sqrt{\log\bigl(\tfrac{1}{\tau}\bigr) d^3 H^4 K} \bigr)$, applicable to both star-convex and non-star-convex cases, where $d$ is the feature dimension, $H$ the episode length, $K$ the number of episodes, and $\tau$ the safety threshold. Moreover, the violation of safety constraints is zero with high probability throughout the learning process. A key technical challenge in these settings is bounding the covering number of the value-function class, which is essential for achieving value-aware uniform concentration in model-free function approximation. For the star-convex setting, we develop a novel technique called $\textit{Objective–Constraint Decomposition}$ (OCD) to properly bound the covering number. This result also resolves an error in a previous work on constrained RL. In non-star-convex scenarios, where the covering number can become infinitely large, we propose a two-phase algorithm, Non-Convex Safe Least Squares Value Iteration (NCS-LSVI), which first reduces uncertainty about the safe set by playing a known safe policy. After that, it carefully balances exploration and exploitation to achieve the regret bound. Finally, numerical simulations on an autonomous driving scenario demonstrate the effectiveness of NCS-LSVI.
Paperid:502
Authors:Avery Ma · Yangchen Pan · Amir-massoud Farahmand
Abstract:
Many-shot jailbreaking circumvents the safety alignment of large language models by exploiting their ability to process long input sequences. To achieve this, the malicious target prompt is prefixed with hundreds of fabricated conversational turns between the user and the model. These fabricated exchanges are randomly sampled from a pool of malicious questions and responses, making it appear as though the model has already complied with harmful instructions. In this paper, we present PANDAS: a hybrid technique that improves many-shot jailbreaking by modifying these fabricated dialogues with Positive Affirmations, Negative Demonstrations, and an optimized Adaptive Sampling method tailored to the target prompt's topic. Extensive experiments on AdvBench and HarmBench, using state-of-the-art LLMs, demonstrate that PANDAS significantly outperforms baseline methods in long-context scenarios. Through an attention analysis, we provide insights on how long-context vulnerabilities are exploited and show how PANDAS further improves upon many-shot jailbreaking.
Paperid:503
Authors:Gwanhyeong Koo · Sunjae Yoon · Younghwan Lee · Ji Woo Hong · Chang Yoo
Abstract:
Drag-based editing allows precise object manipulation through point-based control, offering user convenience. However, current methods often suffer from a geometric inconsistency problem by focusing exclusively on matching user-defined points, neglecting the broader geometry and leading to artifacts or unstable edits. We propose FlowDrag, which leverages geometric information for more accurate and coherent transformations. Our approach constructs a 3D mesh from the image, using an energy function to guide mesh deformation based on user-defined drag points. The resulting mesh displacements are projected into 2D and incorporated into a UNet denoising process, enabling precise handle-to-target point alignment while preserving structural integrity. Additionally, existing drag-editing benchmarks provide no ground truth, making it difficult to assess how accurately the edits match the intended transformations. To address this, we present the VFD (VidFrameDrag) benchmark dataset, which provides ground-truth frames using consecutive shots in a video dataset. FlowDrag outperforms existing drag-based editing methods on both VFD Bench and DragBench.
Paperid:504
Authors:Shriya Bhatija · Paul-David Zuercher · Jakob Thumm · Thomas Bohné
Abstract:
In decision-making problems, the outcome of an intervention often depends on the causal relationships between system components and is highly costly to evaluate. In such settings, causal Bayesian optimization (CBO) can exploit the causal relationships between the system variables and sequentially perform interventions to approach the optimum with minimal data. Extending CBO to handle multiple objectives, we propose multi-objective causal Bayesian optimization (MO-CBO), a paradigm for identifying Pareto-optimal interventions within a known causal graph. We first derive a graphical characterization for potentially optimal sets of variables to intervene upon. Showing that any MO-CBO problem can be decomposed into several traditional multi-objective optimization tasks, we then introduce an algorithm that sequentially balances exploration across these tasks. The proposed method is validated on both synthetic and real-world causal graphs, demonstrating its superiority over traditional (non-causal) multi-objective Bayesian optimization in settings where causal information is available.
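For intuition on the Pareto-optimality criterion underlying such methods, here is a minimal filter over candidate intervention outcomes, where lower cost is better in every objective (the helper `pareto_mask` is an illustrative sketch, not the paper's algorithm):

```python
import numpy as np

def pareto_mask(costs):
    # costs: (n, k) array of n candidates over k objectives; lower is
    # better. Candidate i is Pareto-optimal iff no other candidate is
    # <= in all objectives and strictly < in at least one.
    n = costs.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        dominators = (np.all(costs <= costs[i], axis=1)
                      & np.any(costs < costs[i], axis=1))
        if dominators.any():
            mask[i] = False
    return mask

# Four candidate interventions evaluated on two objectives.
costs = np.array([[1.0, 2.0], [2.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
front = pareto_mask(costs)  # only the first two are non-dominated
```

The set of interventions surviving this filter is exactly the Pareto front that MO-CBO-style methods aim to identify with as few costly evaluations as possible.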
Paperid:505
Authors:Kai Lu · Shixiong Xu · Jinqiu Li · Kun Ding · Gaofeng Meng
Abstract:
Feedback from peer review is essential to improve the quality of scientific articles. However, at present, many manuscripts do not receive sufficient external feedback for refinement before or during submission. Therefore, a system capable of providing detailed and professional feedback is crucial for enhancing research efficiency. In this paper, we have compiled the largest dataset of paper reviews to date by collecting historical open-access papers and their corresponding review comments and standardizing them using an LLM. We then developed a multi-agent system that mimics real human review processes, based on LLMs. This system, named Agent Reviewers, includes the innovative introduction of multimodal reviewers to provide feedback on the visual elements of papers. Additionally, a shared memory pool that stores historical papers' metadata is preserved, which supplies reviewer agents with background knowledge from different fields. Our system is evaluated using ICLR 2024 papers and achieves superior performance compared to existing AI-based review systems. Comprehensive ablation studies further demonstrate the effectiveness of each module and agent in this system.
Paperid:506
Authors:Xiaowen Ma · Zhen-Liang Ni · Shuai Xiao · Xinghao Chen
Abstract:
In long-term time series forecasting, different variables often influence the target variable over distinct time intervals, a challenge known as the multi-delay issue. Traditional models typically process all variables or time points uniformly, which limits their ability to capture complex variable relationships and obtain non-trivial time representations. To address this issue, we propose TimePro, an innovative Mamba-based model that constructs variate- and time-aware hyper-states. Unlike conventional approaches that merely transfer plain states across variables or the time dimension, TimePro preserves the fine-grained temporal features of each variate token and adaptively selects the focused time points to tune the plain state. The reconstructed hyper-state is able to perceive both variable relationships and salient temporal information, which helps the model make accurate forecasts. In experiments, TimePro achieves competitive performance on eight real-world long-term forecasting benchmarks with satisfactory linear complexity.
Paperid:507
Authors:Meshi Bashari · Matteo Sesia · Yaniv Romano
Abstract:
Conformal prediction is a flexible framework for calibrating machine learning predictions, providing distribution-free statistical guarantees. In outlier detection, this calibration relies on a reference set of labeled inlier data to control the type-I error rate. However, obtaining a perfectly labeled inlier reference set is often unrealistic, and a more practical scenario involves access to a contaminated reference set containing a small fraction of outliers. This paper analyzes the impact of such contamination on the validity of conformal methods. We prove that under realistic, non-adversarial settings, calibration on contaminated data yields conservative type-I error control, shedding light on the inherent robustness of conformal methods. This conservativeness, however, typically results in a loss of power. To alleviate this limitation, we propose a novel, active data-cleaning framework that leverages a limited labeling budget and an outlier detection model to selectively annotate data points in the contaminated reference set that are suspected as outliers. By removing only the annotated outliers in this "suspicious" subset, we can effectively enhance power while mitigating the risk of inflating the type-I error rate, as supported by our theoretical analysis. Experiments on real datasets validate the conservative behavior of conformal methods under contamination and show that the proposed data-cleaning strategy improves power without sacrificing validity.
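The calibration step at the heart of this setting can be sketched with a conformal p-value for a single test point (a toy with synthetic Gaussian scores; contaminating `cal_scores` with outliers would inflate the p-values, i.e., make the test conservative, in line with the paper's analysis):

```python
import numpy as np

rng = np.random.default_rng(0)

# Detector scores: higher = more outlier-like. The calibration
# (reference) set is assumed to contain inliers.
cal_scores = rng.normal(0.0, 1.0, size=1000)
test_score = 3.5

# Conformal p-value: smoothed rank of the test score among the
# calibration scores. With clean calibration data, rejecting when
# p_value <= alpha controls the type-I error at level alpha.
p_value = (1 + np.sum(cal_scores >= test_score)) / (1 + len(cal_scores))

alpha = 0.05
is_outlier = p_value <= alpha
```

If some calibration points are themselves outliers, `cal_scores` shifts upward, p-values grow, and the test rejects less often, which is conservative for type-I error but costs power.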
Paperid:508
Authors:Evan Sidrow · Alexandre Bouchard-Côté · Lloyd Elliott
Abstract:
Bayesian phylogenetics requires accurate and efficient approximation of posterior distributions over trees. In this work, we develop a variational Bayesian approach for ultrametric phylogenetic trees. We present a novel variational family based on coalescent times of a single-linkage clustering and derive a closed-form density of the resulting distribution over trees. Unlike existing methods for ultrametric trees, our method performs inference over all of tree space, requires no Markov chain Monte Carlo subroutines, and uses a differentiable variational family. Through experiments on benchmark genomic datasets and an application to SARS-CoV-2, we demonstrate that our method achieves competitive accuracy while requiring significantly fewer gradient evaluations than existing state-of-the-art techniques.
Paperid:509
Authors:Xiong Wang · Yangze Li · Chaoyou Fu · Yike Zhang · Yunhang Shen · Lei Xie · Ke Li · Xing Sun · Long Ma
Abstract:
GPT-4o's excellent duplex speech interaction ability has given users an impressive experience. Researchers have recently proposed several multi-modal LLMs to achieve user-agent speech-to-speech conversations. In this paper, we propose a novel speech-text multimodal LLM architecture called Freeze-Omni, and our main contribution is that the speech input and output modalities can be easily connected to a textual LLM while keeping the LLM's parameters frozen throughout the training process. We effectively ensure that the intelligence of Freeze-Omni in the speech modality is on par with that in the text modality of its backbone LLM, while achieving low-latency end-to-end spoken responses. In addition, we also designed a method to achieve duplex dialogue ability through multi-task training, giving Freeze-Omni a more natural style of dialogue between users and agents. In summary, Freeze-Omni holds great potential to conduct speech-to-speech dialogue based on a multimodal LLM under the condition of a frozen LLM, avoiding the catastrophic forgetting problem caused by limited data and training resources.
Paperid:510
Authors:Francesco Emanuele Stradi · Matteo Castiglioni · Alberto Marchesi · Nicola Gatti
Abstract:
We study online learning in constrained Markov decision processes (CMDPs) with adversarial losses and stochastic hard constraints, under bandit feedback. We consider three scenarios. In the first one, we address general CMDPs, where we design an algorithm attaining sublinear regret and cumulative positive constraints violation. In the second scenario, under the mild assumption that a policy strictly satisfying the constraints exists and is known to the learner, we design an algorithm that achieves sublinear regret while ensuring that constraints are satisfied at every episode with high probability. In the last scenario, we only assume the existence of a strictly feasible policy, which is not known to the learner, and we design an algorithm attaining sublinear regret and constant cumulative positive constraints violation. Finally, we show that in the last two scenarios, a dependence on Slater's parameter is unavoidable. To the best of our knowledge, our work is the first to study CMDPs involving both adversarial losses and hard constraints. Thus, our algorithms can deal with general non-stationary environments subject to requirements much stricter than those manageable with existing ones, enabling their adoption in a much wider range of applications.
Paperid:511
Authors:Oren Mangoubi · Neil He · Nisheeth K. Vishnoi
Abstract:
We introduce a framework for designing efficient diffusion models for $d$-dimensional symmetric-space Riemannian manifolds, including the torus, sphere, special orthogonal group and unitary group. Existing manifold diffusion models often depend on heat kernels, which lack closed-form expressions and require either $d$ gradient evaluations or exponential-in-$d$ arithmetic operations per training step. We introduce a new diffusion model for symmetric manifolds with a spatially-varying covariance, allowing us to leverage a projection of Euclidean Brownian motion to bypass heat kernel computations. Our training algorithm minimizes a novel efficient objective derived via Itô's Lemma, allowing each step to run in $O(1)$ gradient evaluations and nearly-linear-in-$d$ ($O(d^{1.19})$) *arithmetic* operations, reducing the gap between diffusions on symmetric manifolds and Euclidean space. Manifold symmetries ensure the diffusion satisfies an "average-case" Lipschitz condition, enabling accurate and efficient sample generation. Empirically, our model outperforms prior methods in training speed and improves sample quality on synthetic datasets on the torus, special orthogonal group, and unitary group.
Paperid:512
Authors:Xiulong Yuan · Hongtao Xu · Wenting Shen · Ang Wang · Xiafei Qiu · Jie Zhang · Yuqiong Liu · Bowen Yu · Junyang Lin · Mingzhen Li · Weile Jia · Yong Li · Wei Lin
Abstract:
Long context fine-tuning of large language models (LLMs) involves training on datasets that are predominantly composed of short sequences and a small proportion of longer sequences. However, existing approaches overlook this long-tail distribution and employ training strategies designed specifically for long sequences. Moreover, these approaches also fail to address the challenges posed by variable sequence lengths during distributed training, such as load imbalance in data parallelism and severe pipeline bubbles in pipeline parallelism. These issues lead to suboptimal training performance and poor GPU resource utilization. To tackle these problems, we propose a chunk-centric training method named ChunkFlow. ChunkFlow reorganizes input sequences into uniformly sized chunks by consolidating short sequences and splitting longer ones. This approach achieves optimal computational efficiency and balance among training inputs. Additionally, ChunkFlow incorporates a state-aware chunk scheduling mechanism to ensure that the peak memory usage during training is primarily determined by the chunk size rather than the maximum sequence length in the dataset. Integrating this scheduling mechanism with existing pipeline scheduling algorithms further enhances the performance of distributed training. Experimental results demonstrate that, compared with Megatron-LM, ChunkFlow can be up to 4.53x faster in the long context fine-tuning of LLMs. Furthermore, we believe that ChunkFlow serves as an effective solution for a broader range of scenarios, such as long context continual pre-training, where datasets contain variable-length sequences.
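The chunk reorganization idea can be sketched as split-then-pack (first-fit-decreasing is an illustrative packing choice here; the actual ChunkFlow scheduler is more elaborate):

```python
def pack_into_chunks(seq_lengths, chunk_size):
    """Split sequences longer than chunk_size, then greedily pack the
    resulting (sequence_id, length) pieces into chunks holding at most
    chunk_size tokens each."""
    pieces = []
    for sid, length in enumerate(seq_lengths):
        while length > chunk_size:          # split long sequences
            pieces.append((sid, chunk_size))
            length -= chunk_size
        if length:
            pieces.append((sid, length))
    chunks = []
    # First-fit-decreasing bin packing of the pieces.
    for piece in sorted(pieces, key=lambda p: -p[1]):
        for chunk in chunks:
            if sum(l for _, l in chunk) + piece[1] <= chunk_size:
                chunk.append(piece)
                break
        else:
            chunks.append([piece])
    return chunks

# Long-tail example: one long sequence, several short ones.
chunks = pack_into_chunks([10, 3, 3, 2], chunk_size=8)
```

Because every chunk is bounded by `chunk_size`, peak activation memory is governed by the chunk size rather than the longest sequence in the dataset, which is the property the scheduling mechanism relies on.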
Paperid:513
Authors:Chenglong Wang · Yang Gan · Yifu Huo · Yongyu Mu · Qiaozhi He · MuRun Yang · Bei Li · Tong Xiao · Chunliang Zhang · Tongran Liu · Jingbo Zhu
Abstract:
In aligning large language models (LLMs), reward models have played an important role, but are standardly trained as discriminative models and rely only on labeled human preference data. In this paper, we explore methods that train reward models using both unlabeled and labeled data. Building on the generative models in LLMs, we develop a generative reward model that is first trained via large-scale unsupervised learning and then fine-tuned via supervised learning. We also show that by using label smoothing, we are in fact optimizing a regularized pairwise ranking loss. This result, in turn, provides a new view of training reward models, which links generative models and discriminative models under the same class of training objectives. The outcome of these techniques is a foundation reward model, which can be applied to a wide range of tasks with little or no further fine-tuning effort. Extensive experiments show that this model generalizes well across several tasks, including response ranking, reinforcement learning from human feedback, and task adaptation with fine-tuning, achieving significant performance improvements over several strong baseline models.
Paperid:514
Authors:Dimitri von Rütte · Janis Fluri · Yuhui Ding · Antonio Orvieto · Bernhard Schölkopf · Thomas Hofmann
Abstract:
While state-of-the-art language models achieve impressive results through next-token prediction, they have inherent limitations such as the inability to revise already generated tokens. This has prompted exploration of alternative approaches such as discrete diffusion. However, masked diffusion, which has emerged as a popular choice due to its simplicity and effectiveness, reintroduces this inability to revise words. To overcome this, we generalize masked diffusion and derive the theoretical backbone of a family of general interpolating discrete diffusion (GIDD) processes offering greater flexibility in the design of the noising processes. Leveraging a novel diffusion ELBO, we achieve compute-matched state-of-the-art performance in diffusion language modeling. Exploiting GIDD's flexibility, we explore a hybrid approach combining masking and uniform noise, leading to improved sample quality and unlocking the ability for the model to correct its own mistakes, an area where autoregressive models notoriously have struggled. Our code and models will be released upon acceptance.
Paperid:515
Authors:Shenao Zhang · Zhihan Liu · Boyi Liu · Yufeng Zhang · Yingxiang Yang · Yongfei Liu · Liyu Chen · Tao Sun · Zhaoran Wang
Abstract:
Preference alignment in Large Language Models (LLMs) has significantly improved their ability to adhere to human instructions and intentions. However, existing direct alignment algorithms primarily focus on relative preferences and often overlook the qualitative aspects of responses, despite having access to preference data that includes reward scores from judge models during AI feedback. Striving to maximize the implicit reward gap between the chosen and the slightly inferior rejected responses can cause overfitting and unnecessary unlearning of the high-quality rejected responses. The unawareness of the reward scores also drives the LLM to indiscriminately favor the low-quality chosen responses and fail to generalize to optimal responses that are sparse in data. To overcome these shortcomings, our study introduces reward-conditioned LLM policies that discern and learn from the entire spectrum of response quality within the dataset, helping extrapolate to more optimal regions. We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset. The experimental results across various instruction-following and academic benchmarks demonstrate that our approach consistently boosts the performance of DPO by a considerable margin across diverse models. When applying our method to on-policy data, the resulting DPO model outperforms various baselines and achieves state-of-the-art results. Through comprehensive ablation studies, we demonstrate that our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere data expansion.
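The relabeling step can be sketched as conditioning each response on its judge score (the `<score=...>` tag format and the `reward_augment` helper are hypothetical illustrations, not the paper's exact conditioning scheme):

```python
def reward_augment(preference_pairs):
    # Turn each (prompt, chosen, rejected, r_chosen, r_rejected) tuple
    # into two score-conditioned examples, so the policy learns from the
    # full spectrum of response quality rather than only the binary
    # chosen/rejected signal.
    out = []
    for prompt, chosen, rejected, r_c, r_r in preference_pairs:
        out.append((f"<score={r_c}> {prompt}", chosen))
        out.append((f"<score={r_r}> {prompt}", rejected))
    return out

augmented = reward_augment(
    [("Explain DPO.", "clear answer", "vague answer", 9, 4)]
)
```

At inference time, conditioning on a high target score then steers generation toward the high-quality end of the learned spectrum.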
Paperid:516
Authors:Yuanze Wang · Yichao Yan · Jin · Shiming Song · Yilan Huang · Xingdong Sheng · Dianxi Shi
Abstract:
Visual localization aims to predict the absolute camera pose for a single query image. However, predominant methods focus on single-camera images and scenes with limited appearance variations, limiting their applicability to cross-domain scenes commonly encountered in real-world applications. Furthermore, the long-tail distribution of cross-domain datasets poses additional challenges for visual localization. In this work, we propose a novel cross-domain data generation framework to enhance visual localization methods. To achieve this, we first construct a cross-domain 3DGS to accurately model photometric variations and mitigate the interference of dynamic objects in large-scale scenes. We introduce a text-guided image editing model to enhance data diversity for addressing the long-tail distribution problem and design an effective fine-tuning strategy for it. Then, we develop an anchor-based method to generate high-quality datasets for visual localization. Finally, we introduce positional attention to address data ambiguities in cross-camera images. Extensive experiments show that our method achieves state-of-the-art accuracy, outperforming existing cross-domain visual localization methods by an average of 59% across all domains.
Paperid:517
Authors:Yuhang Cai · Kangjie Zhou · Jingfeng Wu · Song Mei · Michael Lindsey · Peter Bartlett
Abstract:
We establish the asymptotic implicit bias of gradient descent (GD) for generic non-homogeneous deep networks under exponential loss. Specifically, we characterize three key properties of GD iterates starting from a sufficiently small empirical risk, where the threshold is determined by a measure of the network's non-homogeneity. First, we show that a normalized margin induced by the GD iterates increases nearly monotonically. Second, we prove that while the norm of the GD iterates diverges to infinity, the iterates themselves converge in direction. Finally, we establish that this directional limit satisfies the Karush–Kuhn–Tucker (KKT) conditions of a margin maximization problem. Prior works on implicit bias have focused exclusively on homogeneous networks; in contrast, our results apply to a broad class of non-homogeneous networks satisfying a mild near-homogeneity condition. In particular, our results apply to networks with residual connections and non-homogeneous activation functions, thereby resolving an open problem posed by \citet{ji2020directional}.
Paperid:518
Authors:Zixiang Ai · Zichen Liu · Jiahuan Zhou
Abstract:
Vision GNN (ViG) demonstrates superior performance by representing images as graph structures, providing a more natural way to capture irregular semantic patterns beyond traditional grid- or sequence-based representations. To efficiently adapt ViG to downstream tasks, parameter-efficient fine-tuning techniques like visual prompting become increasingly essential. However, existing prompting methods are primarily designed for Transformer-based models, neglecting the rich topological relationships among nodes and edges in graph-based representations, limiting their capacity to model complex semantics. In this paper, we propose Vision Graph Prompting (VGP), a novel framework tailored for vision graph structures. Our core insight reveals that semantically connected components in the graph exhibit low-rank properties. Building on this observation, we introduce a semantic low-rank prompting method that decomposes low-rank semantic features and integrates them with prompts on vision graph topologies, capturing both global structural patterns and fine-grained semantic dependencies. Extensive experiments demonstrate our method significantly improves ViG’s transfer performance on diverse downstream tasks, achieving results comparable to full fine-tuning while maintaining parameter efficiency.
Paperid:519
Authors:Michael Crawshaw · Blake Woodworth · Mingrui Liu
Abstract:
Existing analysis of Local (Stochastic) Gradient Descent for heterogeneous objectives requires stepsizes $\eta \leq 1/K$ where $K$ is the communication interval, which ensures monotonic decrease of the objective. In contrast, we analyze Local Gradient Descent for logistic regression with separable, heterogeneous data using any stepsize $\eta > 0$. With $R$ communication rounds and $M$ clients, we show convergence at a rate $\mathcal{O}(1/\eta K R)$ after an initial unstable phase lasting for $\widetilde{\mathcal{O}}(\eta K M)$ rounds. This improves upon the existing $\mathcal{O}(1/R)$ rate for general smooth, convex objectives. Our analysis parallels the single machine analysis of Wu et al. (2024) in which instability is caused by extremely large stepsizes, but in our setting another source of instability is large local updates with heterogeneous objectives.
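The setting can be simulated directly: each client runs K local full-batch gradient steps on its own logistic loss over separable data, and the models are averaged each communication round (a toy sketch with an aggressive stepsize eta = 1 > 1/K; the data and hyperparameters are illustrative, not from the paper):

```python
import numpy as np

def local_gd_logistic(client_data, eta, K, R):
    # Local GD: K local GD steps on each client's logistic loss,
    # followed by parameter averaging, for R communication rounds.
    d = client_data[0][0].shape[1]
    w = np.zeros(d)
    for _ in range(R):
        local_models = []
        for X, y in client_data:
            w_m = w.copy()
            for _ in range(K):
                # Gradient of mean_i log(1 + exp(-y_i <x_i, w>)).
                grad = -(X * (y / (1.0 + np.exp(y * (X @ w_m))))[:, None]).mean(0)
                w_m -= eta * grad
            local_models.append(w_m)
        w = np.mean(local_models, axis=0)
    return w

# Two heterogeneous clients whose pooled 1-D data is linearly separable.
clients = [(np.array([[1.0], [2.0]]), np.array([1.0, 1.0])),
           (np.array([[-1.0], [-2.0]]), np.array([-1.0, -1.0]))]
w = local_gd_logistic(clients, eta=1.0, K=4, R=50)
```

On separable data the iterates eventually classify every point correctly even though eta exceeds the classical 1/K threshold, mirroring the stable phase the analysis describes after the initial transient.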
Paperid:520
Authors:Haohan Zou · Jie Feng · Hao Zhao · Yuanyuan Shi
Abstract:
Despite advances in learning-based methods, finding valid Lyapunov functions for nonlinear dynamical systems remains challenging. Current neural network approaches face two main issues: challenges in scalable verification and limited interpretability. To address these, we propose an end-to-end framework using transformers to construct analytical Lyapunov functions (local), which simplifies formal verification, enhances interpretability, and provides valuable insights for control engineers. Our framework consists of a transformer-based trainer that generates candidate Lyapunov functions and a falsifier that verifies candidate expressions and refines the model via risk-seeking policy gradient. Unlike Alfarano et al. (2024), which utilizes pre-training and seeks global Lyapunov functions for low-dimensional systems, our model is trained from scratch via reinforcement learning and succeeds in finding local Lyapunov functions for high-dimensional and non-polynomial systems. Given the analytical nature of the candidates, we employ efficient optimization methods for falsification during training and formal verification tools for the final verification. We demonstrate the efficiency of our approach on a range of nonlinear dynamical systems with up to ten dimensions and show that it can discover Lyapunov functions not previously identified in the control literature.
Paperid:521
Authors:Julia Gusak · Xunyi Zhao · Théotime Le Hellard · Zhe LI · Lionel Eyraud-Dubois · Olivier Beaumont
Abstract:
Training modern neural networks on memory-limited GPUs is challenging, as storing intermediate activations often exceeds available memory. Re-materialization, a technique that preserves exact computations, addresses this by selectively recomputing activations instead of storing them. However, existing methods either fail to scale, lack generality, or introduce excessive execution overhead. We introduce HiRemate, a hierarchical re-materialization framework that recursively partitions large computation graphs, applies optimized solvers at multiple levels, and merges solutions into a global efficient training schedule. This enables scalability to significantly larger graphs than prior ILP-based methods while keeping runtime overhead low. Designed for single-GPU models and activation re-materialization, HiRemate extends the feasibility of training networks with thousands of graph nodes, surpassing prior methods in both efficiency and scalability. Experiments on various types of networks yield up to 50-70% memory reduction with only 10-15% overhead, closely matching optimal solutions while significantly reducing solver time. Seamlessly integrating with PyTorch Autograd, HiRemate requires almost no code change to use, and will be open-sourced, enabling broad adoption in memory-constrained deep learning.
Paperid:522
Authors:Ningyuan Huang · Miguel Sarabia · Abhinav Moudgil · Pau Rodriguez · Luca Zappella · Federico Danieli
Abstract:
State-Space Models (SSMs), and particularly Mamba, have recently emerged as a promising alternative to Transformers. Mamba introduces input selectivity to its SSM layer (S6) and incorporates convolution and gating into its Mamba block. While these modifications do improve Mamba's performance over its SSM predecessors, it remains largely unclear how Mamba leverages the additional functionalities provided by input selectivity, and how these interact with the other operations in the Mamba architecture. In this work, we demystify the role of input selectivity in Mamba, investigating its impact on function approximation power, long-term memory, and associative recall capabilities. In particular: (i) We prove that the S6 layer of Mamba can represent projections onto Haar wavelets, providing an edge over the S4D layer in approximating discontinuous functions commonly arising in practice. (ii) We show how the S6 layer suffers from memory decay, and describe how it can dynamically counteract it. (iii) We provide analytical solutions to the MQAR associative recall task using the Mamba architecture with different mixers --- Mamba, Mamba-2, and S4D. We demonstrate the tightness of our theoretical constructions with empirical results on concrete tasks. Our findings offer a mechanistic understanding of Mamba and reveal opportunities for improvement.
Paperid:523
Authors:Peng Wang · Yong Li · Xiu-Shen Wei · Lin Zhao
Abstract:
Fine-grained hashing has become a powerful solution for rapid and efficient image retrieval, particularly in scenarios requiring high discrimination between visually similar categories. To enable each hash bit to correspond to specific visual attributes, we propose a novel method that harnesses learnable queries for attribute-aware hash code learning. This method deploys a tailored set of queries to capture and represent nuanced attribute-level information within the hashing process, thereby enhancing both the interpretability and relevance of each hash bit. Building on this query-based optimization framework, we incorporate an auxiliary branch to help alleviate the challenges of complex landscape optimization often encountered with low-bit hash codes. This auxiliary branch models high-order attribute interactions, reinforcing the robustness and specificity of the generated hash codes. Experimental results on benchmark datasets demonstrate that our method generates attribute-aware hash codes and consistently outperforms state-of-the-art techniques in retrieval accuracy and robustness, especially for low-bit hash codes, underscoring its potential in fine-grained image hashing tasks.
Paperid:524
Authors:Weihua Du · Yiming Yang · Sean Welleck
Abstract:
Multi-sample aggregation strategies, such as majority voting and best-of-N sampling, are widely used in contemporary large language models (LLMs) to enhance predictive accuracy across various tasks. A key challenge in this process is temperature selection, which significantly impacts model performance. Existing approaches either rely on a fixed default temperature or require labeled validation data for tuning, which are often scarce and difficult to obtain. This paper addresses the challenge of automatically identifying the (near)-optimal temperature for different LLMs using multi-sample aggregation strategies, without relying on task-specific validation data. We provide a comprehensive analysis of temperature's role in performance optimization, considering variations in model architectures, datasets, task types, model sizes, and predictive accuracy. Furthermore, we propose a novel entropy-based metric for automated temperature optimization, which consistently outperforms fixed-temperature baselines. Additionally, we incorporate a stochastic process model to enhance interpretability, offering deeper insights into the relationship between temperature and model performance.
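The ingredients named in this abstract can be sketched concretely. The minimal illustration below (helper names are ours) shows temperature-scaled sampling probabilities, majority voting over sampled answers, and the entropy of the empirical answer distribution; the latter is only a stand-in for the kind of label-free signal an entropy-based criterion can use, not the paper's actual metric:

```python
import math
from collections import Counter

def softmax_T(logits, T):
    """Temperature-scaled softmax: higher T flattens the distribution."""
    m = max(l / T for l in logits)                 # stabilize the exponentials
    exps = [math.exp(l / T - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def majority_vote(answers):
    """Aggregate multiple sampled answers by majority voting."""
    return Counter(answers).most_common(1)[0][0]

def answer_entropy(answers):
    """Shannon entropy of the empirical answer distribution.
    (Illustrative label-free signal; not the paper's exact metric.)"""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

Raising `T` spreads probability mass across answers, which in turn raises the entropy of the sampled answer set; a temperature criterion can exploit this relationship without labeled validation data.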
Paperid:525
Authors:Shyam Nuggehalli · Jifan Zhang · Lalit Jain · Robert Nowak
Abstract:
Class imbalance severely impacts machine learning performance on minority classes in real-world applications. While various solutions exist, active learning offers a fundamental fix by strategically collecting balanced, informative labeled examples from abundant unlabeled data. We introduce DIRECT, an algorithm that identifies class separation boundaries and selects the most uncertain nearby examples for annotation. By reducing the problem to one-dimensional active learning, DIRECT leverages established theory to handle batch labeling and label noise -- another common challenge in data annotation that particularly affects active learning methods. Our work presents the first comprehensive study of active learning under both class imbalance and label noise. Extensive experiments on imbalanced datasets show DIRECT reduces annotation costs by over 60% compared to state-of-the-art active learning methods and over 80% versus random sampling, while maintaining robustness to label noise.
Paperid:526
Authors:Peijie Dong · Zhenheng Tang · Xiang Liu · Lujun Li · Xiaowen Chu · Bo Li
Abstract:
Post-training compression reduces the computational and memory costs of large language models (LLMs), enabling resource-efficient deployment. However, existing compression benchmarks focus narrowly on language modeling (e.g., perplexity) and natural language understanding tasks (e.g., GLUE accuracy), ignoring the agentic capabilities—workflow, tool use/function call, long-context understanding, and real-world application. We introduce the Agent Compression Benchmark (ACBench), the first comprehensive benchmark for evaluating how compression impacts LLMs' agentic abilities. ACBench spans (1) 12 tasks across 4 capabilities (e.g., WorfBench for workflow generation, Needle-in-Haystack for long-context retrieval), (2) 4-bit quantization (GPTQ, AWQ) and 50% pruning (Wanda, SparseGPT), and (3) 15 models, including small (Gemma-2B), standard (Qwen2.5-7B), and distilled reasoning LLMs (DeepSeek-R1-Distill). Our experiments reveal compression tradeoffs: 4-bit quantization preserves workflow generation and tool use (1%--3% drop) but degrades real-world application accuracy by 10%--15%. We introduce ERank, Top-k Ranking Correlation, and Energy to systematize analysis. ACBench provides actionable insights for optimizing LLM compression in agentic scenarios, bridging the gap between algorithmic efficiency and real-world applicability.
Paperid:527
Authors:Daniel Csillag · Claudio Struchiner · Guilherme Tegoni Goedert
Abstract:
Quality statistical inference requires a sufficient amount of data, which can be missing or hard to obtain. To this end, prediction-powered inference has risen as a promising methodology, but existing approaches are largely limited to Z-estimation problems such as inference of means and quantiles. In this paper, we apply ideas of prediction-powered inference to e-values. By doing so, we inherit all the usual benefits of e-values -- such as anytime-validity, post-hoc validity and versatile sequential inference -- as well as greatly expand the set of inferences achievable in a prediction-powered manner. In particular, we show that every inference procedure that can be framed in terms of e-values has a prediction-powered counterpart, given by our method. We showcase the effectiveness of our framework across a wide range of inference tasks, from simple hypothesis testing and confidence intervals to more involved procedures for change-point detection and causal discovery, which were out of reach of previous techniques. Our approach is modular and easily integrable into existing algorithms, making it a compelling choice for practical applications.
Paperid:528
Authors:Qiyao Liang · Daoyuan Qian · Liu Ziyin · Ila R. Fiete
Abstract:
Composition—the ability to generate myriad variations from finite means—is believed to underlie powerful generalization. However, compositional generalization remains a key challenge for deep learning. A widely held assumption is that learning disentangled (factorized) representations naturally supports this kind of extrapolation. Yet, empirical results are mixed, with many generative models failing to recognize and compose factors to generate out-of-distribution (OOD) samples. In this work, we investigate a controlled 2D Gaussian "bump" generation task, demonstrating that standard generative architectures fail in OOD regions when training with partial data, even when supplied with fully disentangled (x,y) coordinates, re-entangling them through subsequent layers. By examining the model's learned kernels and manifold geometry, we show that this failure reflects a "memorization" strategy for generation through the superposition of training data rather than by combining the true factorized features. We show that models forced—through architectural modifications with regularization or curated training data—to create disentangled representations in the full-dimensional representational (pixel) space can be highly data-efficient and effective at learning to compose in OOD regions. These findings underscore that bottlenecks with factorized/disentangled representations in an abstract representation space are insufficient: the model must actively maintain or induce factorization directly in the representational space in order to achieve robust compositional generalization.
Paperid:529
Authors:Scott Emmons · Caspar Oesterheld · Vincent Conitzer · Stuart Russell
Abstract:
We study partially observable assistance games (POAGs), a model of the human-AI value alignment problem which allows the human and the AI assistant to have partial observations. Motivated by concerns of AI deception, we study a qualitatively new phenomenon made possible by partial observability: would an AI assistant ever have an incentive to interfere with the human's observations? First, we prove that sometimes an optimal assistant must take observation-interfering actions, even when the human is playing optimally, and even when there are otherwise-equivalent actions available that do not interfere with observations. Though this result seems to contradict the classic theorem from single-agent decision making that the value of perfect information is nonnegative, we resolve this seeming contradiction by developing a notion of interference defined on entire policies. This can be viewed as an extension of the classic result that the value of perfect information is nonnegative into the cooperative multiagent setting. Second, we prove that if the human is simply making decisions based on their immediate outcomes, the assistant might need to interfere with observations as a way to query the human's preferences. We show that this incentive for interference goes away if the human is playing optimally, or if we introduce a communication channel for the human to communicate their preferences to the assistant. Third, we show that if the human acts according to the Boltzmann model of irrationality, this can create an incentive for the assistant to interfere with observations. Finally, we use an experimental model to analyze tradeoffs faced by the AI assistant in practice when considering whether or not to take observation-interfering actions.
Paperid:530
Authors:Jianze Li · Jiezhang Cao · Yong Guo · Wenbo Li · Yulun Zhang
Abstract:
Diffusion models (DMs) have significantly advanced the development of real-world image super-resolution (Real-ISR), but the computational cost of multi-step diffusion models limits their application. One-step diffusion models generate high-quality images in a single sampling step, greatly reducing computational overhead and inference latency. However, most existing one-step diffusion methods are constrained by the performance of the teacher model, where poor teacher performance results in image artifacts. To address this limitation, we propose FluxSR, a novel one-step diffusion Real-ISR technique based on flow matching models. We use the state-of-the-art diffusion model FLUX.1-dev as both the teacher model and the base model. First, we introduce Flow Trajectory Distillation (FTD) to distill a multi-step flow matching model into a one-step Real-ISR model. Second, to improve image realism and address high-frequency artifact issues in generated images, we propose TV-LPIPS as a perceptual loss and introduce Attention Diversification Loss (ADL) as a regularization term to reduce token similarity in the transformer, thereby eliminating high-frequency artifacts. Comprehensive experiments demonstrate that our method outperforms existing one-step diffusion-based Real-ISR methods. The code and models will be available.
Paperid:531
Authors:Jan Pauls · Max Zimmer · Berkant Turan · Sassan Saatchi · Philippe CIAIS · Sebastian Pokutta · Fabian Gieseke
Abstract:
With the rise in global greenhouse gas emissions, accurate large-scale tree canopy height maps are essential for understanding forest structure, estimating above-ground biomass, and monitoring ecological disruptions. To this end, we present a novel approach to generate large-scale, high-resolution canopy height maps over time. Our model accurately predicts canopy height over multiple years given Sentinel 2 time series satellite data. Using GEDI LiDAR data as the ground truth for training the model, we present the first 10 m resolution temporal canopy height map of the European continent for the period 2019–2022. As part of this product, we also offer a detailed canopy height map for 2020, providing more precise estimates than previous studies. Our pipeline and the resulting temporal height map are publicly available, enabling comprehensive large-scale monitoring of forests and, hence, facilitating future research and ecological analyses. For an interactive viewer, see https://europetreemap.projects.earthengine.app/view/europeheight.
Paperid:532
Authors:Shuting He · Guangquan Jie · Changshuo Wang · Yun Zhou · Shuming Hu · Guanbin Li · Henghui Ding
Abstract:
We introduce Referring 3D Gaussian Splatting Segmentation (R3DGS), a new task that focuses on segmenting target objects in a 3D Gaussian scene based on natural language descriptions. This task requires the model to identify newly described objects that may be occluded or not directly visible in a novel view, posing a significant challenge for 3D multi-modal understanding. Developing this capability is crucial for advancing embodied AI. To support research in this area, we construct the first R3DGS dataset, Ref-LERF. Our analysis reveals that 3D multi-modal understanding and spatial relationship modeling are key challenges for R3DGS. To address these challenges, we propose ReferSplat, a framework that explicitly models 3D Gaussian points with natural language expressions in a spatially aware paradigm. ReferSplat achieves state-of-the-art performance on both the newly proposed R3DGS task and 3D open-vocabulary segmentation benchmarks. Code, trained models, and the dataset will be publicly released.
Paperid:533
Authors:Simon Park · Abhishek Panigrahi · Yun Cheng · Dingli Yu · Anirudh Goyal · Sanjeev Arora
Abstract:
Vision Language Models (VLMs) are impressive at visual question answering and image captioning. But they underperform on multi-step visual reasoning---even compared to LLMs on the same tasks presented in text form---giving rise to perceptions of "modality imbalance" or "brittleness". Towards a systematic study of such issues, we introduce a synthetic framework for assessing the ability of VLMs to perform algorithmic visual reasoning, comprising three tasks: Table Readout, Grid Navigation, and Visual Analogy. Each has two levels of difficulty, SIMPLE and HARD, and even the SIMPLE versions are difficult for frontier VLMs. We propose strategies for training on the SIMPLE version of tasks that improve performance on the corresponding HARD task, i.e., simple-to-hard (S2H) generalization. This controlled setup, where each task also has an equivalent text-only version, allows a quantification of the modality imbalance and how it is impacted by training strategy. We show that 1) explicit image-to-text conversion is important in promoting S2H generalization on images, by transferring reasoning from text; 2) conversion can be internalized at test time. We also report results of a mechanistic study of this phenomenon. We identify measures of gradient alignment that can identify training strategies that promote better S2H generalization. Ablations highlight the importance of chain-of-thought.
Paperid:534
Authors:Ying Nie · Binwei Yan · Tianyu Guo · Zheyuan Bai · Mengyu Zheng · Hanting Chen · Jing Han
Abstract:
Despite recent advancements of fine-tuning open-source large language models (LLMs) to facilitate agent tasks, parameter-efficient fine-tuning (PEFT) methodologies for agents remain largely unexplored. In this paper, we introduce three key strategies for PEFT in agent tasks: 1) Inspired by the increasingly dominant Reason+Action paradigm, we first decompose the capabilities necessary for the agent tasks into three distinct roles: reasoner, executor, and summarizer. The reasoner is responsible for comprehending the user's query and determining the next role based on the execution trajectory. The executor is tasked with identifying the appropriate functions and parameters to invoke. The summarizer conveys the distilled information from conversations back to the user. 2) We then propose the Mixture-of-Roles (MoR) framework, which comprises three specialized Low-Rank Adaptation (LoRA) groups, each designated to fulfill a distinct role. By focusing on their respective specialized capabilities and engaging in collaborative interactions, these LoRAs collectively accomplish the overall agent task. 3) To effectively fine-tune the framework, we develop a multi-role data generation pipeline based on publicly available datasets, incorporating role-specific content completion and reliability verification. We conduct extensive experiments and thorough ablation studies on various LLMs and agent benchmarks, demonstrating the effectiveness of the proposed method.
Paperid:535
Authors:Yi-Fan Zhang · Min-Ling Zhang
Abstract:
Commonly used evaluation metrics in multi-label learning all involve base loss functions, and the theoretical guarantees of multi-label learning often rely on the properties of base loss functions. Some recent theoretical works have used the Lipschitz continuity of base loss functions to prove the generalization bounds for multi-label learning, but the impact of the smoothness of base loss functions on the generalization bounds is completely unknown. In an attempt to make up for this gap in the generalization theory of multi-label learning, we develop some novel vector-contraction inequalities for smooth base loss functions and derive tight generalization bounds with no dependency on the number of labels, up to logarithmic terms. We then exploit local Rademacher complexity to develop some novel local vector-contraction inequalities for smooth base loss functions, which induce generalization bounds with a tighter dependency on the number of labels and a faster convergence rate with respect to the number of examples. In addition, we derive tight generalization bounds with no dependency on the number of labels, up to logarithmic terms, for Macro-Averaged AUC by exploiting the Lipschitz continuity and smoothness of base loss functions, respectively. Our state-of-the-art theoretical results provide general theoretical guarantees for the generalization of multi-label learning.
Paperid:536
Authors:Lukas Lao Beyer · Tianhong Li · Xinlei Chen · Sertac Karaman · Kaiming He
Abstract:
Commonly used image tokenizers produce a 2D grid of spatially arranged tokens. In contrast, so-called "1D" image tokenizers represent images as highly compressed one-dimensional sequences of as few as 32 discrete tokens. We find that the high degree of compression achieved by a 1D tokenizer with vector quantization enables image editing and generative capabilities through heuristic manipulation of tokens, demonstrating that even very crude manipulations -- such as copying and replacing tokens between latent representations of images -- enable fine-grained image editing by transferring appearance and semantic attributes. Motivated by the expressivity of the 1D tokenizer's latent space, we construct an image generation pipeline leveraging gradient-based test-time optimization of tokens with plug-and-play loss functions such as reconstruction or CLIP similarity. Our approach is demonstrated for inpainting and text-guided image editing use cases, and can generate diverse and realistic samples without requiring training of any generative model.
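The "crude manipulations" mentioned, copying tokens between latent sequences, amount to a one-line operation. A hypothetical sketch (the function name is ours; in a real pipeline the edited sequence would then be decoded back to pixels with the tokenizer's decoder, not shown here):

```python
def transfer_tokens(src_tokens, tgt_tokens, positions):
    """Crude 1D-latent edit: copy the tokens at the given positions
    from a source image's token sequence into a target's, leaving the
    rest intact. Illustrative of the token-copying edits described."""
    out = list(tgt_tokens)
    for p in positions:
        out[p] = src_tokens[p]
    return out
```

Because the sequence is short (as few as 32 tokens), each copied token can carry a large share of the image's appearance or semantics, which is what makes such blunt edits surprisingly effective.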
Paperid:537
Authors:Ziang Chen · Xiaohan Chen · Jialin Liu · Xinshang Wang · Wotao Yin
Abstract:
Quadratic programming (QP) is the most widely applied category of problems in nonlinear programming. Many applications require real-time/fast solutions, though not necessarily with high precision. Existing methods either involve matrix decomposition or use the preconditioned conjugate gradient method. For relatively large instances, these methods cannot achieve the real-time requirement unless there is an effective preconditioner. Recently, graph neural networks (GNNs) opened new possibilities for QP. Some promising empirical studies of applying GNNs for QP tasks show that GNNs can capture key characteristics of an optimization instance and provide adaptive guidance accordingly to crucial configurations during the solving process, or directly provide an approximate solution. However, the theoretical understanding of GNNs in this context remains limited. Specifically, it is unclear what GNNs can and cannot achieve for QP tasks in theory. This work addresses this gap in the context of linearly constrained QP tasks. In the continuous setting, we prove that message-passing GNNs can universally represent fundamental properties of quadratic programs, including feasibility, optimal objective values, and optimal solutions. In the more challenging mixed-integer setting, while GNNs are not universal approximators, we identify a subclass of QP problems that GNNs can reliably represent.
Paperid:538
Authors:Christopher Scarvelis · David Benhaim · Paul Zhang
Abstract:
Orientation estimation is a fundamental task in 3D shape analysis which consists of estimating a shape's orientation axes: its side-, up-, and front-axes. Using this data, one can rotate a shape into canonical orientation, where its orientation axes are aligned with the coordinate axes. Developing an orientation algorithm that reliably estimates complete orientations of general shapes remains an open problem. We introduce a two-stage orientation pipeline that achieves state-of-the-art performance on up-axis estimation and further demonstrate its efficacy on full-orientation estimation, where one seeks all three orientation axes. Unlike previous work, we train and evaluate our method on all of ShapeNet rather than a subset of classes. We motivate our engineering contributions by theory describing fundamental obstacles to orientation estimation for rotationally-symmetric shapes, and show how our method avoids these obstacles.
Paperid:539
Authors:Jongwoo Ko · Tianyi Chen · Sungnyun Kim · Tianyu Ding · Luming Liang · Ilya Zharkov · Se-Young Yun
Abstract:
Despite the success of distillation in large language models (LLMs), most prior work applies identical loss functions to both teacher- and student-generated data. These strategies overlook the synergy between loss formulations and data types, leading to a suboptimal performance boost in student models. To address this, we propose DistiLLM-2, a contrastive approach that simultaneously increases the likelihood of teacher responses and decreases that of student responses by harnessing this synergy. Our extensive experiments show that DistiLLM-2 not only builds high-performing student models across a wide range of tasks, including instruction-following and code generation, but also supports diverse applications, such as preference alignment and vision-language extensions. These findings highlight the potential of a contrastive approach to enhance the efficacy of LLM distillation by effectively aligning teacher and student models across varied data types.
Paperid:540
Authors:Tom A. Lamb · Adam Davies · Alasdair J Paren · Phil Torr · Francesco Pinto
Abstract:
Despite the success of Instruction Tuning (IT) in training large language models (LLMs) to perform arbitrary user-specified tasks, these models often still leverage spurious or biased features learned from their training data, leading to undesired behaviours when deploying them in new contexts. In this work, we introduce Focus Instruction Tuning (FIT), which trains LLMs to condition their responses by focusing on specific features whilst ignoring others, leading to different behaviours based on what features are specified. Across several experimental settings, we show that focus-tuned models can be adaptively steered by focusing on different features at inference-time: for instance, robustness can be improved by focusing on core task features and ignoring spurious features, and social bias can be mitigated by ignoring demographic categories. Furthermore, FIT can steer behaviour in new contexts, generalising under distribution shift and to new unseen features at inference time, and thereby facilitating more robust, fair, and controllable LLM applications in real-world environments.
Paperid:541
Authors:Edith Cohen · Mihir Singhal · Uri Stemmer
Abstract:
Cardinality sketches are compact data structures that efficiently estimate the number of distinct elements across multiple queries while minimizing storage, communication, and computational costs. However, recent research has shown that these sketches can fail under {\em adaptively chosen queries}, breaking down after approximately $\tilde{O}(k^2)$ queries, where $k$ is the sketch size. In this work, we overcome this \emph{quadratic barrier} by designing robust estimators with fine-grained guarantees. Specifically, our constructions can handle an {\em exponential number of adaptive queries}, provided that each element participates in at most $\tilde{O}(k^2)$ queries. This effectively shifts the quadratic barrier from the total number of queries to the number of queries {\em sharing the same element}, which can be significantly smaller. Beyond cardinality sketches, our approach expands the toolkit for robust algorithm design.
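For background, a classic (non-robust) cardinality sketch of the kind this line of work builds on is the k-minimum-values estimator. The sketch below is standard textbook material, not the paper's robust construction: hash each element to $(0,1]$, keep the $k$ smallest hashes, and estimate the distinct count as $(k-1)/h_k$ where $h_k$ is the $k$-th smallest hash.

```python
import hashlib

def _h(x):
    """Hash an element to a float in (0, 1]."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64

def kmv_estimate(stream, k=64):
    """K-minimum-values cardinality sketch: keep the k smallest hash
    values; if the k-th smallest is h_k, estimate ~ (k-1)/h_k distinct
    elements. Non-robust under adaptive queries, which is exactly the
    failure mode the paper addresses."""
    mins = sorted({_h(x) for x in stream})[:k]
    if len(mins) < k:
        return len(mins)          # fewer than k distinct: count exactly
    return int((k - 1) / mins[-1])
```

With $k$ hash values the relative error is roughly $1/\sqrt{k}$; the adaptive-query failures arise when a querier can learn and exploit the hash function through repeated estimates.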
Paperid:542
Authors:chao ying · Jun Jin · Yi Guo · Xiudi Li · Muxuan Liang · Jiwei Zhao
Abstract:
Collecting gold-standard phenotype data via manual extraction is typically labor-intensive and slow, whereas automated computational phenotypes (ACPs) offer a systematic and much faster alternative. However, simply replacing the gold-standard with ACPs, without acknowledging their differences, could lead to biased results and misleading conclusions. Motivated by the complexity of incorporating ACPs while maintaining the validity of downstream analyses, in this paper, we consider a semi-supervised learning setting that consists of both labeled data (with gold-standard) and unlabeled data (without gold-standard), under the covariate shift framework. We develop doubly robust and semiparametrically efficient estimators that leverage ACPs for general target parameters in the unlabeled and combined populations. In addition, we carefully analyze the efficiency gains achieved by incorporating ACPs, comparing scenarios with and without their inclusion. Notably, we identify that ACPs for the unlabeled data, instead of for the labeled data, drive the enhanced efficiency gains. To validate our theoretical findings, we conduct extensive synthetic experiments and apply our method to a real-world dataset, confirming the practical advantages of our approach.
Paperid:543
Authors:Jiujiang Guo · Mankun Zhao · Wenbin Zhang · Tianyi Xu · Linying Xu · Yu Jian · Yu Mei · Yu Ruiguo
Abstract:
Existing research on temporal knowledge graph completion treats temporal information as supplementary, without modeling the varied characteristics of facts from a temporal perspective. This work summarizes the features of temporalized facts from both diachronic and synchronic perspectives: (1) Diachronicity. Facts often exhibit varying characteristics and trends across different temporal domains; (2) Synchronicity. In specific temporal contexts, various relations between entities influence each other, generating latent semantics. To address these issues, we design a quaternion-based model, TeDS, which divides timestamps into diachronic and synchronic timestamps to support dual temporal perception: (a) Two composite quaternions fusing time and relation information are generated by reorganizing the synchronic timestamp and relation quaternions, and the Hamilton operator realizes their interaction. (b) Each time point is sequentially mapped to an angle and converted into the scalar component of a quaternion using trigonometric functions to build the diachronic timestamp. We then rotate the relation by applying the Hamilton operator between it and the diachronic timestamp. In this way, TeDS achieves deep integration of relations and time while accommodating both perspectives. Empirically, TeDS significantly outperforms SOTA models on six benchmarks.
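The Hamilton operator referred to here is the standard (non-commutative) quaternion product; a minimal reference implementation (the embedding and scoring details are the paper's and are not reproduced):

```python
def hamilton(q, p):
    """Hamilton product of quaternions (a, b, c, d) = a + bi + cj + dk.
    Non-commutative, which is what makes quaternion 'rotations' of
    relation embeddings order-sensitive."""
    a1, b1, c1, d1 = q
    a2, b2, c2, d2 = p
    return (a1*a2 - b1*b2 - c1*c2 - d1*d2,
            a1*b2 + b1*a2 + c1*d2 - d1*c2,
            a1*c2 - b1*d2 + c1*a2 + d1*b2,
            a1*d2 + b1*c2 - c1*b2 + d1*a2)
```

The identities i*j = k but j*i = -k show the non-commutativity that the model exploits when composing relation and timestamp quaternions.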
Paperid:544
Authors:Gaurav Menghani · Ravi Kumar · Sanjiv Kumar
Abstract:
One of the core pillars of efficient deep learning methods is architectural improvements, such as residual/skip connections, which have led to significantly better model convergence and quality. Since their introduction, residual connections have become ubiquitous not only in convolutional neural networks but also in transformer-based architectures, the backbone of LLMs. In this paper, we introduce the Learned Augmented Residual Layer (LAuReL)---a novel generalization of the canonical residual connection---designed to serve as an in-situ replacement while outperforming it in both model quality and footprint metrics. Our experiments show that LAuReL can enhance model quality for both vision and language models while adding fewer parameters and incurring less latency and memory overhead than naively increasing parameter count. For example, on the ImageNet-1K task, LAuReL achieves the same model quality improvements as naively adding an extra layer while using 2.6x fewer parameters. Similarly, when pre-training 1B and 4B parameter LLMs, LAuReL improves performance on a variety of challenging downstream evaluation tasks by 2.54% to 20.05%, while adding only 0.012% and 0.1% additional parameters, respectively.
Paperid:545
Authors:Yejiang Wang · Yuhai Zhao · Zhengkui Wang · Wen Shan · Ling Li · Qian Li · Miaomiao Huang · Meixia Wang · Shirui Pan · Xingwei Wang
Abstract:
Graphs, fundamental in modeling various research subjects, consist of nodes linked by edges. However, they typically function as components within larger structures in real-world scenarios, such as in protein-protein interactions where each protein is a graph in a larger network. This study delves into the Graph-of-Net (GON), a structure that extends the concept of traditional graphs by representing each node as a graph itself. It provides a multi-level perspective on the relationships between objects, encapsulating both the detailed structure of individual nodes and the broader network of dependencies. To learn node representations within the GON, we propose a position-aware neural network for Graph-of-Net that processes both intra-graph and inter-graph connections and incorporates additional data like node labels. Our model employs dual encoders and graph constructors to build and refine a constraint network, where nodes are adaptively arranged based on their positions, as determined by the network's constraint system. Our model demonstrates significant improvements over baselines in empirical evaluations on various datasets.
Paperid:546
Authors:Ruiyuan Huang · Zengfeng Huang
Abstract:
Motivated by applications in online bidding and sleeping bandits, we examine the problem of contextual bandits with cross learning, where the learner observes the loss associated with the action across all possible contexts, not just the current round’s context. Our focus is on a setting where losses are chosen adversarially, and contexts are sampled i.i.d. from a specific distribution. This problem was first studied by Balseiro et al. (2019), who proposed an algorithm that achieves near-optimal regret under the assumption that the context distribution is known in advance. However, this assumption is often unrealistic. To address this issue, Schneider & Zimmert (2023) recently proposed a new algorithm that achieves nearly optimal expected regret. It is well-known that expected regret can be significantly weaker than high-probability bounds. In this paper, we present a novel, in-depth analysis of their algorithm and demonstrate that it actually achieves near-optimal regret with high probability. There are steps in the original analysis by Schneider & Zimmert (2023) that lead only to an expected bound by nature. In our analysis, we introduce several new insights. Specifically, we make extensive use of the weak dependency structure between different epochs, which was overlooked in previous analyses. Additionally, standard martingale inequalities are not directly applicable, so we refine martingale inequalities to complete our analysis.
Paperid:547
Authors:Kian Kenyon-Dean · Zitong Jerry Wang · John Urbanik · Konstantin Donhauser · Jason Hartford · Saber Saberian · Nil Sahin · Ihab Bendidi · Safiye Celik · Juan Vera · Marta Fay · Imran Haque · Oren Kraus
Abstract:
Deriving insights from experimentally generated datasets requires methods that can account for random and systematic measurement errors and remove them in order to accurately represent the underlying effects of the conditions being tested. Here we present a framework for pretraining on large-scale microscopy datasets that includes three steps: (1) curating a set of diverse and self-consistent training samples, (2) scaling training of an appropriate foundation model architecture on this dataset, (3) evaluating intermediate layers of the trained model to identify the best representation for downstream tasks. Using this strategy, we present the largest foundation model for cell microscopy data to our knowledge, a new 1.9 billion-parameter ViT-G/8 MAE trained on over 8 billion microscopy image crops. Compared to a previously published ViT-L/8 MAE, our new model achieves a 60% improvement in linear separability of genetic perturbations and obtains the best overall performance on whole-genome relationship recall, batch correction replicate consistency, and compound-gene activity prediction benchmarks.
Paperid:548
Authors:Jae-Hong Lee
Abstract:
Test-time adaptation (TTA) addresses the machine learning challenge of adapting models to unlabeled test data from shifting distributions in dynamic environments. A key issue in this online setting arises from using unsupervised learning techniques, which introduce explicit gradient noise that degrades model weights. To address weight degradation, we propose a Bayesian weight enhancement framework, which generalizes existing weight-based TTA methods that effectively mitigate the issue. Our framework enables robust adaptation to distribution shifts by accounting for diverse weights through modeling weight distributions. Building on our framework, we identify a key limitation in existing methods: their neglect of the time-varying covariance, which reflects the influence of the gradient noise. To address this gap, we propose a novel steady-state adaptation (SSA) algorithm that balances covariance dynamics during adaptation. SSA is derived through the solution of a stochastic differential equation for the TTA process and online inference. The resulting algorithm incorporates a covariance-aware learning rate adjustment mechanism. Through extensive experiments, we demonstrate that SSA consistently improves state-of-the-art methods in various TTA scenarios, datasets, and model architectures, establishing its effectiveness in terms of stability and adaptability.
Paperid:549
Authors:Boyang Sun · Yu Yao · Xinshuai Dong · Zongfang Liu · Tongliang Liu · Yumou Qiu · Kun Zhang
Abstract:
The conditional independence (CI) test is a fundamental concept in statistics. In many real-world scenarios, some variables may be difficult to measure accurately, often leading to data being represented as discretized values. Applying CI tests directly to discretized data, however, can lead to incorrect conclusions about the independence of latent variables. To address this, recent advancements have sought to infer the correct CI relationship between the latent variables by binarizing the observed data. However, this process results in a loss of information, which degrades the test's performance, particularly with small sample sizes. Motivated by this limitation, this paper introduces a new sample-efficient CI test that does not rely on the binarization process. We find that the relationship can be established by addressing an \textit{over-identifying} restriction problem with the \textit{Generalized Method of Moments} (GMM). Based on this finding, we have designed a new test statistic, and its asymptotic distribution has been derived. Empirical results across various datasets show that our method consistently outperforms existing ones.
Paperid:550
Authors:Subhash Kantamneni · Josh Engels · Senthooran Rajamanoharan · Max Tegmark · Neel Nanda
Abstract:
Sparse autoencoders (SAEs) are a popular method for interpreting concepts represented in large language model (LLM) activations. However, there is a lack of evidence regarding the validity of their interpretations due to the lack of a ground truth for the concepts used by an LLM, and a growing number of works have presented problems with current SAEs. One alternative source of evidence would be demonstrating that SAEs improve performance on downstream tasks beyond existing baselines. We test this by applying SAEs to the real-world task of LLM activation probing in four regimes: data scarcity, class imbalance, label noise, and covariate shift. Due to the difficulty of detecting concepts in these challenging settings, we hypothesize that SAEs’ basis of interpretable, concept-level latents should provide a useful inductive bias. However, although SAEs occasionally perform better than baselines on individual datasets, we are unable to design ensemble methods combining SAEs with baselines that consistently outperform ensemble methods solely using baselines. Additionally, although SAEs initially appear promising for identifying spurious correlations, detecting poor dataset quality, and training multi-token probes, we are able to achieve similar results with simple non-SAE baselines as well. Though we cannot discount SAEs’ utility on other tasks, our findings highlight the shortcomings of current SAEs and the need to rigorously evaluate interpretability methods on downstream tasks with strong baselines.
Paperid:551
Authors:Guixiang Wang · Jianjun Li
Abstract:
Motion transfer transfers the pose in a driving video to the object of a source image, so that the object of the source image moves. Although great progress has been made recently in unsupervised motion transfer, many unsupervised methods still struggle to accurately model large displacement motions when large motion differences occur between the source and driving images. To solve this problem, we propose an unsupervised anytime-interpolation-based large displacement motion transfer method, which can generate a series of anytime interpolated images between the source and driving images. By decomposing a large displacement motion into many small displacement motions, the difficulty of large displacement motion estimation is reduced. In the process, we design a selector that can choose the optimal interpolated image from the generated interpolated images for downstream tasks. Since there are no real images as labels in the interpolation process, we propose a bidirectional training strategy. Some constraints are added to the optimal interpolated image to generate a reasonable interpolated image. To encourage the network to generate high-quality images, a pre-trained Vision Transformer model is used to design constraint losses. Finally, experiments show that, compared with the large displacement motion between the source and driving images, the small displacement motion between the interpolated and driving images makes motion transfer easier to realize. Compared with existing state-of-the-art methods, our method achieves significant improvements in motion-related metrics.
Paperid:552
Authors:Javan Tahir · Surya Ganguli · Grant Rotskoff
Abstract:
With the emergence of large-scale pre-trained neural networks, methods to adapt such foundation models to data-limited downstream tasks have become a necessity. Fine-tuning, preference optimization, and transfer learning have all been successfully employed for these purposes when the target task closely resembles the source task, but a precise theoretical understanding of task similarity is still lacking. We adopt a feature-centric viewpoint on transfer learning and establish a number of theoretical results that demonstrate that when the target task is well represented by the feature space of the pre-trained model, transfer learning outperforms training from scratch. We study deep linear networks as a minimal model of transfer learning in which we can analytically characterize the transferability phase diagram as a function of the target dataset size and the feature space overlap. For this model, we establish rigorously that when the feature space overlap between the source and target tasks is sufficiently strong, both linear transfer and fine-tuning improve performance, especially in the low data limit. These results build on an emerging understanding of feature learning dynamics in deep linear networks, and we demonstrate numerically that the rigorous results we derive for the linear case also apply to nonlinear networks.
Paperid:553
Authors:David Reber · Sean Richardson · Todd Nief · Cristina Garbacea · Victor Veitch
Abstract:
Reward models are widely used as proxies for human preferences when aligning or evaluating LLMs. However, reward models are black boxes, and it is often unclear what, exactly, they are actually rewarding. In this paper we develop the Rewrite-based Attribute Treatment Estimator (RATE) as an effective method for measuring the sensitivity of a reward model to high-level attributes of responses, such as sentiment, helpfulness, or complexity. Importantly, RATE measures the causal effect of an attribute on the reward. RATE uses LLMs to rewrite responses to produce imperfect counterfactual examples that can be used to measure causal effects. A key challenge is that these rewrites are imperfect in a manner that can induce substantial bias in the estimated sensitivity of the reward model to the attribute. The core idea of RATE is to adjust for this imperfect-rewrite effect by rewriting twice. We establish the validity of the RATE procedure and show empirically that it is an effective estimator.
Paperid:554
Authors:Yung-Sung Chuang · Benjamin Cohen-Wang · Shannon Shen · Zhaofeng Wu · Hu Xu · Xi Victoria Lin · Shang-Wen Li · James Glass · Scott Yih
Abstract:
We introduce SelfCite, a novel self-supervised approach that aligns LLMs to generate high-quality, fine-grained, sentence-level citations for the statements in their generated responses. Instead of only relying on costly and labor-intensive annotations, SelfCite leverages a reward signal provided by the LLM itself through context ablation: If a citation is necessary, removing the cited text from the context should prevent the same response; if sufficient, retaining the cited text alone should preserve the same response. This reward can guide the inference-time best-of-N sampling strategy to improve citation quality significantly, as well as be used in preference optimization to directly fine-tune the models for generating better citations. The effectiveness of SelfCite is demonstrated by increasing citation F1 up to 5.3 points on the LongBench-Cite benchmark across five long-form question answering tasks.
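The necessity/sufficiency logic of context ablation can be sketched with a toy scorer in place of an LLM log-probability. Everything here is a stand-in: `toy_logprob`, the example strings, and the way the two signals are combined are illustrative assumptions, not the paper's actual reward.

```python
def toy_logprob(response, context):
    """Toy stand-in for an LLM log-probability: the fraction of response
    words that appear in the context. Not a real language model."""
    ctx = set(context.split())
    words = response.split()
    return sum(w in ctx for w in words) / len(words)

context = "the moon orbits the earth and the earth orbits the sun"
cited = "the earth orbits the sun"
response = "earth orbits sun"

full = toy_logprob(response, context)                          # cite in context
without_cited = toy_logprob(response, context.replace(cited, ""))  # cite removed
cited_only = toy_logprob(response, cited)                      # cite alone

# Necessity: removing the cited text should hurt the response's likelihood.
# Sufficiency: the cited text alone should keep the likelihood high.
necessity = full - without_cited
sufficiency = cited_only - full
reward = necessity + sufficiency
print(reward > 0)  # a useful citation earns a positive reward here
```

In the real method these scores come from the LLM's own probabilities over its generated response, which is what makes the reward self-supervised.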
Paperid:555
Authors:Zhitong Xu · Da Long · Yiming Xu · Guang Yang · Shandian Zhe · Houman Owhadi
Abstract:
We introduce a novel kernel learning framework toward efficiently solving nonlinear partial differential equations (PDEs). In contrast to the state-of-the-art kernel solver that embeds differential operators within kernels, posing challenges with a large number of collocation points, our approach eliminates these operators from the kernel. We model the solution using a standard kernel interpolation form and differentiate the interpolant to compute the derivatives. Our framework obviates the need for complex Gram matrix construction between solutions and their derivatives, allowing for a straightforward implementation and scalable computation. As an instance, we allocate the collocation points on a grid and adopt a product kernel, which yields a Kronecker product structure in the interpolation. This structure enables us to avoid computing the full Gram matrix, reducing costs and scaling efficiently to a large number of collocation points. We provide a proof of the convergence and rate analysis of our method under appropriate regularity assumptions. In numerical experiments, we demonstrate the advantages of our method in solving several benchmark PDEs.
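The Kronecker structure the abstract mentions can be verified directly: with grid collocation points and a product kernel, the full Gram matrix factors as the Kronecker product of the per-axis 1D Gram matrices. The RBF kernel and toy grid sizes below are illustrative choices, not the paper's setup.

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """1D squared-exponential kernel matrix between point sets a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

# Collocation points on a 2D grid with a product kernel:
# k((x1, y1), (x2, y2)) = kx(x1, x2) * ky(y1, y2).
xs = np.linspace(0, 1, 4)
ys = np.linspace(0, 1, 3)
Kx, Ky = rbf(xs, xs), rbf(ys, ys)

# Direct construction of the full Gram matrix over all 12 grid points
# (x-major ordering, matching the Kronecker convention) ...
grid = np.array([(x, y) for x in xs for y in ys])
K_full = np.array([
    [rbf(np.array([a[0]]), np.array([b[0]]))[0, 0] *
     rbf(np.array([a[1]]), np.array([b[1]]))[0, 0]
     for b in grid]
    for a in grid
])

# ... equals the Kronecker product of the small 1D factors, which is what
# lets a solver avoid ever forming or factorizing the full matrix.
print(np.allclose(K_full, np.kron(Kx, Ky)))
```

Operations with `kron(Kx, Ky)` (matrix-vector products, solves via per-factor eigendecompositions) cost roughly the sum, not the product, of the per-axis sizes, which is the source of the claimed scalability.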
Paperid:556
Authors:Khai Nguyen · Hai Nguyen · Tuan Pham · Nhat Ho
Abstract:
We introduce the sliced optimal transport dataset distance (s-OTDD), a model-agnostic, embedding-agnostic approach for dataset comparison that requires no training, is robust to variations in the number of classes, and can handle disjoint label sets. The core innovation is Moment Transform Projection (MTP), which maps a label, represented as a distribution over features, to a real number. Using MTP, we derive a data point projection that transforms datasets into one-dimensional distributions. The s-OTDD is defined as the expected Wasserstein distance between the projected distributions, with respect to random projection parameters. Leveraging the closed-form solution of one-dimensional optimal transport, s-OTDD achieves (near-)linear computational complexity in the number of data points and feature dimensions and is independent of the number of classes. With its geometrically meaningful projection, s-OTDD strongly correlates with the optimal transport dataset distance while being more efficient than existing dataset discrepancy measures. Moreover, it correlates well with the performance gap in transfer learning and classification accuracy in data augmentation.
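The closed-form 1D Wasserstein distance that makes this approach cheap amounts to matching sorted samples. The projection below is a plain random linear map used as a toy stand-in for MTP (which additionally encodes labels), so only the slicing mechanics are shown, not the paper's actual projection.

```python
import numpy as np

rng = np.random.default_rng(0)

def wasserstein_1d(u, v, p=2):
    """Closed-form p-Wasserstein distance between two 1D empirical
    distributions with equal sample counts: match sorted samples."""
    u, v = np.sort(u), np.sort(v)
    return np.mean(np.abs(u - v) ** p) ** (1 / p)

def project(X, theta):
    # Toy stand-in for one random data-point projection; the real MTP also
    # maps labels (distributions over features) to reals.
    return X @ theta

# Two toy "datasets" of 500 feature vectors, the second shifted by 0.5.
X1 = rng.standard_normal((500, 5))
X2 = rng.standard_normal((500, 5)) + 0.5

theta = rng.standard_normal(5)
theta /= np.linalg.norm(theta)          # one random slicing direction

d = wasserstein_1d(project(X1, theta), project(X2, theta))
print(d > 0)  # shifted datasets are a positive distance apart
```

The sliced distance averages this quantity over many random projection parameters; each slice costs only a sort, giving near-linear complexity in the number of points.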
Paperid:557
Authors:Jiawei Ge · Yuanhao Wang · Wenzhe Li · Chi Jin
Abstract:
This paper examines multiplayer symmetric constant-sum games with more than two players in a competitive setting, such as Mahjong, Poker, and various board and video games. In contrast to two-player zero-sum games, equilibria in multiplayer games are neither unique nor non-exploitable, failing to provide meaningful guarantees when competing against opponents who play different equilibria or non-equilibrium strategies. This gives rise to a series of long-lasting fundamental questions in multiplayer games regarding suitable objectives, solution concepts, and principled algorithms. This paper takes an initial step towards addressing these challenges by focusing on the natural objective of *equal share*—securing an expected payoff of $C/n$ in an $n$-player symmetric game with a total payoff of $C$. We rigorously identify the theoretical conditions under which achieving an equal share is tractable and design a series of efficient algorithms, inspired by no-regret learning, that *provably* attain approximate equal share across various settings. Furthermore, we provide complementary lower bounds that justify the sharpness of our theoretical results. Our experimental results highlight worst-case scenarios where meta-algorithms from prior state-of-the-art systems for multiplayer games fail to secure an equal share, while our algorithm succeeds, demonstrating the effectiveness of our approach.
Paperid:558
Authors:Supratim Shit · Gurmehak chadha · Surendra kumar · Bapi Chatterjee
Abstract:
A coreset, as a summary of the training data, offers an efficient approach for reducing data processing and storage complexity during training. In the emerging vertical federated learning (VFL) setting, where scattered clients store different data features, it directly reduces communication complexity. In this work, we introduce coreset constructions for regularized logistic regression in both centralized and VFL settings. Additionally, we improve the coreset size for regularized linear regression in the VFL setting. We also eliminate the dependency of the coreset size on a property of the data due to the VFL setting. The improvement in the coreset sizes is due to our novel coreset construction algorithms that capture the reduced model complexity due to the added regularization and its subsequent analysis. In experiments, we provide extensive empirical evaluation that backs our theoretical claims. We also compare the performance of our coresets in terms of the distance between the models trained on the complete data and on the coreset.
Paperid:559
Authors:Raja Sunkara · Ardhendu Tripathy
Abstract:
Applications such as engineering design often require us to optimize a black-box function, i.e., a system whose inner processing is not analytically known and whose gradients are not available. Practitioners often have a fixed budget for the number of function evaluations and the performance of an optimization algorithm is measured by its simple regret. In this paper, we study the class of ``Optimistic Optimization'' algorithms for black-box optimization that use a partitioning scheme for the domain. We develop algorithms that learn a good partitioning scheme and use flexible surrogate models such as neural networks in the optimization procedure. For multi-index functions on an $m$-dimensional subspace within $d$ dimensions, our algorithm attains $\tilde{O}(n^{-\beta / d})$ regret, where $\beta = 1 + \frac{d-m}{2m-1}$, as opposed to $\tilde{O}(n^{-1/d})$ for SequOOL, a state-of-the-art optimistic optimization algorithm. Our approach is competitive across a wide range of numerical benchmarks. Additionally, we introduce weight quantization in a large language model as a novel task for black-box optimization. Our approach improves the quality of Activation-aware Weight Quantization (AWQ) of the OPT-1.3B model, achieving an approximate 10\% improvement in performance relative to the best possible unquantized model.
Paperid:560
Authors:Nate Veldt · Thomas Stanley · Benjamin Priest · Trevor Steil · Keita Iwabuchi · T.S. Jayram · Geoffrey Sanders
Abstract:
Finding a minimum spanning tree (MST) for $n$ points in an arbitrary metric space is a fundamental primitive for hierarchical clustering and many other ML tasks, but this takes $\Omega(n^2)$ time to even approximate. We introduce a framework for metric MSTs that first (1) finds a forest of trees using practical heuristics, and then (2) finds a small weight set of edges to connect disjoint components in the forest into a spanning tree. We prove that optimally solving step (2) still takes $\Omega(n^2)$ time, but we provide a subquadratic 2.62-approximation algorithm. In the spirit of learning-augmented algorithms, we then show that if the heuristic forest found in step (1) overlaps with an optimal MST, we can approximate the original MST problem in subquadratic time, where the approximation factor depends on a measure of overlap. In practice, we find nearly optimal spanning trees for a wide range of metrics, while being orders of magnitude faster than exact algorithms.
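Step (2) of the framework can be illustrated with a simple greedy baseline: repeatedly add the cheapest edge joining two different forest components, Kruskal-style. This quadratic toy is only a baseline to show what "connecting the forest" means; it is not the paper's subquadratic 2.62-approximation, and the 1D metric is an illustrative assumption.

```python
from itertools import combinations

def connect_forest(points, components):
    """Connect disjoint forest components into one spanning structure by
    repeatedly adding the cheapest edge between two different components."""
    comp = {}
    for cid, members in enumerate(components):
        for v in members:
            comp[v] = cid

    def dist(a, b):
        return abs(points[a] - points[b])  # toy 1D metric

    added = []
    while len(set(comp.values())) > 1:
        # Cheapest edge whose endpoints lie in different components.
        a, b = min(
            ((u, w) for u, w in combinations(range(len(points)), 2)
             if comp[u] != comp[w]),
            key=lambda e: dist(*e),
        )
        added.append((a, b))
        old, new = comp[b], comp[a]
        for v in comp:                      # merge the two components
            if comp[v] == old:
                comp[v] = new
    return added

# Three heuristic forest components over five 1D points.
pts = [0.0, 0.1, 5.0, 5.2, 9.0]
edges = connect_forest(pts, [[0, 1], [2, 3], [4]])
print(len(edges))  # connecting k components always takes k - 1 edges
```

Connecting $k$ components always requires exactly $k-1$ edges; the hard part, which the paper addresses, is choosing them with small total weight without paying $\Omega(n^2)$ time.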
Paperid:561
Authors:Charita Dellaporta · Patrick O'Hara · Theodoros Damoulas
Abstract:
Decision making under uncertainty is challenging as the data-generating process (DGP) is often unknown. Bayesian inference proceeds by estimating the DGP through posterior beliefs on the model’s parameters. However, minimising the expected risk under these beliefs can lead to suboptimal decisions due to model uncertainty or limited, noisy observations. To address this, we introduce Distributionally Robust Optimisation with Bayesian Ambiguity Sets (DRO-BAS) which hedges against model uncertainty by optimising the worst-case risk over a posterior-informed ambiguity set. We provide two such sets, based on the posterior expectation (DRO-BAS(PE)) or the posterior predictive (DRO-BAS(PP)) and prove that both admit, under conditions, strong dual formulations leading to efficient single-stage stochastic programs which are solved with a sample average approximation. For DRO-BAS(PE), this covers all conjugate exponential family members while for DRO-BAS(PP) this is shown under conditions on the predictive's moment-generating function. Our DRO-BAS formulations outperform existing Bayesian DRO on the Newsvendor problem and achieve faster solve times with comparable robustness on the Portfolio problem.
Paperid:562
Authors:Weiqiu You · Helen Qu · Marco Gatti · Bhuvnesh Jain · Eric Wong
Abstract:
Self-attributing neural networks (SANNs) present a potential path towards interpretable models for high-dimensional problems, but often face significant trade-offs in performance. In this work, we formally prove a lower bound on errors of per-feature SANNs, whereas group-based SANNs can achieve zero error and thus high performance. Motivated by these insights, we propose Sum-of-Parts (SOP), a framework that transforms any differentiable model into a group-based SANN, where feature groups are learned end-to-end without group supervision. SOP achieves state-of-the-art performance for SANNs on vision and language tasks, and we validate that the groups are interpretable on a range of quantitative and semantic metrics. We further validate the utility of SOP explanations in model debugging and cosmological scientific discovery.
Paperid:563
Authors:Liam Hodgkinson · Zhichao Wang · Michael Mahoney
Abstract:
Recent theoretical and empirical successes in deep learning, including the celebrated neural scaling laws, are punctuated by observations that many objects of interest tend to exhibit some form of heavy-tailed or power law behavior. In particular, the prevalence of heavy-tailed spectral densities in the Jacobian, Hessian, and weight matrices of deep neural networks has recently been coined as heavy-tailed mechanistic universality (HT-MU). Empirical evidence suggests a robust correlation between power law exponents and model performance, indicating that HT-MU may be a fundamental aspect of deep learning efficacy. Here, we propose a general family of random feature matrix models---the high-temperature inverse-Wishart ensemble---to explore the attributes which give rise to heavy-tailed behavior in trained neural network models. Under this model, spectral densities with power laws on both tails arise through a combination of three independent factors: (i) complex correlation structures in the data; (ii) reduced temperatures during training; and (iii) reduced eigenvector entropy appearing as implicit bias in the model structure. Implications of our model on other appearances of heavy tails, including the five phases of learning, neural scaling laws, and optimizer trajectories, are also discussed.
Paperid:564
Authors:Dihan Zheng · Bo Huang
Abstract:
Fluorescence microscopy is ubiquitously used in cell biology research to characterize the cellular role of a protein. To help elucidate the relationship between the amino acid sequence of a protein and its cellular function, we introduce CELL-Diff, a unified diffusion model facilitating bidirectional transformations between protein sequences and their corresponding microscopy images. Utilizing reference cell morphology images and a protein sequence, CELL-Diff efficiently generates corresponding protein images. Conversely, given a protein image, the model outputs protein sequences. CELL-Diff integrates continuous and diffusion models within a unified framework and is implemented using a transformer-based network. We train CELL-Diff on the Human Protein Atlas (HPA) dataset and fine-tune it on the OpenCell dataset. Experimental results demonstrate that CELL-Diff outperforms existing methods in generating high-fidelity protein images, making it a practical tool for investigating subcellular protein localization and interactions.
Paperid:565
Authors:Mengmeng Ma · Tang Li · Yunxiang Peng · LIN LU · Volkan Beylergil · Binsheng Zhao · Oguz Akin · Xi Peng
Abstract:
Medical AI models excel at tumor detection and segmentation. However, their latent representations often lack explicit ties to clinical semantics, producing outputs less trusted in clinical practice. Most of the existing models generate either segmentation masks/labels (localizing where without why) or textual justifications (explaining why without where), failing to ground clinical concepts in spatially localized evidence. To bridge this gap, we propose to develop models that can justify the segmentation or detection using clinically relevant terms and point to visual evidence. We address two core challenges: First, we curate a rationale dataset to tackle the lack of paired images, annotations, and textual rationales for training. The dataset includes 180K image-mask-rationale triples with quality evaluated by expert radiologists. Second, we design rationale-informed optimization that disentangles and localizes fine-grained clinical concepts in a self-supervised manner without requiring pixel-level concept annotations. Experiments across medical benchmarks show our model demonstrates superior performance in segmentation, detection, and beyond. Our code is available via an anonymous link.
Paperid:566
Authors:Hang Zhou · Yuezhou Ma · Haixu Wu · Haowen Wang · Mingsheng Long
Abstract:
Deep models have recently emerged as promising tools to solve partial differential equations (PDEs), known as neural PDE solvers. While neural solvers trained from either simulation data or physics-informed loss can solve PDEs reasonably well, they are mainly restricted to a few instances of PDEs, e.g. a certain equation with a limited set of coefficients. This limits their generalization to diverse PDEs, preventing them from being practical surrogate models for classical numerical solvers. In this paper, we present the Universal Neural PDE Solver (Unisolver) capable of solving a wide scope of PDEs by training a novel Transformer model on diverse data and conditioned on diverse PDEs. Instead of purely scaling up data and parameters, Unisolver stems from the theoretical analysis of the PDE-solving process. Inspired by the mathematical structure of PDEs that a PDE solution is fundamentally governed by a series of PDE components such as equation symbols, coefficients, and boundary conditions, we define a complete set of PDE components and flexibly embed them as domain-wise and point-wise deep conditions for Transformer PDE solvers. Integrating physical insights with recent Transformer advances, Unisolver achieves consistent state-of-the-art results on three challenging large-scale benchmarks, showing impressive performance gains and favorable PDE generalizability.
Paperid:567
Authors:Sai Advaith Maddipatla · Nadav Bojan · Meital Bojan · Sanketh Vedula · Paul Schanda · Ailie Marx · Alexander Bronstein
Abstract:
Proteins exist as a dynamic ensemble of multiple conformations, and these motions are often crucial for their functions. However, current structure prediction methods predominantly yield a single conformation, overlooking the conformational heterogeneity revealed by diverse experimental modalities. Here, we present a framework for building experiment-grounded protein structure generative models that infer conformational ensembles consistent with measured experimental data. The key idea is to treat state-of-the-art protein structure predictors (e.g., AlphaFold3) as sequence-conditioned structural priors, and cast ensemble modeling as posterior inference of protein structures given experimental measurements. Through extensive real-data experiments, we demonstrate the generality of our method to incorporate a variety of experimental measurements. In particular, our framework uncovers previously unmodeled conformational heterogeneity from crystallographic densities, generates high-accuracy NMR ensembles orders of magnitude faster than the status quo, and incorporates pairwise cross-link constraints. Notably, we demonstrate that our ensembles outperform AlphaFold3 and sometimes fit experimental data better than structures publicly deposited in the Protein Data Bank (PDB). We believe that this approach will unlock building predictive models that fully embrace experimentally observed conformational diversity.
Paperid:568
Authors:Fangxu Yu · Lai Jiang · Haoqiang Kang · Shibo Hao · Lianhui Qin
Abstract:
The ability to generate diverse solutions to a given problem is a hallmark of human creativity. This divergent reasoning is also crucial for machines, enhancing their robustness and enabling them to assist humans in many applications such as scientific discovery. However, existing approaches to multi-step reasoning with large language models (LLMs) have mostly focused only on reasoning accuracy, without further discovering more diverse valid solutions. For example, supervised fine-tuning can improve LLM reasoning quality, but requires extensive supervised data to capture the full range of possible solutions. Reinforcement learning aims to find limited highest-reward solutions while neglecting the solution diversity. To fill this gap, we propose Flow of Reasoning (FoR), an efficient diversity-seeking LLM finetuning method aimed at improving reasoning quality and diversity with minimal data. FoR formulates multi-step LLM reasoning as a Markovian flow on a DAG-structured reasoning graph. This formulation allows us to incorporate and adapt principled GFlowNet approaches, for finetuning LLMs to sample divergent paths with probabilities proportional to the (unnormalized) reward of target problems. Extensive experiments show that, with limited training examples (e.g., 15 examples), FoR enables the discovery of diverse, creative, high-quality solutions, greatly outperforming a wide range of existing inference and training methods across six challenging reasoning tasks, including BlocksWorld (embodied reasoning), Game24 (math puzzle solving), Rubik's Cube (spatial reasoning), 1D-ARC (abstraction reasoning), GSM8k (math reasoning), and ProntoQA (logical reasoning).
Paperid:569
Authors:Daniil Tiapkin · Daniele Calandriello · Johan Ferret · Sarah Perrin · Nino Vieillard · Alexandre Rame · Mathieu Blondel
Abstract:
Post-training of language models (LMs) increasingly relies on the following two stages: (i) knowledge distillation, where the LM is trained to imitate a larger teacher LM, and (ii) reinforcement learning from human feedback (RLHF), where the LM is aligned by optimizing a reward model. In the second RLHF stage, a well-known challenge is reward hacking, where the LM over-optimizes the reward model, leading to degraded performance on the true objective, in line with Goodhart's law. In this paper, we investigate whether a similar phenomenon, that we call teacher hacking, can occur during knowledge distillation. This could arise because the teacher LM is itself an imperfect approximation of the true distribution. To study this, we propose a controlled experimental setup involving: (i) an oracle LM representing the ground-truth distribution, (ii) a teacher LM distilled from the oracle, and (iii) a student LM distilled from the teacher. Our experiments reveal the following insights. When using a fixed offline dataset for distillation, teacher hacking occurs; moreover, we can detect it by observing when the optimization process deviates from polynomial convergence laws. In contrast, employing online data generation techniques effectively mitigates teacher hacking. More precisely, we identify data diversity as the key factor in preventing hacking. Overall, our findings provide a deeper understanding of the benefits and limitations of distillation for building robust LMs.
Paperid:570
Authors:chengqian gao · Haonan Li · Liu Liu · Zeke Xie · Peilin Zhao · Zhiqiang Xu
Abstract:
The alignment of large language models (LLMs) often assumes that using more clean data yields better outcomes, overlooking the match between model capacity and example difficulty. Challenging this, we propose a new principle: preference data vary in difficulty, and overly difficult examples hinder alignment by exceeding the model's capacity. Through systematic experimentation, we validate this principle with three key findings: (1) preference examples vary in difficulty, as evidenced by consistent learning orders across alignment runs; (2) overly difficult examples significantly degrade performance across four LLMs and two datasets; and (3) the capacity of a model dictates its threshold for handling difficult examples, underscoring a critical relationship between data selection and model capacity. Building on this principle, we introduce Selective DPO, which filters out overly difficult examples. This simple adjustment improves alignment performance by 9-16\% in win rates on the AlpacaEval 2 benchmark compared to the DPO baseline, surpassing a series of DPO variants with different algorithmic adjustments. Together, these results illuminate the importance of aligning data difficulty with model capacity, offering a transformative perspective for improving alignment strategies in LLMs.
Paperid:571
Authors:Huanjian Zhou · Masashi Sugiyama
Abstract:
Sampling from high-dimensional probability distributions is fundamental in machine learning and statistics. As datasets grow larger, computational efficiency becomes increasingly important, particularly in reducing \emph{adaptive complexity}, namely the number of sequential rounds required for sampling algorithms. While recent works have introduced several parallelizable techniques, they often exhibit suboptimal convergence rates and remain significantly weaker than the latest lower bounds for log-concave sampling. To address this, we propose a novel parallel sampling method that improves the adaptive-complexity dependence on dimension $d$, reducing it from $\widetilde{\mathcal{O}}(\log^2 d)$ to $\widetilde{\mathcal{O}}(\log d)$, which is optimal for log-concave sampling under certain adaptive-complexity measures. Our approach builds on parallel simulation techniques from scientific computing.
Paperid:572
Authors:Wenhui Zhu · Peijie Qiu · Xiwen Chen · Zhangsihao Yang · Aristeidis Sotiras · Abolfazl Razi · Yalin Wang
Abstract:
Multiple Instance Learning (MIL) is a popular weakly-supervised method for various applications, with a particular interest in histological whole slide image (WSI) classification. Due to the gigapixel resolution of WSIs, applications of MIL in WSI classification typically necessitate a two-stage training scheme: first, extract features from a pre-trained backbone, and then perform MIL aggregation. However, it is well-known that this suboptimal training scheme suffers from "noisy" feature embeddings from the backbone and inherent weak supervision, hindering MIL from learning rich and generalizable features. Yet the most commonly used technique for mitigating such issues (i.e., dropout) has not been explored in MIL. In this paper, we empirically explore how effective dropout can be in MIL. Interestingly, we observe that dropping the top-k most important instances within a bag leads to better performance and generalization, even under noise attacks. Based on this key observation, we propose a novel MIL-specific dropout method, termed MIL-Dropout, which systematically determines which instances to drop. Experiments on five MIL benchmark datasets and two WSI datasets demonstrate that MIL-Dropout boosts the performance of current MIL methods with negligible computational cost. The code will be published after acceptance.
Paperid:573
Authors:Adam Karvonen · Can Rager · Johnny Lin · Curt Tigges · Joseph Bloom · David Chanin · Callum McDougall · Yeu-Tong Lau · Eoin Farrell · Arthur Conmy · Kola Ayonrinde · Demian Till · Matthew Wearden · Samuel Marks · Neel Nanda
Abstract:
Sparse autoencoders (SAEs) are a popular technique for interpreting language model activations, and there is extensive recent work on improving SAE effectiveness. However, most prior work evaluates progress using unsupervised proxy metrics with unclear practical relevance. We introduce SAEBench, a comprehensive evaluation suite that measures SAE performance across seven diverse metrics, spanning interpretability, feature disentanglement and practical applications like unlearning. To enable systematic comparison, we open-source a suite of over 200 SAEs across seven recently proposed SAE architectures and training algorithms. Our evaluation reveals that gains on proxy metrics do not reliably translate to better practical performance. For instance, while Matryoshka SAEs slightly underperform on existing proxy metrics, they substantially outperform other architectures on feature disentanglement metrics; moreover, this advantage grows with SAE scale. By providing a standardized framework for measuring progress in SAE development, SAEBench enables researchers to study scaling trends and make nuanced comparisons between different SAE architectures and training methodologies. Our interactive interface enables researchers to flexibly visualize relationships between metrics across hundreds of open-source SAEs at www.saebench.xyz.
Paperid:574
Authors:Haruka Kiyohara · Fan Yao · Sarah Dean
Abstract:
In two-sided platforms (e.g., video streaming or e-commerce), viewers and providers engage in interactive dynamics, where a larger provider population yields higher viewer utility and a larger viewer population yields higher provider utility. Despite the importance of such ``population effects'' on long-term platform health, recommendation policies do not generally take the participation dynamics into account. This paper thus studies the dynamics and policy design on two-sided platforms under population effects for the first time. Our control- and game-theoretic findings warn against the use of a myopic-greedy policy and shed light on the importance of provider-side considerations (i.e., effectively distributing exposure among provider groups) to improve social welfare via population growth. We also present a simple algorithm to optimize long-term social welfare by taking the population effects into account, and demonstrate its effectiveness in synthetic and real-data experiments.
Paperid:575
Authors:Kelsey Allen · Carl Doersch · Guangyao Zhou · Mohammed Suhail · Danny Driess · Ignacio Rocco · Yulia Rubanova · Thomas Kipf · Mehdi S. M. Sajjadi · Kevin Murphy · Joao Carreira · Sjoerd van Steenkiste
Abstract:
A current limitation of generative video models is that they produce plausible-looking frames but poor motion, an issue that is not well captured by FVD and other popular methods for evaluating generated videos. Here we go beyond FVD by developing a metric which better measures plausible object interactions and motion. Our novel approach is based on auto-encoding point tracks and yields motion features that can be used to compare distributions of videos (as few as one generated and one ground truth, or as many as two datasets), as well as reconstruction errors for evaluating the motion of single videos. We show that using point tracks instead of pixel reconstruction or action recognition features results in a metric which is markedly more sensitive to temporal distortions in synthetic data, and can predict human evaluations of temporal consistency and realism in generated videos obtained from open-source models better than a wide range of alternatives.
Paperid:576
Authors:Ye DU · Chen Yang · Nanxi Yu · Wanyu LIN · Qian Zhao · Shujun Wang
Abstract:
*De novo* peptide sequencing is a fundamental computational technique for ascertaining amino acid sequences of peptides directly from tandem mass spectrometry data, eliminating the need for reference databases. Cutting-edge models encode the observed mass spectra into latent representations from which peptides are predicted auto-regressively. However, the issue of missing fragmentation, attributable to factors such as suboptimal fragmentation efficiency and instrumental constraints, presents a formidable challenge in practical applications. To tackle this obstacle, we propose a novel computational paradigm called $\underline{\textbf{L}}$atent $\underline{\textbf{I}}$mputation before $\underline{\textbf{P}}$rediction (LIPNovo). LIPNovo is devised to compensate for missing fragmentation information within observed spectra before executing the final peptide prediction. Rather than generating raw missing data, LIPNovo performs imputation in the latent space, guided by the theoretical peak profile of the target peptide sequence. The imputation process is conceptualized as a set-prediction problem, utilizing a set of learnable peak queries to reason about the relationships among observed peaks and directly generate the latent representations of theoretical peaks through optimal bipartite matching. In this way, LIPNovo manages to supplement missing information during inference and thus boosts performance. Despite its simplicity, experiments on three benchmark datasets demonstrate that LIPNovo outperforms state-of-the-art methods by large margins. The code will be made publicly available.
Paperid:577
Authors:Konstantinos Vergopoulos · Mark Müller · Martin Vechev
Abstract:
Code agent development is an extremely active research area, where a reliable performance metric is critical for tracking progress and guiding new developments. This demand is underscored by the meteoric rise in popularity of SWE-Bench, a benchmark that challenges code agents to generate patches addressing GitHub issues given the full repository as context, and then evaluates their correctness by executing the human-written test suite extracted from the repository after the issue's resolution. However, constructing benchmarks like SWE-Bench requires substantial manual effort to set up historically accurate execution environments for testing. Crucially, this severely limits the number of considered repositories, e.g., just 12 for SWE-Bench. A danger of such a selection process is a distributional mismatch, i.e., the measured performance may not be representative of real-world scenarios, potentially misguiding development efforts. In this work, we address this challenge and introduce SetUpAgent, a fully automated system capable of historically accurate dependency setup, test execution, and result parsing. Using SetUpAgent, we generate two new datasets: (i) an extended version of SWE-Bench encompassing hundreds of repositories, and (ii) a benchmark centered on applications rather than libraries. Comparing these datasets to SWE-Bench with respect to their characteristics and code agent performance, we find significant distributional differences, including lower issue description quality and detail level, higher fix complexity, and, most importantly, up to 40% lower agent success rates.
Paperid:578
Authors:Yannis Montreuil · Shu Heng Yeo · Wei Tsang Ooi · Lai Xing Ng · Axel Carlier
Abstract:
The Two-Stage Learning-to-Defer framework has been extensively studied for classification and, more recently, regression tasks. However, many contemporary applications involve both classification and regression in an interdependent manner. In this work, we introduce a novel Two-Stage Learning-to-Defer framework for multi-task learning that jointly addresses these tasks. Our approach leverages a two-stage surrogate loss family, which we prove to be both ($\mathcal{G}, \mathcal{R}$)-consistent and Bayes-consistent, providing strong theoretical guarantees of convergence to the Bayes-optimal rejector. We establish consistency bounds explicitly linked to the cross-entropy surrogate family and the $L_1$-norm of the agents' costs, extending the theoretical minimizability gap analysis to the two-stage setting with multiple experts. We validate our framework on two challenging tasks: object detection, where classification and regression are tightly coupled and existing methods fail, and electronic health record analysis, in which we highlight the suboptimality of current learning-to-defer approaches.
Paperid:579
Authors:Da Xiao · Qingye Meng · Shengping Li · xingyuan yuan
Abstract:
We propose MUltiway Dynamic Dense (MUDD) connections, a simple yet effective method to address the limitations of residual connections and enhance cross-layer information flow in Transformers. Unlike existing dense connection approaches with static and shared connection weights, MUDD generates connection weights dynamically, depending on the hidden states at each sequence position and on each decoupled input stream (the query, key, value or residual) of a Transformer block. MUDD connections can be seamlessly integrated into any Transformer architecture to create MUDDFormer. Extensive experiments show that MUDDFormer significantly outperforms Transformers across various model architectures and scales in language modeling, matching the performance of Transformers trained with ~1.8x-2.4x more compute. Notably, MUDDPythia-2.8B matches Pythia-6.9B in pretraining perplexity and downstream tasks, and even rivals Pythia-12B in five-shot settings, while adding only 0.23% parameters and 0.4% computation. Code and pre-trained models in JAX and PyTorch will be released upon acceptance.
Paperid:580
Authors:Leonardo Hernandez Cano · Maxine Perroni-Scharf · Neil Dhir · Arun Ramamurthy · Armando Solar-Lezama
Abstract:
We present Structured World Modeling for Policy Optimization (SWMPO), a framework for unsupervised learning of neurosymbolic finite state machines (FSMs) that capture environmental structure for policy optimization. Traditional unsupervised world modeling methods rely on unstructured representations, such as neural networks, that do not explicitly represent high-level patterns within the system (e.g., patterns in the dynamics of regions such as \emph{water} and \emph{land}). Instead, SWMPO models the environment as an FSM, where each state corresponds to a specific region with distinct dynamics. This structured representation can then be leveraged for tasks like policy optimization. Previous works that synthesize FSMs for this purpose have been limited to discrete spaces, not continuous ones. In contrast, our proposed FSM synthesis algorithm operates in an unsupervised manner, leveraging low-level features from unprocessed, non-visual data, making it adaptable across various domains. The synthesized FSM models are expressive enough to be used in a model-based reinforcement learning scheme that leverages offline data to efficiently synthesize environment-specific world models. We demonstrate the advantages of SWMPO by benchmarking its environment modeling capabilities in simulated environments.
Paperid:581
Authors:JUNHAO HU · Wenrui Huang · Weidong Wang · Haoyi Wang · tiancheng hu · zhang qin · Hao Feng · Xusheng Chen · Yizhou Shan · Tao Xie
Abstract:
Large Language Models (LLMs) are critical for a wide range of applications, but serving them efficiently becomes increasingly challenging as inputs become more complex. Context caching improves serving performance by exploiting inter-request dependency and reusing key-value (KV) cache across requests, thus improving time-to-first-token (TTFT). However, existing prefix-based context caching requires exact token prefix matches, limiting cache reuse in few-shot learning, multi-document QA, or retrieval-augmented generation, where prefixes may vary while some content, like files, remains unchanged. In this paper, we present PICI, an LLM serving system that introduces position-independent context caching (PIC), enabling modular KV cache reuse regardless of prefix and token chunk position. PICI incorporates a LegoLink algorithm, which leverages static attention sparsity to eliminate the influence of the attention sink phenomenon in each chunk and minimize recomputation for accuracy recovery when using PIC. Our experiments demonstrate that PICI delivers up to 8× improvements in TTFT and 7× in throughput over existing systems, with negligible or no accuracy loss.
Paperid:582
Authors:Joongkyu Lee · Min-hwan Oh
Abstract:
In this paper, we consider combinatorial reinforcement learning with preference feedback, where a learning agent sequentially offers an action (an assortment of multiple items) to a user, whose preference feedback follows a multinomial logistic (MNL) model. This framework allows us to model real-world scenarios, particularly those involving long-term user engagement, such as recommender systems and online advertising. However, this framework faces two main challenges: (1) the unknown value of each item, unlike traditional MNL bandits that only address single-step preference feedback, and (2) the difficulty of ensuring optimism while maintaining tractable assortment selection in the combinatorial action space with unknown values. In this paper, we assume a contextual MNL preference model, where the mean utilities are linear, and the value of each item is approximated by a general function. We propose an algorithm, MNL-VQL, that addresses these challenges, making it both computationally and statistically efficient. As a special case, for linear MDPs (with MNL preference feedback), we establish the first regret lower bound in this framework and show that MNL-VQL achieves near-optimal regret. To the best of our knowledge, this is the first work to provide statistical guarantees in combinatorial RL with preference feedback.
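The MNL choice model at the core of this framework is simple to state in code. The sketch below is a generic MNL with an outside no-choice option of utility zero, a standard convention; the utilities are illustrative numbers, not values from the paper.

```python
import numpy as np

def mnl_choice_probs(utilities):
    """P(user chooses item i from the offered assortment) under the MNL model,
    with an outside 'no-choice' option whose utility is fixed at 0."""
    expu = np.exp(np.asarray(utilities, dtype=float))
    denom = 1.0 + expu.sum()   # the 1.0 is exp(0) for the no-choice option
    return expu / denom

# Three offered items with (illustrative) mean utilities.
p = mnl_choice_probs([1.0, 0.5, -0.2])
print(p, 1.0 - p.sum())  # item probabilities, then the no-choice probability
```

The probability of the no-choice option is the remaining mass, one minus the sum of the item probabilities, which is why the item probabilities sum to less than one.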
Paperid:583
Authors:Ganchao Wei · Li Ma
Abstract:
Flow matching (FM) is a family of training algorithms for fitting continuous normalizing flows (CNFs). Conditional flow matching (CFM) exploits the fact that the marginal vector field of a CNF can be learned by fitting a least-squares regression to the conditional vector field specified given one or both ends of the flow path. In this paper, we extend the CFM algorithm by defining conditional probability paths along "streams", instances of latent stochastic paths that connect source and target data pairs, which are modeled with Gaussian process (GP) distributions. The unique distributional properties of GPs help preserve the "simulation-free" nature of CFM training. We show that this generalization of CFM can effectively reduce the variance in the estimated marginal vector field at a moderate computational cost, thereby improving the quality of the generated samples under common metrics. Additionally, adopting the GP on the streams allows for flexibly linking multiple correlated training data points (e.g., time series). We empirically validate our claims through both simulations and applications to image and neural time series data.
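As a hedged illustration of the underlying CFM idea (plain CFM with a linear-interpolation conditional path and independent coupling, not the authors' GP-stream extension), the conditional vector field can be fit by least-squares regression and then integrated to transport samples; a toy linear-in-features model stands in for the usual neural network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-d source (standard normal) and target (shifted normal) samples.
x0 = rng.normal(0.0, 1.0, size=1000)   # source endpoint samples
x1 = rng.normal(4.0, 0.5, size=1000)   # target endpoint samples

# Conditional path: x_t = (1 - t) * x0 + t * x1,
# with conditional vector field u = x1 - x0 as the regression target.
t = rng.uniform(0.0, 1.0, size=1000)
xt = (1.0 - t) * x0 + t * x1
u = x1 - x0

# Least-squares fit of a linear-in-features field v(x, t) = w . [x, t, 1].
A = np.stack([xt, t, np.ones_like(t)], axis=1)
w, *_ = np.linalg.lstsq(A, u, rcond=None)

# Euler-integrate the learned field to transport fresh source samples.
x = rng.normal(0.0, 1.0, size=1000)
for step in range(100):
    s = step / 100
    x = x + 0.01 * (w[0] * x + w[1] * s + w[2])

print(round(float(x.mean()), 2))  # should land near the target mean of 4
```

The paper's variance-reduction point is about this regression: averaging over stochastic "streams" instead of straight pairwise lines lowers the variance of the fitted marginal field.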
Paperid:584
Authors:Yuxin Dong · Haoran Guo · Tieliang Gong · Wen Wen · Chen Li
Abstract:
Information-theoretic bounds, while achieving significant success in analyzing the generalization of randomized learning algorithms, have been criticized for their slow convergence rates and overestimation. This paper presents novel bounds that bridge the expected empirical and population risks through a binarized variant of the Jensen-Shannon divergence. Leveraging our foundational lemma that characterizes the interaction between an arbitrary and a binary variable, we derive hypothesis-based bounds that enhance existing conditional mutual information bounds by reducing the number of conditioned samples from $2$ to $1$. We additionally establish prediction-based bounds that surpass prior bounds based on evaluated loss mutual information measures. Thereafter, through a new binarization technique for the evaluated loss variables, we obtain exactly tight generalization bounds broadly applicable to general randomized learning algorithms for any bounded loss functions. Our results effectively address key limitations of previous results in analyzing certain stochastic convex optimization problems, without requiring additional stability or compressibility assumptions about the learning algorithm.
Paperid:585
Authors:Xiaojing Du · Jiuyong Li · Debo Cheng · Lin Liu · Wentao Gao · XIONGREN CHEN · Ziqi Xu
Abstract:
Estimating causal effects is crucial for decision-makers in many applications, but it is particularly challenging with observational network data due to peer interactions. Some algorithms have been proposed to estimate causal effects involving network data, particularly peer effects, but they often fail to tell apart diverse peer effects. To address this issue, we propose a general setting which considers both peer direct effects and peer indirect effects, as well as the effect of an individual's own treatment, and provide the identification conditions of these causal effects along with their proofs. To differentiate these effects, we leverage causal mediation analysis and tailor it specifically for network data. Furthermore, given the inherent challenges of accurately estimating effects in networked environments, we propose to incorporate attention mechanisms to capture the varying influences of different neighbors and to explore high-order neighbor effects using multi-layer graph neural networks (GNNs). Additionally, we employ the Hilbert-Schmidt Independence Criterion (HSIC) to further enhance the model's robustness and accuracy. Extensive experiments on two semi-synthetic datasets based on real-world networks, as well as on recommendation system data, confirm the effectiveness of our approach. Our theoretical findings have the potential to improve intervention strategies in networked systems, particularly in social networks and public health.
Paperid:586
Authors:Yue Dai · Liang Liu · Xulong Tang · Youtao Zhang · Jun Yang
Abstract:
Temporal graph neural networks (TGNNs) have achieved significant momentum in many real-world dynamic graph tasks. While most existing TGNN attack methods assume worst-case scenarios where attackers have complete knowledge of the input graph, this assumption may not always hold in real-world situations, where attackers can, at best, access information about existing nodes and edges but not future ones after the attack. However, studying adversarial attacks under these constraints is crucial, as limited future knowledge can reveal TGNN vulnerabilities overlooked in idealized settings. Nevertheless, designing effective attacks in such scenarios is challenging: the evolving graph can weaken their impact and make it hard to affect unseen nodes. To address these challenges, we introduce MemFreezing, a novel adversarial attack framework that delivers long-lasting and spreading disruptions in TGNNs without requiring post-attack knowledge of the graph. MemFreezing strategically injects fake nodes or edges to push node memories into a stable "frozen state," reducing their responsiveness to subsequent graph changes and limiting their ability to convey meaningful information. As the graph evolves, the affected nodes maintain and propagate their frozen state through their neighbors. Experimental results show that MemFreezing persistently degrades TGNN performance across various tasks, offering a more enduring adversarial strategy under limited future knowledge.
Paperid:587
Authors:Rudrajit Das · Inderjit Dhillon · Alessandro Epasto · Adel Javanmard · Jieming Mao · Vahab Mirrokni · Sujay Sanghavi · Peilin Zhong
Abstract:
The performance of a model trained with noisy labels is often improved by simply *retraining* the model with its *own predicted hard labels* (i.e., $1$/$0$ labels). Yet, a detailed theoretical characterization of this phenomenon is lacking. In this paper, we theoretically analyze retraining in a linearly separable setting with randomly corrupted labels given to us and prove that retraining can improve the population accuracy obtained by initially training with the given (noisy) labels. To the best of our knowledge, this is the first such theoretical result. Retraining finds application in improving training with local label differential privacy (DP) which involves training with noisy labels. We empirically show that retraining selectively on the samples for which the predicted label matches the given label significantly improves label DP training at no extra privacy cost; we call this consensus-based retraining. As an example, when training ResNet-18 on CIFAR-100 with $\epsilon=3$ label DP, we obtain more than $6$% improvement in accuracy with consensus-based retraining.
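A minimal sketch of the consensus-based retraining recipe described above, using a plain logistic-regression model on synthetic data (the dataset, model, and 20% flip rate are illustrative assumptions, not the paper's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logreg(X, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression; returns the weight vector."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

# Two Gaussian classes; 20% of the given training labels are randomly flipped.
n = 4000
y_true = rng.integers(0, 2, n)
X = rng.normal(0, 1, (n, 2)) + np.where(y_true[:, None] == 1, 1.0, -1.0)
X = np.hstack([X, np.ones((n, 1))])        # bias feature
flip = rng.random(n) < 0.2
y_noisy = np.where(flip, 1 - y_true, y_true)

# Step 1: train on the given (noisy) labels, then predict hard labels.
w = fit_logreg(X, y_noisy)
y_hat = (X @ w > 0).astype(int)

# Step 2 (consensus-based retraining): keep only the samples where the
# predicted hard label agrees with the given label, and retrain on them.
keep = y_hat == y_noisy
w2 = fit_logreg(X[keep], y_noisy[keep])

clean_frac_all = float((y_noisy == y_true).mean())
clean_frac_kept = float((y_noisy[keep] == y_true[keep]).mean())
print(round(clean_frac_all, 2), round(clean_frac_kept, 2))
```

Filtering on agreement between the predicted hard label and the given label yields a markedly cleaner subset, which is the intuition for why the second round of training helps.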
Paperid:588
Authors:Akira Ito · Masanori Yamada · Atsutoshi Kumagai
Abstract:
Ainsworth et al. empirically demonstrated that linear mode connectivity (LMC) can be achieved between two independently trained neural networks (NNs) by applying an appropriate parameter permutation. LMC is satisfied if a linear path with non-increasing test loss exists between the models, suggesting that NNs trained with stochastic gradient descent (SGD) converge to a single approximately convex low-loss basin under permutation symmetries. However, Ainsworth et al. verified LMC for two models and provided only limited discussion on its extension to multiple models. In this paper, we conduct a more detailed empirical analysis. First, we show that existing permutation search methods designed for two models can fail to transfer multiple models into the same convex low-loss basin. Next, we propose a permutation search method using a straight-through estimator for multiple models (STE-MM). We then experimentally demonstrate that even when multiple models are given, the test loss of the merged model remains nearly the same as the losses of the original models when using STE-MM, and the loss barriers between all permuted model pairs are also small. Additionally, from the perspective of the trace of the Hessian matrix, we show that the loss sharpness around the merged model decreases as the number of models increases with STE-MM, indicating that LMC for multiple models is more likely to hold.
Paperid:589
Authors:Pengyi Li · Jianye Hao · Hongyao Tang · Yifu Yuan · Jinbin Qiao · Zibin Dong · Yan Zheng
Abstract:
Reward functions are crucial for policy learning. Large language models (LLMs), with valuable task-related knowledge, provide an automated solution for high-quality reward design. However, reward function design is often inefficient and inaccurate. To solve these problems, we propose an efficient automated reward design framework, called R*. R* consists of two main processes: reward structure evolution and parameter alignment optimization. To design high-quality reward structures, R* maintains a reward function population and modularizes the functional components. We employ LLMs as the mutation operator and propose module crossover for efficient exploration and exploitation. To design more accurate reward parameters, R* first leverages LLMs to generate multiple critic functions for trajectory comparison and annotation. A voting mechanism is then employed to collect the trajectory segments with highly confident labels. Subsequently, preference learning is applied to refine the reward function parameters based on the labeled data. Experiments on diverse robotic control tasks demonstrate that R* outperforms strong baselines in both reward design efficiency and quality, surpassing human-designed reward functions.
Paperid:590
Authors:Michael Zhang · Zhilin Wang · Jena Hwang · Yi Dong · Olivier Delalleau · Yejin Choi · Eunsol Choi · Xiang Ren · Valentina Pyatkin
Abstract:
We examine diverging preferences in human-labeled preference datasets. We develop a taxonomy of disagreement sources spanning ten categories across four high-level classes, and find that the majority of disagreements are due to factors such as task underspecification or response style. Our findings run counter to standard reward modeling approaches, which are designed under the assumption that annotator disagreement is noise. We then explore how these findings impact two areas of LLM development: reward modeling and evaluation. In our experiments, we demonstrate how standard reward modeling methods, like the Bradley-Terry model, and LLM-as-Judge evaluation methods fail to differentiate whether a given preference judgment is the result of unanimous agreement among annotators or the majority opinion among diverging user preferences. These findings highlight remaining challenges in LLM evaluations, which are greatly influenced by divisive features like response style, and in developing pluralistically aligned LLMs. To address these issues, we develop methods for identifying diverging preferences to mitigate their influence in evaluations and during LLM training.
Paperid:591
Authors:Yilun Kong · Guozheng Ma · Qi Zhao · Haoyu Wang · Li Shen · Xueqian Wang · Dacheng Tao
Abstract:
Although recent advancements in offline multi-task reinforcement learning (MTRL) have harnessed the powerful capabilities of the Transformer architecture, most approaches focus on a limited number of tasks, and scaling to extremely massive task sets remains a formidable challenge. In this paper, we first revisit the key impact of task number on current MTRL methods, and further reveal that naively expanding the parameters proves insufficient to counteract the performance degradation as the number of tasks escalates. Building upon these insights, we propose M3DT, a novel mixture-of-experts (MoE) framework that tackles task scalability by further unlocking the model's parameter scalability. Specifically, we enhance both the architecture and the optimization of the agent: we strengthen the Decision Transformer (DT) backbone with MoE to reduce the task load on parameter subsets, and introduce a three-stage training mechanism to facilitate efficient training with optimal performance. Experimental results show that, by increasing the number of experts, M3DT not only consistently enhances its performance as the model expands on a fixed number of tasks, but also exhibits remarkable task scalability, successfully extending to 160 tasks with superior performance.
Paperid:592
Authors:Nguyen Nhat Minh To · Paul Wilson · Viet Nguyen · Mohamed Harmanani · Michael Cooper · Fahimeh Fooladgar · Purang Abolmaesumi · Parvin Mousavi · Rahul G. Krishnan
Abstract:
Subpopulation shifts, characterized by disparities in subpopulation distributions between training and target datasets, can significantly degrade the performance of machine learning models. Current solutions to subpopulation shifts often modify empirical risk minimization with reweighting strategies to improve generalization across subpopulations. This strategy often relies on assumptions about the number and nature of subpopulations and on annotations of subpopulation membership, which are unavailable for many real-world datasets. We propose a new solution that heuristically explores the feature subspace: we train many classifiers while enforcing diversification to promote the discovery and correct classification of new subpopulations, without requiring prior knowledge of the subpopulations. Given a feature extractor network, we replace its standard linear layer with a mixture of prototypical classifiers, where each member is trained to classify the data while focusing on different features and samples than the other members. We demonstrate that our solution outperforms the prior state-of-the-art in worst-group accuracy on most benchmarks, using empirical evaluations across nine real-world datasets covering diverse domains and subpopulation shift types (code is available at https://anonymous.4open.science/r/prototypical_ensembles-BCB3).
Paperid:593
Authors:Walter McKelvie · Jasper C.H. Lee · Maoyuan Song · Paul Valiant
Abstract:
We consider the basic statistical challenge of designing an ''all-purpose'' mean estimation algorithm that is recommendable across a variety of settings and models. Recent work by \citet{Lee:2022} introduced the first 1-d mean estimator whose error in the standard finite-variance + i.i.d. setting is optimal even in its constant factors; experimental demonstration of its good performance was shown by \citet{Gobet:2022}. Yet, unlike for classic (but not necessarily practical) estimators such as median-of-means and trimmed mean, this new algorithm lacked proven robustness guarantees in other settings, including the settings of adversarial data corruption and heavy-tailed distributions with infinite variance. Such robustness is important for practical use cases. This raises an important research question: is it possible to have a mean estimator that is robust, *without* sacrificing provably optimal performance in the standard i.i.d. setting? In this work, we show that Lee and Valiant's estimator is in fact an ''all-purpose'' mean estimator by proving:
- It is robust to an $\eta$-fraction of data corruption, even in the strong contamination model; it has optimal estimation error $O(\sigma\sqrt{\eta})$ for distributions with variance $\sigma^2$.
- For distributions with finite $z^\text{th}$ moment, for $z \in (1,2)$, it has optimal estimation error, matching the lower bounds of \citet{Devroye:2016} up to constants.
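As a point of reference, one of the classic robust estimators named in the abstract (not the Lee–Valiant estimator the paper analyzes) can be stated in a few lines. A minimal median-of-means sketch: split the data into k blocks, average each block, and take the median of the block means, so a few corrupted points can spoil at most a few blocks.

```python
import numpy as np

def median_of_means(x, k):
    """Classic robust 1-d mean estimator: shuffle, split into k blocks,
    average each block, and return the median of the block means."""
    x = np.random.default_rng(0).permutation(np.asarray(x, dtype=float))
    blocks = np.array_split(x, k)
    return float(np.median([b.mean() for b in blocks]))
```

On a sample of 99 zeros plus one outlier of 10^6, the sample mean is 10^4, while median-of-means with 10 blocks returns 0, since the outlier lands in a single block whose mean is then discarded by the median.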
Paperid:594
Authors:Gregoire Fournier · Sourav Medya
Abstract:
Graph neural networks (GNNs) have been widely used in various domains such as social networks, molecular biology, or recommendation systems. Concurrently, different explanation methods for GNNs have arisen to complement their black-box nature. Explanations of the GNNs’ predictions can be categorized into two types—factual and counterfactual. Given a GNN trained on binary classification into “accept” and “reject” classes, a global counterfactual explanation consists of generating a small set of “accept” graphs relevant to all of the input “reject” graphs. The transformation of a “reject” graph into an “accept” graph is called a recourse. A common recourse explanation is a small set of recourses from which every “reject” graph can be turned into an “accept” graph. Although local counterfactual explanations have been studied extensively, the problem of finding common recourses for global counterfactual explanation remains unexplored, particularly for GNNs. In this paper, we formalize the common recourse explanation problem and design an effective algorithm, COMRECGC, to solve it. We benchmark our algorithm against strong baselines on four different real-world graph datasets and demonstrate the superior performance of COMRECGC against the competitors. We also compare the common recourse explanations to the graph counterfactual explanations, showing that common recourse explanations are either comparable or superior, making them worth considering for applications such as drug discovery or computational biology.
Paperid:595
Authors:Hamza Cherkaoui · Merwan Barlier · Igor Colin
Abstract:
The multi-agent linear bandit setting is a well-known setting for which designing efficient collaboration between agents remains challenging. This paper studies the impact of data sharing among agents on regret minimization. Unlike most existing approaches, our contribution does not rely on any assumptions on the structure of the bandit parameters. Our main result formalizes the trade-off between the bias and uncertainty of the bandit parameter estimation for efficient collaboration. This result is the cornerstone of the Bandit Adaptive Sample Sharing (BASS) algorithm, whose efficiency over the current state-of-the-art is validated through both theoretical analysis and empirical evaluations on synthetic and real-world datasets. Furthermore, we demonstrate that, when agents' parameters display a cluster structure, our algorithm accurately recovers them.
Paperid:596
Authors:Hyeongwon Jang · Changhun Kim · Eunho Yang
Abstract:
Recent explainable AI (XAI) methods for time series primarily estimate pointwise attribution magnitudes, while overlooking the directional (positive/negative) impact on model predictions, leading to suboptimal identification of significant points. With regard to this, our analysis demonstrates that conventional Integrated Gradients (IG) effectively capture critical points that have both positive and negative impacts on model predictions. However, existing evaluation metrics fail to adequately recognize this capability. To address this limitation, we first propose novel evaluation metrics, Cumulative Prediction Difference (CPD) and Cumulative Prediction Preservation (CPP), designed to rigorously assess whether attribution methods accurately identify significant positive and negative points in time series XAI. Under these metrics, we discover that conventional IG outperforms its recent counterparts. However, directly applying IG to time series data may lead to suboptimal outcomes, as the generated paths do not take temporal relationships into account and introduce out-of-distribution samples. To overcome these challenges, we introduce TIMING, which enhances IG by incorporating temporal awareness while maintaining its theoretical properties. Extensive experiments on synthetic and real-world time series benchmarks demonstrate that TIMING outperforms existing time series XAI baselines.
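For context, vanilla Integrated Gradients averages gradients along a straight-line path from a baseline to the input; the abstract's critique is precisely that this path ignores temporal structure. A minimal sketch of the standard method (not the paper's temporally-aware variant), assuming a closed-form gradient function is available:

```python
import numpy as np

def integrated_gradients(f_grad, x, baseline, steps=256):
    """Straight-line Integrated Gradients via a midpoint Riemann sum.
    f_grad(z) must return the gradient of the model output at point z."""
    alphas = (np.arange(steps) + 0.5) / steps           # midpoints of [0, 1]
    path = baseline + alphas[:, None] * (x - baseline)  # points on the path
    grads = np.stack([f_grad(z) for z in path])
    return (x - baseline) * grads.mean(axis=0)          # (x - x') * avg grad
```

For f(z) = Σ z_i² with a zero baseline, the attribution of x_i is x_i², and the attributions sum to f(x) − f(baseline), the completeness property IG is known for.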
Paperid:597
Authors:Saaketh Narayan · Abhay Gupta · Mansheej Paul · Davis Blalock
Abstract:
Large Language Model training with 8-bit floating point (FP8) formats promises significant efficiency improvements, but reduced numerical precision makes training challenging. It is currently possible to train in FP8 only if one is willing to tune various hyperparameters, reduce model scale, or accept the overhead of computing dynamic scale factors. We demonstrate simple, scalable FP8 training that requires no dynamic scaling factors or special hyperparameters, even at large model sizes. Our method, µnit Scaling (µS), enables simple hyperparameter transfer across model widths, matched numerics across training and inference, and other desirable properties. In particular, µnit Scaling is straightforward to implement, consisting of a set of minimal interventions based on a first-principles analysis of common transformer operations. We validate our method by successfully training models from 1B to 13B parameters, performing all hidden linear layer computations in FP8. We achieve quality equal to higher precision baselines while also training up to 33% faster.
Paperid:598
Authors:Xuesong Wang · He Zhao · Edwin V. Bonilla
Abstract:
Neural Processes (NPs) are deep probabilistic models that represent stochastic processes by conditioning their prior distributions on a set of context points. Despite their advantages in uncertainty estimation for complex distributions, NPs enforce parameterization coupling between the conditional prior model and the posterior model. We show that this coupling amounts to prior misspecification and revisit the NP objective to address this issue. More specifically, we propose Rényi Neural Processes (RNP), a method that replaces the standard KL divergence with the Rényi divergence, dampening the effects of the misspecified prior during posterior updates. We validate our approach across multiple benchmarks including regression and image inpainting tasks, and show significant performance improvements of RNPs in real-world problems. Our extensive experiments show consistently better log-likelihoods over state-of-the-art NP models.
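RNP applies the Rényi divergence inside the NP variational objective; as a self-contained illustration of the divergence itself (not the paper's objective), the closed form between two univariate Gaussians shows how the order α interpolates toward the KL divergence used by standard NPs as α → 1:

```python
import numpy as np

def renyi_gauss(mu1, s1, mu2, s2, alpha):
    """Rényi divergence D_alpha(N(mu1, s1^2) || N(mu2, s2^2)), closed form;
    requires (1 - alpha) * s1^2 + alpha * s2^2 > 0 and alpha != 1."""
    var_a = (1 - alpha) * s1**2 + alpha * s2**2
    return (np.log(s2 / s1)
            + np.log(s2**2 / var_a) / (2 * (alpha - 1))
            + alpha * (mu1 - mu2)**2 / (2 * var_a))

def kl_gauss(mu1, s1, mu2, s2):
    """KL divergence between the same pair of Gaussians (the alpha -> 1 limit)."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5
```

The Rényi divergence is non-decreasing in α, so orders α < 1 penalize prior mismatch more gently than the KL, which is the intuition behind dampening a misspecified prior.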
Paperid:599
Authors:Xunkai Li · Zhengyu Wu · Kaichi Yu · Hongchao Qin · Guang Zeng · Rong-Hua Li · Guoren Wang
Abstract:
The directed graph (digraph), as a generalization of undirected graphs, exhibits superior representation capability in modeling complex topology systems and has garnered considerable attention in recent years. Despite the notable efforts made by existing DiGraph Neural Networks (DiGNNs) to leverage directed edges, they still fail to comprehensively delve into the abundant data knowledge concealed in the digraphs. This data-level limitation results in model-level sub-optimal predictive performance and underscores the necessity of further exploring the potential correlations between the directed edges (topology) and node profiles (feature and labels) from a data-centric perspective, thereby empowering model-centric neural networks with stronger encoding capabilities. In this paper, we propose \textbf{E}ntropy-driven \textbf{D}igraph knowl\textbf{E}dge distillatio\textbf{N} (EDEN), which can serve as a data-centric digraph learning paradigm or a model-agnostic hot-plug data-centric Knowledge Distillation (KD) module. The core idea is to achieve data-centric ML, guided by our proposed hierarchical encoding theory for structured data. Specifically, EDEN first utilizes directed structural measurements from a topology perspective to construct a coarse-grained Hierarchical Knowledge Tree (HKT). Subsequently, EDEN quantifies the mutual information of node profiles to refine knowledge flow in the HKT, enabling data-centric KD supervision within model training. As a general framework, EDEN can also naturally extend to undirected scenarios and demonstrate satisfactory performance. In our experiments, EDEN has been widely evaluated on 14 (di)graph datasets (homophily and heterophily) and across 4 downstream tasks. The results demonstrate that EDEN attains SOTA performance and exhibits strong improvement for prevalent (Di)GNNs.
Paperid:600
Authors:Guner Dilsad ER · Sebastian Trimpe · Michael Muehlebach
Abstract:
We consider a distributed learning problem, where agents minimize a global objective function by exchanging information over a network. Our approach has two distinct features: (i) it substantially reduces communication by triggering communication only when necessary, and (ii) it is agnostic to the data distribution among the different agents. We can therefore guarantee convergence even if the local data distributions of the agents are arbitrarily distinct. We analyze the convergence rate of the algorithm both in convex and nonconvex settings and derive accelerated convergence rates in a convex setting. We also characterize the effect of communication failures and demonstrate that our algorithm is robust to them. The article concludes by presenting numerical results from distributed learning tasks on the MNIST and CIFAR-10 datasets. The experiments underline communication savings of 35% or more due to the event-based communication strategy, show resilience towards heterogeneous data distributions, and highlight that our approach outperforms common baselines such as FedAvg, FedProx, SCAFFOLD and FedADMM.
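The core idea of event-based communication can be sketched with a toy, single-agent drift rule: share the iterate only when it has moved far enough since the last broadcast. This is a hypothetical trigger for illustration, not the paper's exact criterion, and the quadratic objective and threshold are assumptions:

```python
def run_event_triggered(grad_fn, x0, lr=0.1, rounds=100, threshold=0.05):
    """Toy event-based scheme: take local gradient steps and broadcast the
    iterate only when it has drifted past `threshold` since the last share."""
    x, last_shared, broadcasts = x0, x0, 0
    for _ in range(rounds):
        x = x - lr * grad_fn(x)
        if abs(x - last_shared) > threshold:  # event trigger fires
            last_shared = x
            broadcasts += 1
    return x, broadcasts
```

On the quadratic objective (z − 3)²/2 starting from 0, the iterate still converges to the minimizer, but the number of broadcasts is bounded by the total path length divided by the threshold, far fewer than the number of optimization rounds.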
Paperid:601
Authors:Taehyun Cho · Seokhun Ju · Seungyub Han · Dohyeong Kim · Kyungjae Lee · Jungwoo Lee
Abstract:
To design rewards that align with human goals, Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent technique for learning reward functions from human preferences and optimizing models using reinforcement learning algorithms. However, existing RLHF methods often misinterpret trajectories as being generated by an optimal policy, causing inaccurate likelihood estimation and suboptimal learning. To address this, we propose Policy-labeled Preference Learning (PPL) within the Direct Preference Optimization (DPO) framework, which resolves these likelihood mismatch problems by modeling human preferences with regret, reflecting the efficiency of executed policies. Additionally, we introduce a contrastive KL regularization term derived from regret-based principles to enhance sequential contrastive learning. Experiments in high-dimensional continuous control environments demonstrate PPL's significant improvements in offline RLHF performance and its effectiveness in online settings.
Paperid:602
Authors:Chansophea Wathanak In · Yi Li · David Woodruff · Xuan Wu
Abstract:
Robustness to outliers is important in machine learning. Many classical problems, including subspace embedding, clustering, and low-rank approximation, lack scalable, outlier-resilient algorithms. This paper considers machine learning problems of the form $\min_{x\in \mathbb{R}^d} F(x)$, where $F(x)=\sum_{i=1}^n F_i(x)$, and their robust counterparts $\min_{x\in\mathbb{R}^d} F^{(m)}(x)$, where $F^{(m)}(x)$ denotes the sum of all but the $m$ largest $F_i(x)$ values. We develop a general framework for constructing $\epsilon$-coresets for such robust problems, where an $\epsilon$-coreset is a weighted subset of functions $\{F_1(x),\dots,F_n(x)\}$ that provides a $(1+\epsilon)$-approximation to $F(x)$. Specifically, if the original problem $F$ has total sensitivity $T$ and admits a vanilla $\epsilon$-coreset of size $S$, our algorithm constructs an $\epsilon$-coreset of size $\tilde{O}(\frac{mT}{\epsilon})+S$ for the robust objective $F^{(m)}$. This coreset size can be shown to be near-tight for $\ell_2$ subspace embedding. Our coreset algorithm has scalable running time and leads to new or improved algorithms for the robust optimization problems. Empirical evaluations demonstrate that our coresets outperform uniform sampling on real-world data sets.
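The robust objective F^(m) defined above is simple to evaluate directly (the paper's contribution is the coreset construction for approximating it at scale, not this evaluation). A two-line sketch:

```python
import numpy as np

def robust_objective(losses, m):
    """F^(m)(x): the sum of all but the m largest per-function losses
    F_i(x), i.e. the m worst points are treated as outliers and dropped."""
    losses = np.sort(np.asarray(losses, dtype=float))
    return float(losses[:len(losses) - m].sum())
```

With per-point losses [1, 2, 3, 100], dropping the single largest value (m = 1) yields 6, while m = 0 recovers the vanilla objective F(x) = 106.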
Paperid:603
Authors:Yuyang Wang · Anurag Ranjan · Joshua M Susskind · Miguel Angel Bautista Martin
Abstract:
Flow Matching models have emerged as a powerful method for generative modeling on domains like images or videos, and even on irregular or unstructured data like 3D point clouds or protein structures. These models are commonly trained in two stages: first, a data compressor is trained, and in a subsequent training stage a flow matching generative model is trained in the latent space of the data compressor. This two-stage paradigm sets obstacles for unifying models across data domains, as hand-crafted compressor architectures are used for different data modalities. To this end, we introduce INRFlow, a domain-agnostic approach to learning flow matching transformers directly in ambient space. Drawing inspiration from INRs, we introduce a conditionally independent point-wise training objective that enables INRFlow to make predictions continuously in coordinate space. Our empirical results demonstrate that INRFlow effectively handles different data modalities such as images, 3D point clouds and protein structure data, achieving strong performance in all domains and outperforming comparable approaches. INRFlow is a promising step towards domain-agnostic flow matching generative models that can be trivially adopted in different data domains.
Paperid:604
Authors:Sabri Eyuboglu · Avanika Narayan · Dan Biderman · Avner May · Scott Linderman · James Zou · Christopher Re
Abstract:
We investigate an emerging setup in which a small, on-device language model (LM) with access to local data collaborates with a frontier, cloud-hosted LM to solve real-world tasks involving financial, medical, and scientific reasoning over long documents. Can a local-remote collaboration reduce cloud inference costs while preserving quality? First, we consider a naïve communication protocol where the local and remote models simply chat back and forth. Because only the local model reads the full context, this protocol achieves a 30.4× reduction in remote costs, but fails to recover the performance of the frontier model. We identify two key limitations of this protocol: the local model struggles to (1) follow the remote model's multi-step instructions and (2) reason over long contexts. Motivated by these observations, we study an extension of this protocol, coined Minions, in which the remote model decomposes the task into easier subtasks over shorter chunks of the document, which are executed in parallel locally. Minions reduces costs by 5.7× on average while recovering 97.9% of the performance of the remote model alone. Our analysis reveals several key design choices that influence the trade-off between cost and performance in local-remote systems.
Paperid:605
Authors:Tommaso Mencattini · Adrian Robert Minut · Donato Crisostomi · Andrea Santilli · Emanuele Rodola
Abstract:
Evolutionary model merging enables the creation of high-performing multi-task models but remains computationally prohibitive for consumer hardware. We introduce MERGE$^3$, an efficient framework that makes evolutionary merging feasible on a single GPU by reducing fitness computation costs by 50× while preserving performance. MERGE$^3$ achieves this by **E**xtracting a reduced dataset for evaluation, **E**stimating model abilities using Item Response Theory (IRT), and **E**volving optimal merges via IRT-based performance estimators. Our method enables state-of-the-art multilingual and cross-lingual merging, transferring knowledge across languages with significantly lower computational overhead. We provide theoretical guarantees and an open-source library, democratizing high-quality model merging.
Paperid:606
Authors:Saksham Rastogi · Pratyush Maini · Danish Pruthi
Abstract:
Given how large parts of the publicly available text are crawled to pretrain large language models (LLMs), creators increasingly worry about the inclusion of their proprietary data for model training without attribution or licensing. Their concerns are also shared by benchmark curators whose test sets might be compromised. In this paper, we present STAMP, a framework for detecting dataset membership—i.e., determining the inclusion of a dataset in the pretraining corpora of LLMs. Given an original piece of content, our proposal involves generating multiple watermarked rephrases such that a distinct watermark is embedded in each rephrasing. One version is released publicly while others are kept private. Subsequently, creators can compare model likelihoods between public and private versions using paired statistical tests to prove membership. We show that our framework can successfully detect contamination across four benchmarks which appear only once in the training data and constitute less than 0.001% of the total tokens, outperforming several contamination detection and dataset inference baselines. We verify that our approach preserves both the semantic meaning and the utility of benchmarks in comparing different models. We apply STAMP to two real-world scenarios to confirm the inclusion of paper abstracts and blog articles in the pretraining corpora.
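The paired-test idea can be illustrated generically; STAMP's exact statistic may differ, and the inputs here (per-document log-likelihoods for the public and private rephrases) are hypothetical. Under the null of no membership, the public/private labels are exchangeable within each pair, which a sign-flip permutation test exploits:

```python
import numpy as np

def paired_permutation_pvalue(public_ll, private_ll, n_perm=10000, seed=0):
    """One-sided sign-flip permutation test on paired log-likelihoods.
    A small p-value means the model prefers the public (released) versions,
    which is evidence they appeared in its pretraining corpus."""
    rng = np.random.default_rng(seed)
    d = np.asarray(public_ll) - np.asarray(private_ll)
    observed = d.mean()
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null = (signs * d).mean(axis=1)  # null distribution of the mean difference
    return float((null >= observed).mean())
```

When every public version is consistently more likely than its private counterpart, the p-value is tiny; when the differences are symmetric around zero, the test does not reject.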
Paperid:607
Authors:Huadai Liu · Tianyi Luo · Qikai Jiang · Kaicheng Luo · Peiwen Sun · Jialei Wang · Rongjie Huang · Qian Chen · Wen Wang · Xiangtai Li · ShiLiang Zhang · Zhijie Yan · Zhou Zhao · Wei Xue
Abstract:
Traditional video-to-audio generation techniques primarily focus on field-of-view (FoV) video and non-spatial audio, often missing the spatial cues necessary for accurately representing sound sources in 3D environments. To address this limitation, we introduce a novel task, \textbf{360V2SA}, to generate spatial audio from 360-degree videos, specifically producing First-order Ambisonics (FOA) audio—a standard format for representing 3D spatial audio that captures sound directionality and enables realistic 3D audio reproduction. We first create \textbf{Sphere360}, a novel dataset tailored for this task that is curated from real-world data. We also design an efficient semi-automated pipeline for collecting and cleaning paired video-audio data. To generate spatial audio from 360-degree video, we propose a novel framework \textbf{AudioSpace}, which leverages self-supervised pre-training using both spatial audio data (in FOA format) and large-scale non-spatial data. Furthermore, AudioSpace features a dual-branch framework that utilizes both panoramic and FoV video inputs to capture comprehensive local and global information from 360-degree videos. Experimental results demonstrate that AudioSpace achieves state-of-the-art performance across both objective and subjective metrics on Sphere360. Code and datasets will be released. The demo page is available at~\url{https://AudioSpace-360V2SA.github.io}.
Paperid:608
Authors:Qi Xu · Junyang Zhu · Dongdong Zhou · Hao Chen · Yang Liu · Jiangrong Shen · Qiang Zhang
Abstract:
Deep neural networks (DNNs) excel in computer vision tasks, especially few-shot learning (FSL), which is increasingly important for generalizing from limited examples. However, DNNs are computationally expensive and face scalability issues in the real world. Spiking Neural Networks (SNNs), with their event-driven nature and low energy consumption, are particularly efficient in processing sparse and dynamic data, though they still encounter difficulties in capturing complex spatiotemporal features and performing accurate cross-class comparisons. To further enhance the performance and efficiency of SNNs in few-shot learning, we propose a few-shot learning framework based on SNNs, which combines a self-feature extractor module and a cross-feature contrastive module to refine feature representation and reduce power consumption. We apply the combination of temporal efficient training loss and InfoNCE loss to optimize the temporal dynamics of spike trains and enhance the discriminative power. Experimental results show that the proposed FSL-SNN significantly improves the classification performance on the neuromorphic dataset N-Omniglot, and also achieves performance competitive with ANNs on static datasets such as CUB and miniImageNet with low power consumption.
Paperid:609
Authors:Zhiheng Zhang · Haoxiang Wang · Haoxuan Li · Zhouchen Lin
Abstract:
The estimation of the treatment effect is currently receiving a great deal of attention. Efforts have been made to recover asymptotic properties of the treatment effect from large samples. In practice, however, we often do not have access to the potential outcome of all samples; instead, we can only sample a fraction of the entire population for causal effect detection. In this paper, we consider the problem of treatment effect estimation in the sample-constrained case. We propose a method to select samples for treatment effect estimation with linear sample complexity with respect to $d$.
Paperid:610
Authors:Guangtao Zheng · Wenqian Ye · Aidong Zhang
Abstract:
Deep neural networks often develop spurious bias, i.e., reliance on correlations between non-essential features and classes for predictions. For example, a model may identify objects based on frequently co-occurring backgrounds rather than intrinsic features, resulting in degraded performance on data lacking these correlations. Existing mitigation approaches typically depend on external annotations of spurious correlations, which may be difficult to obtain and may not be relevant to the spurious bias in a model. In this paper, we take a step towards self-guided mitigation of spurious bias by proposing NeuronTune, a post hoc method that directly intervenes in a model’s internal decision process. Our method probes in a model's latent embedding space to identify and regulate neurons that lead to spurious prediction behaviors. We theoretically justify our approach and show that it brings the model closer to an unbiased one. Unlike previous methods, NeuronTune operates without requiring spurious correlation annotations, making it a practical and effective tool for improving model robustness. Experiments across different architectures and data modalities demonstrate that our method significantly mitigates spurious bias in a self-guided way.
Paperid:611
Authors:Nils Palumbo · Ravi Mangal · Zifan Wang · Saranya Vijayakumar · Corina Pasareanu · Somesh Jha
Abstract:
Mechanistic interpretability aims to reverse engineer the computation performed by a neural network in terms of its internal components. Although there is a growing body of research on mechanistic interpretation of neural networks, the notion of a mechanistic interpretation itself is often ad hoc. Inspired by the notion of abstract interpretation from the program analysis literature, which aims to develop approximate semantics for programs, we give a set of axioms that formally characterize a mechanistic interpretation as a description that approximately captures the semantics of the neural network under analysis in a compositional manner. We demonstrate the applicability of these axioms for validating mechanistic interpretations on an existing, well-known interpretability study as well as on a new case study involving a Transformer-based model trained to solve the well-known 2-SAT problem.
Paperid:612
Authors:Etienne Boursier · Nicolas Flammarion
Abstract:
Understanding generalization of overparametrized models remains a fundamental challenge in machine learning. The literature mostly studies generalization from an interpolation point of view, taking convergence towards a global minimum of the training loss for granted. This interpolation paradigm does not seem valid for complex tasks such as in-context learning or diffusion. It has instead been empirically observed that the trained models go from global minima to spurious local minima of the training loss as the number of training samples becomes larger than some level we call the optimization threshold. This paper explores theoretically this phenomenon in the context of two-layer ReLU networks. We demonstrate that, despite overparametrization, networks might converge towards simpler solutions rather than interpolating training data, which leads to a drastic improvement on the test loss. Our analysis relies on the so-called early alignment phase, during which neurons align toward specific directions. This directional alignment leads to a simplicity bias, wherein the network approximates the ground truth model without converging to the global minimum of the training loss. Our results suggest this bias, resulting in an optimization threshold from which interpolation is not reached anymore, is beneficial and enhances the generalization of trained models.
Paperid:613
Authors:Yuri Gardinazzi · Giada Panerai · Karthik Viswanathan · Alessio Ansuini · Alberto Cazzaniga · Matteo Biagetti
Abstract:
Understanding the decision-making processes of large language models is critical given their widespread applications. To achieve this, we aim to connect a formal mathematical framework—zigzag persistence from topological data analysis—with practical and easily applicable algorithms. Zigzag persistence is particularly effective for characterizing data as it dynamically transforms across model layers. Within this framework, we introduce topological descriptors that measure how topological features, $p$-dimensional holes, persist and evolve throughout the layers. Unlike methods that assess each layer individually and then aggregate the results, our approach directly tracks the full evolutionary path of these features. This offers a statistical perspective on how prompts are rearranged and their relative positions changed in the representation space, providing insights into the system’s operation as an integrated whole. To demonstrate the expressivity and applicability of our framework, we highlight how sensitive these descriptors are to different models and a variety of datasets. As a showcase application to a downstream task, we use zigzag persistence to establish a criterion for layer pruning, achieving results comparable to state-of-the-art methods while preserving the system-level perspective.
Paperid:614
Authors:Raanan Yehezkel Rohekar · Yaniv Gurwicz · Sungduk Yu · Estelle Aflalo Guez · Vasudev Lal
Abstract:
Do generative pre-trained transformer (GPT) models, trained only to predict the next token, implicitly learn a world model from which a sequence is generated one token at a time? We examine this question by deriving a causal interpretation of the attention mechanism in GPT, suggesting a causal world model that arises from this interpretation. Furthermore, we propose that GPT models, at inference time, can be utilized for zero-shot causal structure learning for input sequences, and present a confidence score. Empirical evaluation is conducted in a controlled environment using the setup and rules of the Othello and Chess strategy games. A GPT, pre-trained on real-world games played with the intention of winning, is tested on out-of-distribution synthetic data consisting of sequences of random legal moves. We find that the GPT model is likely to generate legal next moves for out-of-distribution sequences for which a causal structure is encoded in the attention mechanism with high confidence. In cases where the GPT model generates illegal moves, it also fails to capture any causal structure.
Paperid:615
Authors:Yulun Wu · Doron Bergman
Abstract:
We present an Adversarially Pretrained Transformer (APT) that is able to perform zero-shot meta-learning on tabular prediction tasks without using any real-world dataset to pre-train the model, building on the recent development of Prior-Data Fitted Networks (PFNs) and TabPFN. Specifically, APT is pre-trained with adversarial synthetic data agents, which continually shift their underlying data-generating distribution and deliberately challenge the model with different synthetic datasets. In addition, we propose a mixture block model architecture that is able to handle classification tasks with an arbitrary number of classes, addressing the class size limitation -- a crucial weakness of prior tabular zero-shot learning algorithms. In experiments, we show that our framework matches state-of-the-art performance on small tabular classification tasks without filtering on dataset characteristics such as number of classes and number of missing values, while maintaining an average runtime under one second. On common benchmark dataset suites in both classification and regression, we show that adversarial pre-training was able to enhance TabPFN's performance. In our analysis, we demonstrate that the adversarial synthetic data agents were able to generate a more diverse collection of data compared to the ordinary random generator in TabPFN. In addition, we demonstrate that our mixture block neural design has improved generalizability and greatly accelerated pre-training.
Paperid:616
Authors:Yijun Dong · Yicheng Li · Yunai Li · Jason Lee · Qi Lei
Abstract:
Weak-to-strong (W2S) generalization is a type of fine-tuning (FT) where a strong (large) student model is trained on pseudo-labels generated by a weak teacher. Surprisingly, W2S FT often outperforms the weak teacher. We seek to understand this phenomenon by leveraging the observation that FT often occurs in intrinsically low-dimensional space. Leveraging the low intrinsic dimensionality of FT, we analyze W2S in the ridgeless regression setting from a variance reduction perspective. For a strong student - weak teacher pair with sufficiently expressive low-dimensional feature subspaces $\mathcal{V}_s, \mathcal{V}_w$, we provide an exact characterization of the variance that dominates the generalization error of W2S. This unveils a surprising virtue of discrepancy between the strong and weak models in W2S: the variance of the weak teacher is inherited by the strong student in $\mathcal{V}_s \cap \mathcal{V}_w$, while reduced by a factor of $\dim(\mathcal{V}_s)/N$ in the subspace of discrepancy $\mathcal{V}_w \setminus \mathcal{V}_s$ with $N$ pseudo-labels for W2S. Further, our analysis casts light on the sample complexities and the scaling of performance gap recovery in W2S. The analysis is supported with experiments on both synthetic regression problems and real image classification tasks.
Paperid:617
Authors:Quentin Bertrand · Juan Duque · Emilio Calvano · Gauthier Gidel
Abstract:
The deployment of machine learning systems in the market economy has triggered academic and institutional fears over potential tacit collusion between fully automated agents. Multiple recent economics studies have empirically shown the emergence of collusive strategies from agents guided by machine learning algorithms. In this work, we prove that multi-agent $Q$-learners playing the iterated prisoner's dilemma can learn to collude. The complexity of the cooperative multi-agent setting yields multiple fixed-point policies for $Q$-learning: the main technical contribution of this work is to characterize the convergence towards a specific cooperative policy. More precisely, in the iterated prisoner's dilemma, we show that with optimistic $Q$-values, any self-play $Q$-learner can provably learn a cooperative policy called Pavlov, also known as the win-stay, lose-shift policy, which strongly differs from the vanilla Pareto-dominated always-defect policy.
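The Pavlov (win-stay, lose-shift) policy referenced above is easy to simulate directly. This sketch shows the fixed policy itself in self-play, not the $Q$-learning dynamics the paper analyzes; the payoff values are the standard prisoner's dilemma choices (an assumption, not taken from the paper):

```python
def pavlov(last_own, last_payoff, good=3):
    """Win-stay, lose-shift: repeat the last move if the payoff was good
    (>= `good`), otherwise switch to the other move."""
    if last_payoff >= good:
        return last_own
    return "C" if last_own == "D" else "D"

def play_ipd(rounds=10, start=("D", "D")):
    # Standard PD payoffs: R=3 (both cooperate), T=5 / S=0 (one defects),
    # P=1 (both defect); rewarding outcomes are CC for both, or a successful
    # defection for the defector.
    payoff = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
              ("D", "C"): (5, 0), ("D", "D"): (1, 1)}
    a, b = start
    history = [start]
    for _ in range(rounds):
        pa, pb = payoff[(a, b)]
        a, b = pavlov(a, pa), pavlov(b, pb)
        history.append((a, b))
    return history
```

Starting from mutual defection, both agents receive the low payoff, shift to cooperation, then stay there forever: exactly the collusive outcome that distinguishes Pavlov from the Pareto-dominated always-defect policy.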
Paperid:618
Authors:Yongqiang Yao · Jingru Tan · Feizhao Zhang · Jiahao Hu · Yazhe Niu · Bo Li · JinXin · Ruihao Gong · Pengfei Liu · Dahua Lin · Ningyi Xu
Abstract:
Vision-language instruct-tuning models have recently made significant progress due to their more comprehensive understanding of the world. In this work, we discover that large-scale 3D parallel training on those models leads to an imbalanced computation load across different devices. The vision and language parts are inherently heterogeneous: their data distribution and model architecture differ significantly, which affects distributed training efficiency. To address this issue, we rebalance the computational load from data, model, and memory perspectives, achieving more balanced computation across devices. Specifically, for the data, instances are grouped into new balanced mini-batches within and across devices. A search-based method is employed for the model to achieve a more balanced partitioning. For memory optimization, we adaptively adjust the re-computation strategy for each partition to utilize the available memory fully. These three perspectives are not independent but are closely connected, forming an omniverse balanced training framework. Extensive experiments are conducted to validate the effectiveness of our method. Compared with the open-source training code of InternVL-Chat, training time is reduced greatly, achieving about 1.8x speed-up. Our method's efficacy and generalizability are further validated across various models and datasets. Codes will be released at https://anonymous.4open.science/r/omnibal_example-E4D7.
Paperid:619
Authors:Jiaming Ma · Binwu Wang · Pengkun Wang · Zhengyang Zhou · Xu Wang · Yang Wang
Abstract:
Recently, spatiotemporal graph convolutional networks have achieved dominant performance in spatiotemporal prediction tasks. However, most models relying on node-to-node messaging interaction exhibit sensitivity to spatiotemporal shifts, encountering out-of-distribution (OOD) challenges. To address these issues, we introduce the Spatio-Temporal OOD Processor (STOP), which employs a centralized messaging mechanism alongside a message perturbation mechanism to facilitate robust spatiotemporal interactions. Specifically, the centralized messaging mechanism integrates Context-Aware Units for coarse-grained spatiotemporal feature interactions with nodes, effectively blocking traditional node-to-node messages. We also implement a message perturbation mechanism to disrupt this messaging process, compelling the model to extract generalizable contextual features from generated variant environments. Finally, we customize a spatiotemporal distributionally robust optimization approach that exposes the model to challenging environments, thereby further enhancing its generalization capabilities. Compared with 14 baselines across six datasets, STOP achieves up to 17.01% improvement in generalization performance and 18.44% improvement in inductive learning performance.
Paperid:620
Authors:Eric Frankel · Kshitij Kulkarni · Dmitriy Drusvyatskiy · Sewoong Oh · Lillian Ratliff
Abstract:
Decision-makers often adaptively influence downstream competitive agents' behavior to minimize their cost, yet in doing so face critical challenges: $(i)$ decision-makers might not \emph{a priori} know the agents' objectives; $(ii)$ agents might \emph{learn} their response, introducing stochasticity into the decision-making process; and $(iii)$ there may be additional non-strategic environmental stochasticity. Characterizing convergence of this complex system is contingent on how the decision-maker accounts for the induced drift and additional noise from the induced learning agent behavior and decision-dependence from environmental stochasticity. To understand how the learning agents' behavior is influenced by the decision-maker's actions, we first consider a decision-maker that deploys an arbitrary sequence of actions which induce a sequence of games and corresponding equilibria. We characterize how the drift and noise in the agents' stochastic algorithms decouple from the optimization error. Leveraging this decoupling and accompanying efficiency estimates, we design decision-maker algorithms that control the induced drift relative to the agent noise. This enables efficient tracking of game-theoretic equilibrium concepts. To this end, we establish a hierarchy of reasonable interaction models that allow progressively more gradient information, and give convergence guarantees.
Paperid:621
Authors:Fan Shi · Bin Li · Xiangyang Xue
Abstract:
Abstract visual reasoning (AVR) enables humans to discover and quickly generalize abstract rules to new scenarios. In the field of artificial intelligence, designing intelligent models with human-like AVR abilities has been a long-standing research topic. Numerous deep AVR solvers have recently achieved remarkable success across various AVR tasks. However, these solvers usually require task-specific designs or training parameters for different reasoning tasks. In this design paradigm, solving a new task often means retraining the model, and sometimes retuning the training parameters or model architectures, thereby increasing the cost of solving problems. In contrast to task-specific approaches, this paper proposes UCGS, a novel framework for abstract visual reasoning that aims to address multiple types of reasoning problems in a unified manner. We prove that some well-known AVR problems (e.g., Raven's Progressive Matrices, Visual Analogy Problems, and Odd-One-Out) can be solved by estimating the conditional probability distribution of the target images within problem panels. By training a unified conditional generative model, we can solve various AVR problems without retraining. The experiments show that with a single round of multi-task training, UCGS can successfully solve various AVR tasks. Further experiments reveal that UCGS exhibits the ability of zero-shot reasoning, enabling it to solve unseen types of AVR problems in the testing phase.
Paperid:622
Authors:Haeseong Yang · Jourdain Lamperski · Oleg Prokopyev
Abstract:
We consider the *max-min eigenvalue augmentation* problem: given $n \times n$ symmetric positive semidefinite matrices $M,A_1,\ldots, A_m$ and a positive integer $k < m$, the goal is to choose a subset $I \subset \{1,\ldots,m\}$ of cardinality at most $k$ that maximizes the minimum eigenvalue of the matrix $M + \sum_{i \in I} A_i$. The problem captures both the *Bayesian E-optimal design* and *maximum algebraic connectivity augmentation* problems. In contrast to the existing work, we do not assume that the *augmentation matrices* are rank-one matrices, and we focus on the setting in which $k < n$. We show that a *simple* randomized rounding method provides a constant-factor approximation if the *optimal increase* is sufficiently large, specifically, if $\mathrm{OPT} - \lambda_{\mathrm{min}}(M) = \Omega(R \ln k)$, where $\mathrm{OPT}$ is the optimal value, and $R$ is the maximum trace of an augmentation matrix. To establish the guarantee, we derive a matrix concentration inequality that is of independent interest. The inequality can be interpreted as an *intrinsic dimension* analog of the matrix Chernoff inequality for the minimum eigenvalue of a sum of independent random positive semidefinite matrices; such an inequality has already been established for the maximum eigenvalue, but not for the minimum eigenvalue.
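To make the objective concrete, here is an illustrative NumPy sketch (not the paper's algorithm or guarantee): evaluating $\lambda_{\mathrm{min}}(M + \sum_{i \in I} A_i)$ on toy PSD matrices, with a naive randomized-rounding baseline. The sizes and the uniform inclusion probability are assumptions for illustration only.

```python
import numpy as np

# Toy instance of max-min eigenvalue augmentation (sizes are assumptions).
rng = np.random.default_rng(0)
n, m, k = 4, 6, 2

def random_psd(n):
    B = rng.standard_normal((n, n))
    return B @ B.T  # PSD by construction

M = random_psd(n)
A = [random_psd(n) for _ in range(m)]

def objective(I):
    """Minimum eigenvalue of M + sum_{i in I} A_i (eigvalsh sorts ascending)."""
    S = M + sum(A[i] for i in I)
    return np.linalg.eigvalsh(S)[0]

# Naive randomized rounding: include each index with probability k/m, keep the
# best of a few trials. A stand-in baseline, not the paper's rounding scheme.
best_I, best_val = frozenset(), objective([])
for _ in range(50):
    I = [i for i in range(m) if rng.random() < k / m][:k]
    v = objective(I)
    if v > best_val:
        best_I, best_val = frozenset(I), v
```

Since each $A_i$ is PSD, adding matrices can only increase (or keep) the minimum eigenvalue, so any rounded subset scores at least $\lambda_{\mathrm{min}}(M)$.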
Paperid:623
Authors:Zhengzhao Pan · Xiaogang Zhang · Hua Chen
Abstract:
Unrestricted adversarial examples (UAEs) have posed greater threats to deep neural networks (DNNs) than perturbation-based adversarial examples (AEs) because they can make extensive changes to images without being restricted by a fixed-norm perturbation budget. Although current diffusion-based methods can generate more natural UAEs than other unrestricted attack methods, the overall effectiveness of such methods is restricted since they are designed for specific attack conditions. Additionally, the naturalness of UAEs still has room for improvement, as these methods primarily focus on leveraging diffusion models as strong priors to enhance the generation process. This paper proposes a flexible framework named Diffusion-based Adversarial Maximum a Posterior (DiffAdvMAP) to generate more natural UAEs for various scenarios. DiffAdvMAP approaches the generation of UAEs by sampling images from posterior distributions, which is achieved by approximating the posterior distribution of UAEs using the prior distribution of real data learned by the diffusion model. This process enhances the naturalness of the UAEs. By incorporating an adversarial constraint to ensure the effectiveness of the attack, DiffAdvMAP exhibits excellent attack ability and defense robustness. A reconstruction constraint is designed to enhance its flexibility, which allows DiffAdvMAP to be tailored to various attack scenarios. Experimental results on ImageNet show that we achieve a better trade-off between image quality, flexibility, and transferability than baseline unrestricted adversarial attack methods.
Paperid:624
Authors:Shuo Han · German Espinosa · Junda Huang · Malcolm MacIver · Daniel A. Dombeck · Bradly Stadie
Abstract:
Recent advances in reinforcement learning (RL) have demonstrated impressive capabilities in complex decisionmaking tasks. This progress raises a natural question: how do these artificial systems compare to biological agents, which have been shaped by millions of years of evolution? To help answer this question, we undertake a comparative study of biological mice and RL agents in a predator-avoidance maze environment. Through this analysis, we identify a striking disparity: RL agents consistently demonstrate a lack of self-preservation instinct, readily risking ``death'' for marginal efficiency gains. These risk-taking strategies are in contrast to biological agents, which exhibit sophisticated risk-assessment and avoidance behaviors. Towards bridging this gap between the biological and artificial, we propose two novel mechanisms that encourage more naturalistic risk-avoidance behaviors in RL agents. Our approach leads to the emergence of naturalistic behaviors, including strategic environment assessment, cautious path planning, and predator avoidance patterns that closely mirror those observed in biological systems. Empirical results show that the new RL agents trained with our framework demonstrate behavioral patterns significantly closer to biological mice while maintaining task performance.
Paperid:625
Authors:André Duarte · Xuandong Zhao · Arlindo Oliveira · Lei Li
Abstract:
How can we verify whether copyrighted content was used to train a large vision-language model (VLM) without direct access to its training data? Motivated by the hypothesis that a VLM is able to recognize images from its training corpus, we propose DIS-CO, a novel approach to infer the inclusion of copyrighted content during the model's development. By repeatedly querying a VLM with specific frames from targeted copyrighted material, DIS-CO extracts the content's identity through free-form text completions. To assess its effectiveness, we introduce MovieTection, a benchmark comprising 14,000 frames paired with detailed captions, drawn from films released both before and after a model’s training cutoff. Our results show that DIS-CO significantly improves detection performance, nearly doubling the average AUC of the best prior method on models with logits available. Our findings also highlight a broader concern: all tested models appear to have been exposed to some extent to copyrighted content. We provide the code in the supplementary materials.
Paperid:626
Authors:Yaolong Yu · Fan Yao · Sinno Jialin Pan
Abstract:
We employ a game-theoretic framework to study the impact of a specific strategic behavior among creators---group behavior---on recommendation systems. In this setting, creators within a group collaborate to maximize their collective utility. We show that group behavior has a limited effect on the game equilibrium when the group size is small. However, when the group size is large, group behavior can significantly alter content distribution and user welfare. Specifically, in a top-$K$ recommendation system with exposure-based rewards, we demonstrate that user welfare can suffer a significant loss due to group strategies, and user welfare does not necessarily increase with larger values of $K$ or more random matching, contrasting sharply with the individual creator case. Furthermore, we investigate user welfare guarantees through the lens of the Price of Anarchy (PoA). In the general case, we establish a negative result on the bound of PoA with exposure rewards, proving that it can be arbitrarily large. We then investigate a user engagement rewarding mechanism, which mitigates the issues caused by large group behavior, showing that $\text{PoA}\leq K+1$ in the general case and $\text{PoA}\leq 2$ in the binary case. Empirical results from simulations further support the effectiveness of the user engagement rewarding mechanism.
Paperid:627
Authors:Tuan Truong · Quyen Tran · Ngọc Quân Phạm · Dinh Phung · Nhat Ho · Trung Le
Abstract:
We introduce Flat Hilbert Bayesian Inference (FHBI), an algorithm designed to enhance generalization in Bayesian inference. Our approach involves an iterative two-step procedure with an adversarial functional perturbation step and a functional descent step within the reproducing kernel Hilbert spaces. This methodology is supported by a theoretical analysis that extends previous findings on generalization ability from finite-dimensional Euclidean spaces to infinite-dimensional functional spaces. To evaluate the effectiveness of FHBI, we conduct comprehensive comparisons against nine baseline methods on the VTAB-1K benchmark, which encompasses 19 diverse datasets across various domains with diverse semantics. Empirical results demonstrate that FHBI consistently outperforms the baselines by notable margins, highlighting its practical efficacy. Our code is available at https://anonymous.4open.science/r/Flat-Hilbert-Variational-Inference-008F/.
Paperid:628
Authors:Peiyuan Liu · Beiliang Wu · Yifan Hu · Naiqi Li · Tao Dai · Jigang Bao · Shutao Xia
Abstract:
Non-stationarity poses significant challenges for multivariate time series forecasting due to the inherent short-term fluctuations and long-term trends that can lead to spurious regressions or obscure essential long-term relationships. Most existing methods either eliminate or retain non-stationarity without adequately addressing its distinct impacts on short-term and long-term modeling. Eliminating non-stationarity is essential for avoiding spurious regressions and capturing local dependencies in short-term modeling, while preserving it is crucial for revealing long-term cointegration across variates. In this paper, we propose TimeBridge, a novel framework designed to bridge the gap between non-stationarity and dependency modeling in long-term time series forecasting. By segmenting input series into smaller patches, TimeBridge applies Integrated Attention to mitigate short-term non-stationarity and capture stable dependencies within each variate, while Cointegrated Attention preserves non-stationarity to model long-term cointegration across variates. Extensive experiments show that TimeBridge consistently achieves state-of-the-art performance in both short-term and long-term forecasting. Additionally, TimeBridge demonstrates exceptional performance in financial forecasting on the CSI 500 and S&P 500 indices, further validating its robustness and effectiveness. Code is available in the supplementary material.
Paperid:629
Authors:James Rowbottom · Georg Maierhofer · Teo Deveney · Eike Müller · Alberto Paganini · Katharina Schratz · Pietro Lió · Carola-Bibiane Schönlieb · Chris Budd
Abstract:
We present a novel and effective approach to achieve optimal mesh relocation in finite element methods (FEMs). The cost and accuracy of FEMs are critically dependent on the choice of mesh points. Mesh relocation (r-adaptivity) seeks to optimise the mesh geometry to obtain the best solution accuracy at a given computational budget. Classical r-adaptivity relies on the solution of a separate nonlinear ``meshing'' PDE to determine mesh point locations. This incurs significant cost at remeshing, and relies on estimates that relate interpolation- and FEM-error. Recent machine learning approaches have focused on the construction of fast surrogates for such classical methods. Instead, our new approach trains a graph neural network (GNN) to determine mesh point locations by directly minimising the FE solution error, computed with the Firedrake PDE system, to achieve higher solution accuracy. Our GNN architecture closely aligns the mesh solution space to that of classical meshing methodologies, thus replacing classical estimates for optimality with a learnable strategy. This allows for rapid and robust training and results in an extremely efficient and effective GNN approach to online r-adaptivity. Our method outperforms both classical, and prior ML, approaches to r-adaptive meshing. In particular, it achieves lower FE solution error, whilst retaining the significant speed-up over classical methods observed in prior ML work.
Paperid:630
Authors:Jian Chen · Zhehao Li · Xiaojie Mao
Abstract:
We study the problem of classification with selectively labeled data, whose distribution may differ from the full population due to historical decision-making. We exploit the fact that in many applications historical decisions were made by multiple decision-makers, each with different decision rules. We analyze this setup under a principled instrumental variable (IV) framework and rigorously study the identification of classification risk. We establish conditions for the exact identification of classification risk and derive tight partial identification bounds when exact identification fails. We further propose a unified cost-sensitive learning (UCL) approach to learn classifiers robust to selection bias in both identification settings. Finally, we theoretically and numerically validate the efficacy of our proposed method.
Paperid:631
Authors:Anish Acharya · Sujay Sanghavi · Alex Dimakis · Inderjit Dhillon
Abstract:
Data pruning, the combinatorial task of selecting a small and informative subset from a large dataset, is crucial for mitigating the enormous computational costs associated with training data-hungry modern deep learning models at scale. Since large-scale data collections are invariably noisy, developing data pruning strategies that remain robust even in the presence of corruption is critical in practice. In response, we propose Geometric Median Matching -- an iterative greedy approach -- that {\em yields a $k$-subset such that the mean of the subset approximates the geometric median of the (potentially) noisy dataset}. Theoretically, we show that Geometric Median Matching enjoys an improved $O(1/k)$ scaling over the $O(1/\sqrt{k})$ scaling of uniform sampling, while achieving the optimal breakdown point of 1/2 even under arbitrary corruption. Extensive experiments across classification and generation benchmarks indicate that Geometric Median Matching consistently outperforms the prior state of the art; the gains become more pronounced at high rates of corruption and aggressive pruning rates, making it a strong baseline for robust data pruning.
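A sketch of the two ingredients behind Geometric Median Matching, as we read the abstract: a geometric median of the (corrupted) dataset, here computed with Weiszfeld iterations, and a greedy $k$-subset whose mean tracks that median. This is an illustrative reconstruction under our own assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
X[:10] += 50.0  # gross corruption; the geometric median is robust to it

def geometric_median(X, iters=100, eps=1e-8):
    """Weiszfeld's algorithm: reweighted mean with inverse-distance weights."""
    y = X.mean(axis=0)
    for _ in range(iters):
        d = np.linalg.norm(X - y, axis=1).clip(min=eps)
        y = (X / d[:, None]).sum(axis=0) / (1.0 / d).sum()
    return y

gm = geometric_median(X)

def greedy_match(X, target, k):
    """Greedily pick points so the running mean approaches `target`."""
    chosen, total = [], np.zeros_like(target)
    mask = np.ones(len(X), dtype=bool)
    for t in range(1, k + 1):
        # candidate mean after adding each point; pick the closest to target
        gains = np.linalg.norm((total + X) / t - target, axis=1)
        gains[~mask] = np.inf
        i = int(np.argmin(gains))
        chosen.append(i); mask[i] = False; total += X[i]
    return chosen

subset = greedy_match(X, gm, k=20)
```

The subset mean lands far closer to the geometric median than the full-data mean, which is dragged away by the corrupted points.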
Paperid:632
Authors:Linqi (Alex) Zhou · Stefano Ermon · Jiaming Song
Abstract:
Diffusion models and Flow Matching generate high-quality samples but are slow at inference, and distilling them into few-step models often leads to instability and extensive tuning. To resolve these trade-offs, we propose Moment Matching Self-Distillation (MMSD), a new class of generative models for one- or few-step sampling with a single-stage training procedure. Unlike distillation, MMSD does not require pre-training initialization and optimization of two networks; and unlike Consistency Models, MMSD guarantees distribution-level convergence and remains stable under various hyperparameters and standard model architectures. MMSD surpasses diffusion models on ImageNet-256x256 with 2.13 FID using only 8 inference steps and achieves state-of-the-art 2-step FID of 2.05 on CIFAR-10 for a model trained from scratch.
Paperid:633
Authors:Alexander Bukharin · Yixiao Li · Pengcheng He · Tuo Zhao
Abstract:
Reward design is a fundamental, yet challenging, aspect of reinforcement learning (RL). Researchers typically utilize feedback signals from the environment to handcraft a reward function, but this process is not always effective due to the varying scale and intricate dependencies of the feedback signals. This paper shows that, by exploiting certain structures, one can ease the reward design process. Specifically, we propose a hierarchical reward design framework, HERON, for two scenarios: (I) the feedback signals naturally present a hierarchy; (II) the reward is sparse, but less important surrogate feedback is available to help policy learning. Both scenarios allow us to design a hierarchical decision tree, induced by the importance ranking of the feedback signals, to compare RL trajectories. With such preference data, we can then train a reward model for policy learning. We apply HERON to several RL applications, and we find that our framework can not only train high-performing agents on a variety of difficult tasks, but also provide additional benefits such as improved sample efficiency and robustness.
Paperid:634
Authors:Long-Kai Huang · Rongyi Zhu · Bing He · Jianhua Yao
Abstract:
Protein Language Models (PLMs), pretrained on extensive evolutionary data from natural proteins, have emerged as indispensable tools for protein design. While powerful, PLMs often struggle to produce proteins with precisely specified functionalities or properties due to inherent challenges in controlling their outputs. In this work, we investigate the potential of Activation Steering - a technique originally developed for controlling text generation in Large Language Models (LLMs) - to direct PLMs toward generating protein sequences with targeted properties. We introduce a simple yet effective method that employs activation editing to steer PLM outputs, and extend this approach to protein optimization through a novel editing site identification module. Through comprehensive experiments on both auto-encoder and autoregressive PLMs focusing on lysozyme-like sequence generation and optimization, we demonstrate that our methods can be seamlessly integrated into existing PLMs without requiring additional training. This work establishes a new paradigm for precise protein engineering using foundation models.
Paperid:635
Authors:Lena Stempfle · Anton Matsson · Newton Mwai · Fredrik Johansson
Abstract:
Handling missing values at test time is challenging for machine learning models, especially when aiming for both high accuracy and interpretability. Established approaches often add bias through imputation or excessive model complexity via missingness indicators. Moreover, either method can obscure interpretability, making it harder to understand how the model utilizes the observed variables in predictions. We propose missingness-avoiding (MA) machine learning, a general framework for training models to rarely require the values of missing (or imputed) features at test time. We create tailored MA learning algorithms for decision trees, tree ensembles, and sparse linear models by incorporating classifier-specific regularization terms in their learning objectives. The tree-based models leverage contextual missingness by reducing reliance on missing values based on the observed context. Experiments on real-world datasets demonstrate that MA-DT, MA-LASSO, MA-RF, and MA-GBT effectively reduce the reliance on features with missing values while maintaining predictive performance competitive with their unregularized counterparts. This shows that our framework gives practitioners a powerful tool to maintain interpretability in predictions with test-time missing values.
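As we read the abstract, the sparse-linear instantiation penalizes reliance on frequently missing features. Here is a hypothetical sketch using a missingness-weighted lasso penalty solved by proximal gradient descent; the penalty form, rates, and data are our illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

# Toy regression where only features 0 and 1 carry signal; features 2-4 are
# often missing at test time (missingness rates are assumptions).
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(n)
miss_rate = np.array([0.0, 0.0, 0.6, 0.6, 0.6])

lam, step = 0.5, 1e-2
w = np.zeros(d)
for _ in range(2000):
    grad = X.T @ (X @ w - y) / n          # least-squares gradient
    w -= step * grad
    # proximal step: per-feature soft-threshold scaled by missingness rate,
    # so frequently-missing features are pushed toward zero weight
    thresh = step * lam * miss_rate
    w = np.sign(w) * np.maximum(np.abs(w) - thresh, 0.0)
```

The fitted model keeps the always-observed features near their true coefficients while zeroing out the frequently missing ones, so predictions rarely need a missing value.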
Paperid:636
Authors:Min Chen · Guansong Pang · Wenjun Wang · Cheng Yan
Abstract:
Spatial-temporal forecasting (STF) plays a pivotal role in urban planning and computing. Spatial-Temporal Graph Neural Networks (STGNNs) excel in modeling spatial-temporal dynamics, thus being robust against noise perturbation. However, they often suffer from relatively poor computational efficiency. Simplifying the architectures can speed up these methods, but it also weakens the robustness with respect to noise interference. In this study, we aim to investigate the problem: can simple neural networks such as Multi-Layer Perceptrons (MLPs) achieve robust spatial-temporal forecasting yet still be efficient? To this end, we first disclose the dual noise effect behind the spatial-temporal data noise, and propose a theoretically grounded principle termed the Robust Spatial-Temporal Information Bottleneck (RSTIB) principle, which preserves wide potential for enhancing the robustness of different types of models. We then meticulously design an implementation, termed RSTIB-MLP, along with a new training regime incorporating a knowledge distillation module, to enhance the robustness of MLPs for STF while maintaining their efficiency. Comprehensive experimental results show that an excellent trade-off between robustness and efficiency can be achieved by RSTIB-MLP compared to state-of-the-art STGNNs and MLP models.
Paperid:637
Authors:Yeonju Ro · Zhenyu Zhang · Souvik Kundu · Zhangyang “Atlas” Wang · Aditya Akella
Abstract:
Large language models (LLMs) excel at capturing global token dependencies via self-attention but face prohibitive compute and memory costs on lengthy inputs. While sub-quadratic methods (e.g., linear attention) can reduce these costs, they often degrade accuracy due to overemphasizing recent tokens. In this work, we first propose dual-state linear attention (DSLA), a novel design that maintains two specialized hidden states—one for preserving historical context and one for tracking recency—thereby mitigating the short-range bias typical of linear-attention architectures. To further balance efficiency and accuracy under dynamic workload conditions, we introduce DSLA-Serve, an online adaptive distillation framework that progressively replaces Transformer layers with DSLA layers at inference time, guided by a sensitivity-based layer ordering. DSLA-Serve uses a chained fine-tuning strategy to ensure that each newly converted DSLA layer remains consistent with previously replaced layers, preserving the overall quality. Extensive evaluations on commonsense reasoning, long-context QA, and text summarization demonstrate that DSLA-Serve yields 2.3× faster inference than Llama2-7B and 3.2× faster than the hybrid Zamba-7B, while retaining comparable performance across downstream tasks. Our ablation studies show that DSLA’s dual states capture both global and local dependencies, addressing the historical-token underrepresentation seen in prior linear attentions.
Paperid:638
Authors:Xi Chen · Yateng Tang · Jiarong Xu · Jiawei Zhang · Siwei Zhang · Sijia Peng · Xuehao Zheng · Yun Xiong
Abstract:
Effectively modeling time information and incorporating it into applications or models involving chronologically occurring events is crucial. Real-world scenarios often involve diverse and complex time patterns, which pose significant challenges for time encoding methods. While previous methods focus on capturing time patterns, many rely on specific inductive biases, such as using trigonometric functions to model periodicity. This narrow focus on single-pattern modeling makes them less effective in handling the diversity and complexity of real-world time patterns. In this paper, we investigate how to improve existing, commonly used time encoding methods and introduce Learnable Transformation-based Generalized Time Encoding (LeTE). We propose using deep function learning techniques to parameterize nonlinear transformations in time encoding, making them learnable and capable of modeling generalized time patterns, including diverse and complex temporal dynamics. By enabling learnable transformations, LeTE encompasses previous methods as specific cases and allows seamless integration into a wide range of tasks. Through extensive experiments across diverse domains, we demonstrate the versatility and effectiveness of LeTE.
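A minimal sketch contrasting the fixed trigonometric inductive bias mentioned above with a learnable nonlinear transformation. The tiny random-weight network stands in for trained parameters and is an assumption for illustration, not the paper's architecture.

```python
import numpy as np

def trig_encode(t, dim=8):
    """Classic periodic encoding: sin/cos at geometrically spaced frequencies."""
    freqs = 1.0 / (10.0 ** np.arange(dim // 2))
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])

# A learnable transformation replaces the fixed sin/cos nonlinearity; here a
# tiny MLP with random (untrained) weights stands in for learned parameters.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 1))
W2 = rng.standard_normal((8, 16))

def learned_encode(t):
    return W2 @ np.tanh(W1 @ np.array([t]))

e_fixed = trig_encode(3.5)     # hard-wired periodic bias
e_learn = learned_encode(3.5)  # shape-compatible, but the nonlinearity is free
```

The fixed encoder can only represent periodic structure, whereas the parameterized map can, after training, adapt to aperiodic or mixed time patterns while producing encodings of the same dimensionality.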
Paperid:639
Authors:Pengxiang Zhao · Xiaoming Yuan
Abstract:
Large Language Models (LLMs) face significant deployment challenges due to their substantial resource requirements. While low-bit quantized weights can reduce memory usage and improve inference efficiency, current hardware lacks native support for mixed-precision General Matrix Multiplication (mpGEMM), resulting in inefficient dequantization-based implementations. Moreover, uniform quantization methods often fail to capture weight distributions adequately, leading to performance degradation. We propose GANQ (GPU-Adaptive Non-Uniform Quantization), a layer-wise post-training non-uniform quantization framework optimized for hardware-efficient lookup-table-based mpGEMM. GANQ achieves superior quantization performance by utilizing a training-free, GPU-adaptive optimization algorithm to efficiently reduce layer-wise quantization errors. Extensive experiments demonstrate GANQ's ability to reduce the perplexity gap from the FP16 baseline compared to state-of-the-art methods for both 3-bit and 4-bit quantization. Furthermore, when deployed on a single NVIDIA RTX 4090 GPU, GANQ's quantized models achieve up to 2.57$\times$ speedup over the baseline, advancing memory and inference efficiency in LLM deployment.
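A minimal sketch of the lookup-table pattern that non-uniform quantization relies on, as described above: each low-bit code indexes a per-row table of learned levels. The shapes, the 3-bit width, and the dequantize-then-GEMM fallback are illustrative assumptions, not GANQ's kernels.

```python
import numpy as np

# Toy lookup-table dequantization for non-uniform quantization.
rng = np.random.default_rng(0)
rows, cols, bits = 4, 8, 3

codes = rng.integers(0, 2**bits, size=(rows, cols))        # quantized weights
tables = np.sort(rng.standard_normal((rows, 2**bits)), 1)  # non-uniform levels

def dequantize(codes, tables):
    """Gather each row's level by its code: W[i, j] = tables[i, codes[i, j]]."""
    return np.take_along_axis(tables, codes, axis=1)

W = dequantize(codes, tables)
x = rng.standard_normal(cols)
y = W @ x  # dequantize-then-GEMM fallback; native mpGEMM would fuse these steps
```

Because the levels are free per row rather than evenly spaced, the table can follow a skewed or heavy-tailed weight distribution that uniform quantization grids cannot.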
Paperid:640
Authors:Pranav Nair · Puranjay Datta · Jeff Dean · Prateek Jain · Aditya Kusupati
Abstract:
Quantizing model weights is critical for reducing the communication and inference costs of large models. However, quantizing models, especially to low precisions like int4 or int2, requires a trade-off in model quality; int2, in particular, is known to severely degrade model quality. Consequently, practitioners are often forced to maintain multiple models with different quantization levels or serve a single model that best satisfies the quality-latency trade-off. On the other hand, integer data types, such as int8, inherently possess a nested (Matryoshka) structure where smaller bit-width integers, like int4 or int2, are nested within the most significant bits. Leveraging this insight, in this paper, we propose Matryoshka Quantization (MatQuant), a novel multi-scale quantization technique that alleviates the aforementioned challenge. This technique allows us to train and maintain a single quantized model but serve it with the precision demanded by the deployment. Furthermore, due to the co-training and co-distillation regularization provided by MatQuant, the int2 precision models extracted by MatQuant can be up to 10% more accurate than standard int2 quantization (using techniques like QAT or OmniQuant). This represents significant progress in model quantization, demonstrated by the fact that, with the same recipe, an int2 FFN-quantized Gemma-2 9B model is more accurate than an int8 FFN-quantized Gemma-2 2B model.
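The nested (Matryoshka) structure of integer types is easy to illustrate: the most significant bits of an 8-bit code are themselves a valid lower-precision code. A sketch with made-up values, assuming unsigned codes for simplicity:

```python
# Sketch of the nested integer structure that MatQuant builds on: slicing the
# most significant bits of an int8 code yields the nested int4/int2 code.
def slice_msb(q8: int, bits: int) -> int:
    """Keep the top `bits` bits of an unsigned 8-bit code."""
    assert 0 <= q8 < 256 and 1 <= bits <= 8
    return q8 >> (8 - bits)

w8 = 0b10110110        # an 8-bit quantized weight (182)
w4 = slice_msb(w8, 4)  # nested int4 code: 0b1011
w2 = slice_msb(w8, 2)  # nested int2 code: 0b10
```

A single stored int8 model can thus be served at 8, 4, or 2 bits by bit-slicing alone, which is the nesting MatQuant's co-training exploits.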
Paperid:641
Authors:Jonas Schweisthal · Dennis Frauen · Maresa Schröder · Konstantin Hess · Niki Kilbertus · Stefan Feuerriegel
Abstract:
Reliable estimation of treatment effects from observational data is important in many disciplines such as medicine. However, estimation is challenging when unconfoundedness, a standard assumption in the causal inference literature, is violated. In this work, we leverage arbitrary (potentially high-dimensional) instruments to estimate bounds on the conditional average treatment effect (CATE). Our contributions are three-fold: (1) We propose a novel approach for partial identification through a mapping of instruments to a discrete representation space so that we yield valid bounds on the CATE. This is crucial for reliable decision-making in real-world applications. (2) We derive a two-step procedure that learns tight bounds using a tailored neural partitioning of the latent instrument space. As a result, we avoid instability issues due to numerical approximations or adversarial training. Furthermore, our procedure aims to reduce the estimation variance in finite-sample settings to yield more reliable estimates. (3) We show theoretically that our procedure obtains valid bounds while reducing estimation variance. We further perform extensive experiments to demonstrate the effectiveness across various settings. Overall, our procedure offers a novel path for practitioners to make use of potentially high-dimensional instruments (e.g., as in Mendelian randomization).
Paperid:642
Authors:Max Wilcoxson · Qiyang Li · Kevin Frans · Sergey Levine
Abstract:
Unsupervised pretraining has been transformative in many supervised domains. However, applying such ideas to reinforcement learning (RL) presents a unique challenge in that finetuning does not involve mimicking task-specific data, but rather exploring and locating the solution through iterative self-improvement. In this work, we study how unlabeled offline trajectory data can be leveraged to learn efficient exploration strategies. While prior data can be used to pretrain a set of low-level skills, or as additional off-policy data for online RL, it has been unclear how to combine these ideas effectively for online exploration. Our method SUPE (Skills from Unlabeled Prior data for Exploration) demonstrates that a careful combination of these ideas compounds their benefits. Our method first extracts low-level skills using a variational autoencoder (VAE), and then pseudo-labels unlabeled trajectories with optimistic rewards and high-level action labels, transforming prior data into high-level, task-relevant examples that encourage novelty-seeking behavior. Finally, SUPE uses these transformed examples as additional off-policy data for online RL to learn a high-level policy that composes pretrained low-level skills to explore efficiently. In our experiments, SUPE consistently outperforms prior strategies across a suite of 42 long-horizon, sparse-reward tasks.
Paperid:643
Authors:Ziming Hong · Runnan Chen · Zengmao Wang · Bo Han · Bo Du · Tongliang Liu
Abstract:
Data-free knowledge distillation (DFKD) transfers knowledge from a teacher to a student without access to the real in-distribution (ID) data. Its common solution is to use a generator to synthesize fake data and use them as a substitute for real ID data. However, existing works typically assume teachers are trustworthy, leaving the robustness and security of DFKD from untrusted teachers largely unexplored. In this work, we conduct the first investigation into distilling non-transferable learning (NTL) teachers using DFKD, where the transferability from an ID domain to an out-of-distribution (OOD) domain is prohibited. We find that NTL teachers fool DFKD by diverting the generator's attention from the useful ID knowledge to the misleading OOD knowledge. This hinders ID knowledge transfer but prioritizes OOD knowledge transfer. To mitigate this issue, we propose Adversarial Trap Escaping (ATEsc) to benefit DFKD by identifying and filtering out OOD-like synthetic samples. Specifically, inspired by the evidence that NTL teachers show stronger adversarial robustness on OOD samples than ID samples, we split synthetic samples into two groups according to their robustness. The fragile group is treated as ID-like data and used for normal knowledge distillation, while the robust group is seen as OOD-like data and utilized for forgetting OOD knowledge. Extensive experiments demonstrate the effectiveness of ATEsc for improving DFKD against NTL teachers.
Paperid:644
Authors:Hu Zican · Wei Liu · Xiaoye Qu · Xiangyu Yue · Chunlin Chen · Zhi Wang · Yu Cheng
Abstract:
While showing sophisticated reasoning abilities, large language models (LLMs) still struggle with long-horizon decision-making tasks due to deficient exploration and long-term credit assignment, especially in sparse-reward scenarios. Inspired by the divide-and-conquer principle, we propose an innovative framework, GLIDER (Grounding Language Models as Efficient Decision-Making Agents via Offline Hierarchical Reinforcement Learning), that introduces a parameter-efficient and generally applicable hierarchy to LLM policies. We develop a scheme where the low-level controller is supervised with abstract, step-by-step plans that are learned and instructed by the high-level policy. This design decomposes complicated problems into a series of coherent chain-of-thought reasoning sub-tasks, providing flexible temporal abstraction to significantly enhance exploration and learning for long-horizon tasks. Furthermore, GLIDER facilitates fast online adaptation to non-stationary environments owing to the strong transferability of its task-agnostic low-level skills. Experiments on ScienceWorld and ALFWorld benchmarks show that GLIDER achieves consistent performance gains, along with enhanced generalization capabilities.
Paperid:645
Authors:Haoran He · Yuxiao Ye · Emmanuel Bengio · Qingpeng Cai · Ling Pan
Abstract:
The Generative Flow Network (GFlowNet) is a probabilistic framework in which an agent learns a stochastic policy and flow functions to sample objects with probability proportional to an unnormalized reward function. GFlowNets share a strong connection with reinforcement learning (RL) that typically aims to maximize reward. A number of recent works explored connections between GFlowNets and maximum entropy (MaxEnt) RL, which incorporates entropy regularization into the standard RL objective. However, the relationship between GFlowNets and standard RL remains largely unexplored, despite the inherent similarities in their sequential decision-making nature. While GFlowNets can discover diverse solutions through specialized flow-matching objectives, connecting them to standard RL can simplify their implementation through well-established RL principles and also improve RL's capabilities in diverse solution discovery (a critical requirement in many real-world applications); bridging this gap can further unlock the potential of both fields. In this paper, we bridge this gap by revealing a fundamental connection between GFlowNets and one of the most basic components of RL -- policy evaluation. Surprisingly, we find that the value function obtained from evaluating a uniform policy is closely associated with the flow functions in GFlowNets. Building upon these insights, we introduce a rectified random policy evaluation (RPE) algorithm, which achieves the same reward-matching effect as GFlowNets based on simply evaluating a fixed random policy, offering a new perspective. Empirical results across extensive benchmarks demonstrate that RPE achieves competitive results compared to previous approaches, shedding light on the previously overlooked connection between (non-MaxEnt) RL and GFlowNets.
Paperid:646
Authors:Jiacheng Zhang · Benjamin Rubinstein · Jingfeng ZHANG · Feng Liu
Abstract:
Statistical adversarial data detection (SADD) detects whether an upcoming batch contains adversarial examples (AEs) by measuring the distributional discrepancies between clean examples (CEs) and AEs. In this paper, we reveal the potential strength of SADD-based methods by theoretically showing that minimizing distributional discrepancy can help reduce the expected loss on AEs. Nevertheless, despite these advantages, SADD-based methods have a potential limitation: they discard inputs detected as AEs, leading to the loss of clean information within those inputs. To address this limitation, we propose a two-pronged adversarial defense method, named Distributional-Discrepancy-based Adversarial Defense (DDAD). In the training phase, DDAD first optimizes the test power of the maximum mean discrepancy (MMD) to derive MMD-OPT, and then trains a denoiser by minimizing the MMD-OPT between CEs and AEs. In the inference phase, DDAD first leverages MMD-OPT to differentiate CEs and AEs, and then applies a two-pronged process: (1) directly feeding the detected CEs into the classifier, and (2) removing noise from the detected AEs by the distributional-discrepancy-based denoiser. Extensive experiments show that DDAD outperforms current state-of-the-art (SOTA) defense methods by notably improving clean and robust accuracy on CIFAR-10 and ImageNet-1K against adaptive white-box attacks. The code is available at: https://anonymous.4open.science/r/DDAD-DB60.
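For intuition about the distributional-discrepancy measure at the core of DDAD, a biased squared-MMD estimate with an RBF kernel can be computed as follows (a minimal generic sketch; DDAD additionally optimizes the test power of the MMD to obtain MMD-OPT, which is not shown here, and the Gaussian batches stand in for clean and adversarial examples):

```python
import numpy as np

def mmd2_biased(X, Y, sigma=1.0):
    """Biased squared-MMD estimate with RBF kernel k(a,b)=exp(-||a-b||^2/(2*sigma^2))."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(64, 2))    # stand-in for clean examples (CEs)
shifted = rng.normal(3.0, 1.0, size=(64, 2))  # stand-in for perturbed examples (AEs)
same = rng.normal(0.0, 1.0, size=(64, 2))     # another clean batch
```

A batch drawn from a shifted distribution yields a much larger discrepancy against the clean batch than a second clean batch does, which is what makes thresholding on the MMD usable as a detector.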
Paperid:647
Authors:Zhiyao Ren · Siyuan Liang · Aishan Liu · Dacheng Tao
Abstract:
In-context learning (ICL) has demonstrated remarkable success in large language models (LLMs) due to its adaptability and parameter-free nature. However, it also introduces a critical vulnerability to backdoor attacks, where adversaries can manipulate LLM behaviors by simply poisoning a few ICL demonstrations. In this paper, we propose, for the first time, the dual-learning hypothesis, which posits that LLMs simultaneously learn both the task-relevant latent concepts and backdoor latent concepts within poisoned demonstrations, jointly influencing the probability of model outputs. Through theoretical analysis, we derive an upper bound for ICL backdoor effects, revealing that the vulnerability is dominated by the concept preference ratio between the task and the backdoor. Motivated by these findings, we propose ICLShield, a defense mechanism that dynamically adjusts the concept preference ratio. Our method encourages LLMs to select clean demonstrations during the ICL phase by leveraging confidence and similarity scores, effectively mitigating susceptibility to backdoor attacks. Extensive experiments across multiple LLMs and tasks demonstrate that our method achieves state-of-the-art defense effectiveness, significantly outperforming existing approaches (+26.02\% on average). Furthermore, our method exhibits exceptional adaptability and defensive performance even for closed-source models (e.g., GPT-4). Our code can be found at Anonymous Github.
Paperid:648
Authors:Duo Liu · Zhiquan Tan · Linglan Zhao · Zhongqiang Zhang · Xiangzhong Fang · Weiran Huang
Abstract:
Generalized Category Discovery (GCD) aims to identify unlabeled samples by leveraging the base knowledge from labeled ones, where the unlabeled set consists of both base and novel classes. Since clustering methods are time-consuming at inference, parametric-based approaches have become more popular. However, recent parametric-based methods suffer from inferior base discrimination due to the unreliable self-supervision. To address this issue, we propose a Reciprocal Learning Framework (RLF) that introduces an auxiliary branch devoted to base classification. During training, the main branch filters the pseudo-base samples to the auxiliary branch. In response, the auxiliary branch provides more reliable soft labels for the main branch, leading to a virtuous cycle. Furthermore, we introduce Class-wise Distribution Regularization (CDR) to mitigate the learning bias towards base classes. CDR essentially increases the prediction confidence of the unlabeled data and boosts the novel class performance. Combined with both components, our proposed method, RLCD, achieves superior performance in all classes with negligible extra computation. Comprehensive experiments across seven GCD datasets validate its superiority. Our codes will be available upon acceptance.
Paperid:649
Authors:Kexuan Shi · Hai Chen · Leheng Zhang · Shuhang Gu
Abstract:
Implicit Neural Representations (INRs), as a versatile representation paradigm, have achieved success in various computer vision tasks. Due to the spectral bias of the vanilla multilayer perceptrons (MLPs), existing methods focus on designing MLPs with sophisticated architectures or repurposing training techniques for highly accurate INRs. In this paper, we delve into the linear dynamics model of MLPs and theoretically identify the empirical Neural Tangent Kernel (eNTK) matrix as a reliable link between spectral bias and training dynamics. Based on this insight, we propose a practical Inductive Gradient Adjustment (IGA) method, which could purposefully improve the spectral bias via inductive generalization of the eNTK-based gradient transformation matrix. Theoretical and empirical analyses validate the impact of IGA on spectral bias. Further, we evaluate our method on different INR tasks with various INR architectures and compare it to existing training techniques. The superior and consistent improvements clearly validate the advantage of our IGA. Armed with our gradient adjustment method, better INRs with more enhanced texture details and sharpened edges can be learned from data by tailored impacts on spectral bias.
Paperid:650
Authors:Haojie Duanmu · Xiuhong Li · Zhihang Yuan · Size Zheng · Jiangfei Duan · Xingcheng ZHANG · Dahua Lin
Abstract:
Mixture-of-Experts (MoE) models face deployment challenges due to their large parameter counts and computational demands. We explore quantization for MoE models and highlight two key insights: 1) linear blocks exhibit varying quantization sensitivity, and 2) divergent expert activation frequencies create heterogeneous computational characteristics. Based on these observations, we introduce MxMoE, a mixed-precision optimization framework for MoE models that considers both algorithmic and system perspectives. MxMoE navigates the design space defined by parameter sensitivity, expert activation dynamics, and hardware resources to derive efficient mixed-precision configurations. Additionally, MxMoE automatically generates optimized mixed-precision Group-GEMM kernels, enabling parallel execution of GEMMs with different precisions. Evaluations show that MxMoE outperforms existing methods, achieving 2.4 lower Wikitext-2 perplexity than GPTQ at 2.25-bit and delivering up to 3.4$\times$ speedup over full precision at 5-bit weight-activation quantization.
Paperid:651
Authors:Qi He · Peiran Yu · Ziyi Chen · Heng Huang
Abstract:
Shuffling-type gradient methods are favored in practice for their simplicity and rapid empirical performance. Despite extensive development of convergence guarantees under various assumptions in recent years, most require the Lipschitz smoothness condition, which is often not met in common machine learning models. We highlight this issue with specific counterexamples. To address this gap, we revisit the convergence rates of shuffling-type gradient methods without assuming Lipschitz smoothness. Using our stepsize strategy, the shuffling-type gradient algorithm not only converges under weaker assumptions but also matches the current best-known convergence rates, thereby broadening its applicability. We prove the convergence rates for nonconvex, strongly convex, and non-strongly convex cases, each under both random reshuffling and arbitrary shuffling schemes, under a general bounded variance condition. Numerical experiments further validate the performance of our shuffling-type gradient algorithm, underscoring its practical efficacy.
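A minimal sketch of a shuffling-type (random reshuffling) gradient method on a toy finite-sum objective may help fix the setting (illustration only; the paper's stepsize strategy and its non-Lipschitz assumptions are not reproduced here):

```python
import numpy as np

def reshuffled_sgd(grad_i, x0, n, epochs, lr, rng):
    """Random reshuffling: each epoch visits every component gradient exactly once."""
    x = x0
    for _ in range(epochs):
        for i in rng.permutation(n):  # fresh random ordering each epoch
            x = x - lr * grad_i(i, x)
    return x

# Toy finite-sum objective f(x) = (1/n) * sum_i (x - a_i)^2 / 2, minimized at mean(a).
a = np.array([1.0, 2.0, 3.0, 6.0])
grad_i = lambda i, x: x - a[i]  # gradient of the i-th component
x_star = reshuffled_sgd(grad_i, 0.0, len(a), epochs=200, lr=0.05,
                        rng=np.random.default_rng(1))
```

Unlike with-replacement SGD, each epoch is a single pass over a permutation of the components, which is the sampling scheme the convergence rates above are stated for.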
Paperid:652
Authors:Edward Pearce-Crump
Abstract:
Group equivariant neural networks have proven effective in modelling a wide range of tasks where the data lives in a classical geometric space and exhibits well-defined group symmetries. However, these networks are not suitable for learning from data that lives in a non-commutative geometry, described formally by non-commutative $\mathcal{C}^{\ast}$-algebras, since the $\mathcal{C}^{\ast}$-algebra of continuous functions on a compact matrix group is commutative. To address this limitation, we derive the existence of a new type of equivariant neural network, called compact matrix quantum group equivariant neural networks, which encode symmetries that are described by compact matrix quantum groups. We characterise the weight matrices that appear in these neural networks for the easy compact matrix quantum groups, which are defined by set partitions. As a result, we obtain new characterisations of equivariant weight matrices for some compact matrix groups that have not appeared previously in the machine learning literature.
Paperid:653
Authors:Mingjun Wang · Yihan Wen · Bin Sun · Jianan Mu · Juan Li · Xiaoyi Wang · Jing Ye · Bei Yu · Huawei Li
Abstract:
Accurate and efficient timing prediction at the register-transfer level (RTL) remains a fundamental challenge in electronic design automation (EDA), particularly in balancing accuracy with computational efficiency. While static timing analysis (STA) provides high-fidelity results through comprehensive physical parameters, its computational overhead makes it impractical for rapid design iterations. Conversely, existing RTL-level approaches sacrifice accuracy due to limited physical information. We propose RTLDistil, a novel cross-stage knowledge distillation framework that bridges this gap by transferring precise physical characteristics from a layout-aware teacher model (Teacher GNN) to an efficient RTL-level student model (Student GNN), both implemented as graph neural networks (GNNs). RTLDistil efficiently predicts key timing metrics, such as arrival time (AT), and employs a multi-granularity distillation strategy that captures timing-critical features at node, subgraph, and global levels. Experimental results demonstrate that RTLDistil achieves a significant reduction in RTL-level timing prediction error compared to state-of-the-art prediction models. This framework enables accurate early-stage timing prediction, advancing EDA's ``left-shift'' paradigm while maintaining computational efficiency.
Paperid:654
Authors:Lucas Torroba Hennigen · Hunter Lang · Han Guo · Yoon Kim
Abstract:
We study memory-efficient optimization of neural networks with linear gradient transformations, where the gradients are linearly mapped to a lower dimensional space than the full parameter space, thus saving memory required for gradient accumulation and optimizer state persistence. The model parameters are updated by first performing an optimization step in the lower dimensional space and then going back into the original parameter space via the linear map's transpose. We show that optimizing the model in this transformed space is equivalent to reparameterizing the original model through a linear adapter that additively modifies the model parameters, and then only optimizing the adapter's parameters. When the transformation is Kronecker-factored, this establishes an equivalence between GaLore and one-sided LoRA. We show that this duality between gradient transformations and adapter-based reparameterizations unifies existing approaches to memory-efficient training and suggests new techniques for improving training efficiency and memory use.
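The stated equivalence between a transformed-gradient step and plain SGD on a linear adapter can be checked numerically for vanilla SGD (a toy sketch of the duality as described in the abstract; the map `S` and the quadratic objective are arbitrary choices for illustration, and the paper's Kronecker-factored case is not shown):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, lr = 6, 2, 0.1
S = rng.normal(size=(r, d))       # fixed linear map: full space -> low-dim space
w0 = rng.normal(size=d)
grad = lambda w: 2.0 * (w - 1.0)  # gradient of ||w - 1||^2

# View (a): map the gradient down with S, step in r dims, map back with S^T.
w = w0.copy()
for _ in range(3):
    w = w - lr * (S.T @ (S @ grad(w)))

# View (b): reparameterize w = w0 + S^T a and run plain SGD on the adapter a.
a = np.zeros(r)
for _ in range(3):
    a = a - lr * (S @ grad(w0 + S.T @ a))
w_adapter = w0 + S.T @ a
```

Both loops follow the identical recursion $w_{t+1} = w_t - \eta\, S^\top S\, \nabla f(w_t)$, so the two views produce the same iterates.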
Paperid:655
Authors:Jiayuan Zhang · Xuefeng Liu · Jianwei Niu · Shaojie Tang · Haotian Yang · Xinghao Wu
Abstract:
The out-of-distribution (OOD) generalization problem in federated learning (FL) has recently attracted significant research interest. A common approach, derived from centralized learning, is to extract causal features which exhibit causal relationships with the label. However, in FL, the global feature extractor typically captures only invariant causal features shared across clients and thus discards many other causal features that are potentially useful for OOD generalization. To address this problem, we propose FedUni, a simple yet effective architecture trained to extract all possible causal features from any input. FedUni consists of a comprehensive feature extractor, designed to identify a union of all causal feature types in the input, followed by a feature compressor, which discards potential fake causal features. With this architecture, FedUni can benefit from collaborative training in FL while avoiding the cost of model aggregation (i.e., extracting only invariant features). In addition, to further enhance the feature extractor's ability to capture causal features, FedUni adds a causal intervention module on the client side, which employs a counterfactual generator to generate counterfactual examples that simulate distribution shifts. Extensive experiments and theoretical analysis demonstrate that our method significantly improves OOD generalization performance.
Paperid:656
Authors:Xuyang Chen · Lin Zhao
Abstract:
Actor-critic algorithms have been instrumental in boosting the performance of numerous challenging applications involving continuous control, such as highly robust and agile robot motion control. However, their theoretical understanding remains largely underdeveloped. Existing analyses mostly focus on finite state-action spaces and on simplified variants of actor-critic, such as double-loop updates with i.i.d. sampling, which are often impractical for real-world applications. We consider the canonical and widely adopted single-timescale updates with Markovian sampling in continuous state-action space. Specifically, we establish finite-time convergence by introducing a novel Lyapunov analysis framework, which provides a unified convergence characterization of both the actor and the critic. Our approach is less conservative than previous methods and offers new insights into the coupled dynamics of actor-critic updates.
Paperid:657
Authors:Batuhan Karaman · ishmam zabir · Alon Benhaim · Vishrav Chaudhary · Mert Sabuncu · Xia Song
Abstract:
Achieving both high safety and high usefulness simultaneously in large language models has become a critical challenge in recent years. Models often exhibit unsafe behavior or adopt an overly cautious approach leading to frequent overrefusal of benign prompts, which reduces their usefulness. A major factor underlying these behaviors is how the models are finetuned and aligned, particularly the nature and extent of the data used. In this work, we examine how overgenerating finetuning data with advanced teacher models (e.g., GPT-4o), covering both general-purpose and toxic prompts, affects safety and usefulness in instruction-following language models. Additionally, we present POROver, an alignment strategy designed for models that are highly safe but prone to overrefusal. POROver employs preference optimization algorithms and leverages completions from an advanced teacher model to reduce overrefusals while maintaining safety. Our results show that overgenerating completions for general-purpose prompts significantly boosts safety with only a minimal impact on usefulness. Specifically, the F1 score calculated between safety and usefulness increases from 74.4% to 91.8% because of a substantial rise in safety. Moreover, overgeneration for toxic prompts raises usefulness from 11.1% to 57.6% while preserving safety. Finally, applying POROver increases usefulness further, from 57.6% to 82.1%, while keeping safety at comparable levels.
Paperid:658
Authors:Chen-Xiao Gao · Chenyang Wu · Mingjun Cao · Chenjun Xiao · Yang Yu · Zongzhang Zhang
Abstract:
The primary focus of offline reinforcement learning (RL) is to manage the risk of hazardous exploitation of out-of-distribution actions. An effective approach to achieve this goal is through behavior regularization, which augments conventional RL objectives by incorporating constraints that enforce the policy to remain close to the behavior policy. Nevertheless, existing literature on behavior-regularized RL primarily focuses on explicit policy parameterizations, such as Gaussian policies. Consequently, it remains unclear how to extend this framework to more advanced policy parameterizations, such as diffusion models. In this paper, we introduce BDPO, a principled behavior-regularized RL framework tailored for diffusion-based policies, thereby combining the expressive power of diffusion policies and the robustness provided by regularization. The key ingredient of our method is to calculate the Kullback-Leibler (KL) regularization analytically as the accumulated discrepancies in reverse-time transition kernels along the diffusion trajectory. By integrating the regularization, we develop an efficient two-time-scale actor-critic RL algorithm that produces the optimal policy while respecting the behavior constraint. Comprehensive evaluations conducted on synthetic 2D tasks and continuous control tasks from the D4RL benchmark validate its effectiveness and superior performance.
Paperid:659
Authors:Zhicheng Jiang · Qiao Sun · Hanhong Zhao · Kaiming He
Abstract:
It is widely believed that noise conditioning is indispensable for denoising diffusion models to work successfully. This work challenges this belief. Motivated by research on blind image denoising, we investigate a variety of denoising-based generative models in the absence of noise conditioning. To our surprise, most models exhibit graceful degradation, and in some cases, they even perform better without noise conditioning. We provide a mathematical analysis of the error introduced by removing noise conditioning and demonstrate that our analysis aligns with empirical observations. We further introduce a noise-unconditional model that achieves a competitive FID of 2.23 on CIFAR-10, significantly narrowing the gap to leading noise-conditional models. We hope our findings will inspire the community to revisit the foundations and formulations of denoising generative models.
Paperid:660
Authors:zhou changshi · Feng Luan · hujiarui · Shaoqiang Meng · Zhipeng Wang · Yanchao Dong · Yanmin Zhou · Bin He
Abstract:
Garment manipulation is a significant challenge for robots due to the complex dynamics and potential self-occlusion of garments. Most existing methods of efficient garment unfolding overlook the crucial role of standardization of flattened garments, which could significantly simplify downstream tasks like folding, ironing, and packing. This paper presents APS-Net, a novel approach to garment manipulation that combines unfolding and standardization in a unified framework. APS-Net employs a dual-arm, multi-primitive policy with dynamic fling to quickly unfold crumpled garments and pick-and-place (p&p) for precise alignment. The purpose of garment standardization during unfolding involves not only maximizing surface coverage but also aligning the garment’s shape and orientation to predefined requirements. To guide effective robot learning, we introduce a novel factorized reward function for standardization, which incorporates garment coverage (Cov), keypoint distance (KD), and intersection-over-union (IoU) metrics. Additionally, we introduce a spatial action mask and an Action Optimized Module to improve unfolding efficiency by selecting actions and operation points effectively. In simulation, APS-Net outperforms state-of-the-art methods for long sleeves, achieving 3.9% better coverage, 5.2% higher IoU, and a 0.14 decrease in KD (7.09% relative reduction). Real-world folding tasks further demonstrate that standardization simplifies the folding process. Project page: https://hellohaia.github.io/APS/
Paperid:661
Authors:Zebin You · Jingyang Ou · Xiaolu Zhang · Jun Hu · JUN ZHOU · Chongxuan Li
Abstract:
Although masked image generation models and masked diffusion models are designed with different motivations and objectives, we observe that they can be unified within a single framework. Building upon this insight, we carefully explore the design space of training and sampling, identifying key factors that contribute to both performance and efficiency. Based on the improvements observed during this exploration, we develop our model, referred to as **eMIGM**. Empirically, eMIGM demonstrates strong performance on ImageNet generation, as measured by Fréchet Inception Distance (FID). In particular, on ImageNet $256\times256$, with a similar number of function evaluations (NFEs) and model parameters, eMIGM outperforms the seminal VAR. Moreover, as NFE and model parameters increase, eMIGM achieves performance comparable to the state-of-the-art continuous diffusion models while requiring less than 40\% of the NFE. Additionally, on ImageNet $512\times512$, with only about 60\% of the NFE, eMIGM outperforms the state-of-the-art continuous diffusion models.
Paperid:662
Authors:Yingpeng Tang · Chao Ren · Xiaoli Tang · Sheng-Jun Huang · Lizhen Cui · Han Yu
Abstract:
Federated Active Learning (FAL) aims to learn an effective global model, while minimizing label queries. Owing to privacy requirements, it is challenging to design effective active data selection schemes due to the lack of cross-client query information. In this paper, we bridge this important gap by proposing the \underline{F}ederated \underline{A}ctive data selection by \underline{LE}verage score sampling (FALE) method. It is designed for regression tasks in the presence of non-i.i.d. client data to enable the server to select data globally in a privacy-preserving manner. Based on FedSVD, FALE aims to estimate the utility of unlabeled data and perform data selection via leverage score sampling. Besides, a secure model learning framework is designed for federated regression tasks to exploit supervision. FALE can operate without requiring an initial labeled set and select the instances in a single pass, significantly reducing communication overhead. Theoretical analysis establishes the query complexity for FALE to achieve constant factor approximation and relative error approximation. Extensive experiments on 11 benchmark datasets demonstrate significant improvements of FALE over existing state-of-the-art methods.
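For intuition, the classical statistical leverage scores that underlie leverage score sampling can be computed from a thin SVD of a design matrix (a generic centralized sketch; FALE's privacy-preserving, FedSVD-based estimation across clients is not shown):

```python
import numpy as np

def leverage_scores(X):
    """Row leverage scores: squared row norms of U from the thin SVD of X."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return (U ** 2).sum(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[0] *= 20.0                   # make one row highly influential
scores = leverage_scores(X)
probs = scores / scores.sum()  # leverage score sampling distribution over rows
chosen = rng.choice(len(X), size=10, replace=True, p=probs)
```

The scores sum to the rank of `X`, and influential rows (here the scaled row 0) receive proportionally higher probability of being selected for labeling.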
Paperid:663
Authors:Jiawei Zhang · Shuang Yang · Bo Li
Abstract:
Large Language Model (LLM) agents equipped with external tools have become increasingly powerful for handling complex tasks such as web shopping, automated email replies, and financial trading. However, these advancements also amplify the risks of adversarial attacks, particularly when LLM agents can access sensitive external functionalities. Moreover, because LLM agents engage in extensive reasoning or planning before executing final actions, manipulating them into performing targeted malicious actions or invoking specific tools remains a significant challenge. Consequently, directly embedding adversarial strings in malicious instructions or injecting malicious prompts into tool interactions has become less effective against modern LLM agents. In this work, we present UDora, a unified red teaming framework designed for LLM Agents that dynamically leverages the agent's own reasoning processes to compel it toward malicious behavior. Specifically, UDora first samples the model's reasoning for the given task, then automatically identifies multiple optimal positions within these reasoning traces to insert targeted perturbations. Subsequently, it uses the modified reasoning as the objective to optimize the adversarial strings. By iteratively applying this process, the LLM agent will then be induced to undertake designated malicious actions or to invoke specific malicious tools. Our approach demonstrates superior effectiveness compared to existing methods across three LLM agent datasets.
Paperid:664
Authors:Justin S. Kang · Landon Butler · Abhineet Agarwal · Yigit Efe Erginbas · Ramtin Pedarsani · Bin Yu · Kannan Ramchandran
Abstract:
Large language models (LLMs) have revolutionized machine learning due to their ability to capture complex interactions between input features. Popular post-hoc explanation methods like SHAP provide *marginal* feature attributions, while their extensions to interaction importances only scale to small input lengths ($\approx 20$). We propose *Spectral Explainer* (SPEX), a model-agnostic interaction attribution algorithm that efficiently scales to large input lengths ($\approx 1000$). SPEX exploits underlying natural sparsity among interactions—common in real-world data—and applies a sparse Fourier transform using a channel decoding algorithm to efficiently identify important interactions. We perform experiments across three difficult long-context datasets that require LLMs to utilize interactions between inputs to complete the task. For large inputs, SPEX outperforms marginal attribution methods by up to 20\% in terms of faithfully reconstructing LLM outputs. Further, SPEX successfully identifies key features and interactions that strongly influence model output. For one of our datasets, *HotpotQA*, SPEX provides interactions that align with human annotations. Finally, we use our model-agnostic approach to generate explanations to demonstrate abstract reasoning in closed-source LLMs (*GPT-4o mini*) and compositional reasoning in vision-language models.
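A tiny illustration of the Fourier view of interactions that SPEX builds on: for a toy function of n=4 binary features driven by a single pairwise interaction, a dense Walsh-Hadamard transform recovers exactly one nonzero coefficient (SPEX itself uses sparse Fourier transforms and channel decoding to scale far beyond this brute-force enumeration; the function here is a made-up stand-in for a model output):

```python
import numpy as np
from itertools import product

def walsh_hadamard(f_vals):
    """Normalized fast Walsh-Hadamard transform of a function tabulated on {+1,-1}^n."""
    a = np.asarray(f_vals, dtype=float).copy()
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            for j in range(i, i + h):
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a / len(a)

n = 4
inputs = np.array(list(product([1, -1], repeat=n)))  # row k: feature 0 varies slowest
f = 2.0 * inputs[:, 0] * inputs[:, 1]  # sparse: one pairwise interaction, weight 2
coeffs = walsh_hadamard(f)
# The single nonzero coefficient sits at index 0b1100, i.e., the set {feature 0, feature 1}.
```

When the spectrum is sparse like this, far fewer than 2^n function evaluations suffice in principle, which is the structure sparse-transform methods exploit.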
Paperid:665
Authors:Jihwan Jeong · Xiaoyu Wang · Jingmin Wang · Scott Sanner · Pascal Poupart
Abstract:
Offline reinforcement learning (RL) is crucial when online exploration is costly or unsafe but often struggles with high epistemic uncertainty due to limited data. Existing methods rely on fixed conservative policies, restricting adaptivity and generalization. To address this, we propose Reflect-then-Plan (RefPlan), a novel doubly Bayesian offline model-based (MB) planning approach. RefPlan unifies uncertainty modeling and MB planning by recasting planning as Bayesian posterior estimation. At deployment, it updates a belief over environment dynamics using real-time observations, incorporating uncertainty into MB planning via marginalization. Empirical results on standard benchmarks show that RefPlan significantly improves the performance of conservative offline RL policies. In particular, RefPlan maintains robust performance under high epistemic uncertainty and limited data, while demonstrating resilience to changing environment dynamics, improving the flexibility, generalizability, and robustness of offline-learned policies.
Paperid:666
Authors:Weihan Li · Yule Wang · Chengrui Li · Anqi Wu
Abstract:
Understanding and modeling brain communications that capture dynamic interactions across multiple regions is fundamental to modern systems neuroscience, yet current methods struggle to find time-varying region-level communications or scale to large neural datasets with long recording durations. We present a novel framework using Markovian Gaussian Processes to learn brain communications with time-varying temporal delays from multi-region neural recordings, named Adaptive Delay Model (ADM). Our method combines Gaussian Processes with State Space Models and employs parallel scan inference algorithms, enabling efficient scaling to large datasets while identifying concurrent communication patterns that evolve over time. This time-varying approach captures how brain region interactions shift dynamically during cognitive processes. Validated on synthetic and multi-region neural recording datasets, our approach discovers both the directionality and temporal dynamics of neural communication. This work advances our understanding of distributed neural computation and provides a scalable tool for analyzing dynamic brain networks.
Paperid:667
Authors:Yuanwei Zhang · Fengmiao Bian · Xiaoqun Zhang · Jian-Feng Cai
Abstract:
Tensors play a crucial role in numerous scientific and engineering fields. This paper addresses the low-multilinear-rank tensor completion problem, a fundamental task in tensor-related applications. By exploiting the manifold structure inherent to the fixed-multilinear-rank tensor set, we introduce a simple yet highly effective preconditioned Riemannian metric and propose the Preconditioned Riemannian Gradient Descent (PRGD) algorithm. Compared to the standard Riemannian Gradient Descent (RGD), PRGD achieves faster convergence while maintaining the same order of per-iteration computational complexity. Theoretically, we provide the recovery guarantee for PRGD under near-optimal sampling complexity. Numerical results highlight the efficiency of PRGD, outperforming state-of-the-art methods on both synthetic data and real-world video inpainting tasks.
Paperid:668
Authors:Qiuxia Lin · Glory Rongyu CHEN · Kerui Gu · Angela Yao
Abstract:
This work highlights a semantics misalignment in 3D Human Pose Estimation. For the task of test-time adaptation, the misalignment manifests as over-smoothed and unguided predictions. The smoothing settles predictions towards some average pose. Furthermore, when there are occlusions or truncations, the adaptation becomes fully unguided. To this end, we pioneer the integration of a semantics-aware motion prior for the test-time adaptation of 3D pose estimation. We leverage video understanding and a well-structured motion-text space to adapt the model motion prediction to adhere to video semantics during test time. Additionally, we incorporate a missing 2D pose completion based on the motion-text similarity. The pose completion strengthens the motion prior's guidance for occlusions and truncations. Our method significantly improves state-of-the-art 3D human pose estimation TTA techniques, with a more than 12% decrease in PA-MPJPE on 3DPW and 3DHP.
Paperid:669
Authors:Zhuohao Yu · Weizheng Gu · Yidong Wang · Xingru Jiang · Zhengran Zeng · Jindong Wang · Wei Ye · Shikun Zhang
Abstract:
Large Language Models excel at code generation yet struggle with complex programming tasks that demand sophisticated reasoning. To bridge this gap, traditional process supervision relies on learned reward models requiring costly training data and suffering from reward misalignment, while outcome supervision fails for complex tasks needing coordinated intermediate steps. We introduce Outcome-Refining Process Supervision (ORPS), which unifies process and outcome supervision by leveraging executable verification: a tree-structured search framework generates strategic alternatives, profiles execution metrics, and scores candidates via self-critique mechanisms that integrate runtime feedback with reasoning. Experiments across 5 models and 3 benchmarks show consistent gains, with 26.9\% higher correctness and 42.2\% improved code efficiency. The results demonstrate that ORPS enables LLMs to overcome local optima in code generation, suggesting a promising direction for combining verifiable outcomes with structured reasoning to tackle complex challenges.
Paperid:670
Authors:Zhixian Xie · Haode Zhang · Yizhe Feng · Wanxin Jin
Abstract:
Reward design for reinforcement learning and optimal control agents is challenging. Preference-based alignment addresses this by enabling agents to learn rewards from ranked trajectory pairs provided by humans. However, existing methods often suffer from poor robustness to unknown false human preferences. In this work, we propose a robust and efficient reward alignment method based on a novel and geometrically interpretable perspective: hypothesis space batched cutting. Our method iteratively refines the reward hypothesis space through “cuts” based on batches of human preferences. Within each batch, human preferences, queried based on disagreement, are grouped using a voting function to determine the appropriate cut, ensuring a bounded human query complexity. To handle unknown erroneous preferences, we introduce a conservative cutting method within each batch, preventing erroneous human preferences from making overly aggressive cuts to the hypothesis space. This guarantees provable robustness against false preferences. We evaluate our method in a model predictive control setting across diverse tasks, including DM-Control, dexterous in-hand manipulation, and locomotion. The results demonstrate that our framework achieves comparable or superior performance to state-of-the-art methods in error-free settings while significantly outperforming existing methods when handling a high percentage of erroneous human preferences.
Paperid:671
Authors:Ruiqi Feng · Tailin Wu · Chenglei Yu · Wenhao Deng · Peiyan Hu
Abstract:
Flow matching has shown state-of-the-art performance in various generative tasks, ranging from image generation to decision-making, where guided generation is pivotal. However, the guidance of flow matching is more general than and thus substantially different from that of its predecessor, diffusion models. Therefore, the challenge in guidance for general flow matching remains largely underexplored. In this paper, we propose the first framework of general guidance for flow matching. From this framework, we derive a family of guidance techniques that can be applied to general flow matching. These include a new training-free asymptotically exact guidance, novel training losses for training-based guidance, and two classes of approximate guidance that cover classical gradient guidance methods as special cases. We theoretically investigate these different methods to give a practical guideline for choosing suitable methods in different scenarios. Experiments on synthetic datasets, image inverse problems, and offline reinforcement learning demonstrate the effectiveness of our proposed guidance methods and verify the correctness of our flow matching guidance framework. Code to reproduce the experiments can be found at https://anonymous.4open.science/r/GuidedFlow-BB15.
Paperid:672
Authors:Yiting He · Zhishuai Liu · Weixin Wang · Pan Xu
Abstract:
In off-dynamics reinforcement learning (RL), distribution shifts between training and deployment environments necessitate robust policy learning under uncertain dynamic changes, which can be formulated as a robust Markov decision process (RMDP). Unlike existing literature that often assumes access to pre-collected datasets with a good coverage of the deployment environment, we focus on the setting where only online interaction with the training environment is allowed. Online RMDPs require advanced exploration strategies to gather data that effectively estimates robust policies for deployment environments. This poses unique challenges due to distributional shifts compared to standard MDPs. We study two types of RMDPs addressing dynamic uncertainty differently: Constrained Robust Markov Decision Processes (CRMDPs) that optimize worst-case outcomes within a predefined set of dynamics and Regularized Robust Markov Decision Processes (RRMDPs) that use a soft penalty via regularization in their optimization. We introduce the supremal visitation ratio as a key metric for evaluating an agent's exploration efficiency in online RMDPs and show that when this ratio is unbounded, online learning becomes exponentially hard. We develop computationally efficient algorithms that achieve sublinear regret for both CRMDPs and RRMDPs and establish regret lower bounds demonstrating that our algorithm enjoys the optimal dependence on the supremal visitation ratio and the number of interaction episodes. We further conduct comprehensive numerical experiments in diverse environments to validate the robustness of our method.
Paperid:673
Authors:Dongwon Kim · Donghee Kim · Sung Kuk Shyn · Kwangsu Kim
Abstract:
Federated learning (FL) often struggles with generalization due to heterogeneous client data. Local models are prone to overfitting their local data distributions, and even transferable features can be distorted during aggregation. To address these challenges, we propose FedCONST, an approach that adaptively modulates update magnitudes based on the global model’s parameter strength. This prevents overemphasizing well-learned parameters while reinforcing underdeveloped ones. Specifically, FedCONST employs linear convex constraints to ensure training stability and preserve locally learned generalization capabilities during aggregation. A Gradient Signal-to-Noise Ratio (GSNR) analysis further validates FedCONST's effectiveness in enhancing feature transferability and robustness. As a result, FedCONST effectively aligns local and global objectives, mitigating overfitting and promoting stronger generalization across diverse FL environments, achieving state-of-the-art performance.
Paperid:674
Authors:Dulhan Jayalath · Gilad Landau · Brendan Shillingford · Mark Woolrich · ʻŌiwi Parker Jones
Abstract:
The past few years have seen remarkable progress in the decoding of speech from brain activity, primarily driven by large single-subject datasets. However, due to individual variation, such as anatomy, and differences in task design and scanning hardware, leveraging data across subjects and datasets remains challenging. In turn, the field has not benefited from the growing number of open neural data repositories to exploit large-scale deep learning. To address this, we develop neuroscience-informed self-supervised objectives, together with an architecture, for learning from heterogeneous brain recordings. Scaling to nearly 400 hours of MEG data and 900 subjects, our approach shows generalisation across participants, datasets, tasks, and even to novel subjects. It achieves improvements of 15-27% over state-of-the-art models and matches surgical decoding performance with non-invasive data. These advances unlock the potential for scaling speech decoding models beyond the current frontier.
Paperid:675
Authors:Zihan Tan · Suyuan Huang · Guancheng Wan · Wenke Huang · He Li · Mang Ye
Abstract:
Federated Graph Learning (FGL) combines the privacy-preserving capabilities of federated learning (FL) with the strong graph modeling capability of Graph Neural Networks (GNNs). Current research addresses subgraph-FL based on pure structures, neglecting the propagation of graph signals on the spatial and spectral domains of the structure. From a spatial perspective, subgraph-FL introduces edge disconnections between clients, leading to disruptions in label signals and a degradation in the class knowledge of the global GNN. From the spectral perspective, spectral heterogeneity causes inconsistencies in signal frequencies across clients, which makes local GNNs overfit the local signal propagation schemes. As a result, spectral client drifts occur, undermining global generalizability. To tackle these challenges, we propose a global knowledge repository to mitigate label signal disruption and a frequency alignment to address spectral client drifts. The combination of $\textbf{S}$patial and $\textbf{S}$pectral strategies forms our framework $S^2$FGL. Extensive experiments demonstrate the superiority of $S^2$FGL.
Paperid:676
Authors:Zhang Chengqiang · Yanghui Rao · Qing Li · Detian Zhang · Chunjiang Zhu
Abstract:
The isomorphism problem is a key challenge in both graph and hypergraph domains, crucial for applications like protein design, chemical pathways, and community detection. Hypergraph isomorphism, which models high-order relationships in real-world scenarios, remains underexplored compared to graph isomorphism. Current algorithms for hypergraphs, like the 1-dimensional generalized Weisfeiler-Lehman test (1-GWL), lag behind advancements in graph isomorphism tests, limiting most hypergraph neural networks to 1-GWL's expressive power. To address this, we propose the high-dimensional GWL (k-GWL), generalizing k-WL from graphs to hypergraphs. We prove that k-GWL reduces to k-WL for simple graphs, and thus develop a unified isomorphism method for both graphs and hypergraphs. We also establish a clear and complete understanding of the GWL expressivity hierarchy, showing that (k+1)-GWL is more expressive than k-GWL, with illustrative examples. Based on k-GWL, we develop a hypergraph neural network model named k-HNN with the improved expressive power of k-GWL, which achieves superior performance on real-world datasets, including a 6\% accuracy improvement on the Steam-Player dataset over the runner-up.
Paperid:677
Authors:Ian Magnusson · Tai Nguyen · David Heineman · Jena Hwang · Luca Soldaini · Akshita Bhagia · Jiacheng Liu · Dirk Groeneveld · Oyvind Tafjord · Noah Smith · Pang Wei Koh · Ben Bogin · Jesse Dodge
Abstract:
Because large language models are expensive to pretrain on different corpora, the ability to use smaller-scale experiments to decide on data is crucial for reducing costs. This relies on data that gives better performance at small scale also doing so at larger scale. We test which methods best detect early signal with controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering, up to 100B tokens and model sizes up to 1B parameters. We release models, data, and evaluations in our Data Differences over Scale (DataDos) Suite as the most extensive openly available sweep of data decisions over scales and random seeds. We find predictions based on experiments at single, rather than multiple, scales are most efficient. Even 150M models trained with < 2% of the compute of 1B targets correctly predict 80% of comparisons. As most corpora scale with similar steepness, few scaling trends “crossover” even spanning an order of magnitude of compute. However, extrapolating beyond our observations, we see that more trends cross as the span of compute grows. We also identify that, among 10 multiple-choice benchmarks, MMLU and ARC are highly predictable from just 150M-parameter experiments, while others like CSQA require greater compute.
Paperid:678
Authors:JianFeng Cai · Zhuozhi XIAN · Jiaxi Ying
Abstract:
We explore the single-spiked covariance model within the context of sparse principal component analysis (PCA), which aims to recover a sparse unit vector from noisy samples. From an information-theoretic perspective, $O(k \log p)$ observations are sufficient to recover a $k$-sparse $p$-dimensional vector $\mathbf{v}$. However, existing polynomial-time methods require at least $O(k^2)$ samples for successful recovery, highlighting a significant gap in sample efficiency. To bridge this gap, we introduce a novel thresholding-based algorithm that requires only $\Omega(k \log p)$ samples, provided the signal strength $\lambda = \Omega(||\mathbf{v}||_\infty^{-1})$. We also propose a two-stage nonconvex algorithm that further enhances estimation performance. This approach integrates our thresholding algorithm with truncated power iteration, achieving the minimax optimal rate of statistical error under the desired sample complexity. Numerical experiments validate the superior performance of our algorithms in terms of estimation accuracy and computational efficiency.
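To make the two-stage idea above concrete, here is a hedged sketch that combines a diagonal-thresholding initialization with truncated power iteration on a single-spiked sample covariance. The specific initialization rule, iteration count, and data model below are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def truncate(v, k):
    """Keep the k largest-magnitude entries of v, zero the rest, renormalize."""
    w = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    w[idx] = v[idx]
    return w / np.linalg.norm(w)

def sparse_pca(S, k, iters=100):
    """Truncated power iteration on a sample covariance S, initialized from
    the k largest diagonal entries (a simple thresholding-style start)."""
    v = np.zeros(S.shape[0])
    v[np.argsort(np.diag(S))[-k:]] = 1.0
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = truncate(S @ v, k)
    return v

# Single-spiked data: each sample is isotropic noise plus a spike along v_true.
rng = np.random.default_rng(0)
p, k, n = 100, 5, 2000
v_true = np.zeros(p)
v_true[:k] = 1 / np.sqrt(k)
X = rng.normal(size=(n, p)) + 2.0 * rng.normal(size=(n, 1)) * v_true
S = X.T @ X / n
v_hat = sparse_pca(S, k)
print(abs(v_true @ v_hat))  # close to 1 when support recovery succeeds
```

In this toy regime the spike is strong and the sample size generous; the paper's contribution concerns much tighter sample-complexity regimes than this sketch exercises.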
Paperid:679
Authors:Zeyu Tang · Zhenhao Chen · Loka Li · Xiangchen Song · Yunlong Deng · Yifan Shen · Guangyi Chen · Peter Spirtes · Kun Zhang
Abstract:
The autoregressive approach to text generation in large language models (LLMs), while widely used, is inherently suboptimal due to the lack of a built-in mechanism to perform refinement and/or correction of the generated content. In this paper, we consider optimality in terms of the joint probability over the generated response, when jointly considering all tokens at the same time. We theoretically characterize the potential deviation of the autoregressively generated response from its globally optimal counterpart that is of the same length. Our analysis suggests that we need to be cautious when noticeable uncertainty arises during text generation, which may signal the sub-optimality of the generation history. To address the pitfall of autoregressive text generation, we propose an approach that incorporates a sliding reflection window and a pausing criterion, such that refinement and generation can be carried out interchangeably as the decoding proceeds. Our selective refinement framework strikes a balance between efficiency and optimality, and our extensive experimental results demonstrate the effectiveness of our approach.
Paperid:680
Authors:Junze Deng · Qinhang Wu · Peizhong Ju · Sen Lin · Yingbin LIANG · Ness Shroff
Abstract:
Rehearsal-based methods have shown superior performance in addressing catastrophic forgetting in continual learning (CL) by storing and training on a subset of past data alongside new data in the current task. While such a concurrent rehearsal strategy is widely used, it remains unclear if this approach is always optimal. Inspired by human learning, where sequentially revisiting tasks helps mitigate forgetting, we explore whether sequential rehearsal can offer greater benefits for CL compared to standard concurrent rehearsal. To address this question, we conduct a theoretical analysis of rehearsal-based CL in overparameterized linear models, comparing two strategies: 1) Concurrent Rehearsal, where past and new data are trained together, and 2) Sequential Rehearsal, where new data is trained first, followed by revisiting past data sequentially. By explicitly characterizing forgetting and generalization error, we show that sequential rehearsal performs better when tasks are less similar. These insights further motivate a novel Hybrid Rehearsal method, which trains similar tasks concurrently and revisits dissimilar tasks sequentially. We characterize its forgetting and generalization performance, and our experiments with deep neural networks further confirm that the hybrid approach outperforms standard concurrent rehearsal. This work provides the first comprehensive theoretical analysis of rehearsal-based CL.
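The two rehearsal strategies contrasted above can be sketched procedurally in a toy linear-regression setting. Everything below (model, optimizer, noiseless full-rank data) is an illustrative assumption; in particular, this underparameterized toy does not reproduce the overparameterized forgetting regime the paper analyzes, it only shows the order of operations.

```python
import numpy as np

def gd(w, X, y, lr=0.05, epochs=2000):
    """Full-batch gradient descent on squared loss for a linear model."""
    for _ in range(epochs):
        w = w - lr * X.T @ (X @ w - y) / len(y)
    return w

rng = np.random.default_rng(1)
d = 20
w_star = rng.normal(size=d)
X_old, X_new = rng.normal(size=(100, d)), rng.normal(size=(100, d))
y_old, y_new = X_old @ w_star, X_new @ w_star

# Concurrent rehearsal: stored past data and new data are trained together.
w_con = gd(np.zeros(d), np.vstack([X_old, X_new]), np.concatenate([y_old, y_new]))

# Sequential rehearsal: train on the new task first, then revisit the past data.
w_seq = gd(gd(np.zeros(d), X_new, y_new), X_old, y_old)

# Forgetting proxy: error on the old task after all training is done.
print(np.mean((X_old @ w_con - y_old) ** 2), np.mean((X_old @ w_seq - y_old) ** 2))
```

With noiseless, full-rank data both schedules recover the shared solution; the paper's analysis characterizes when they differ.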
Paperid:681
Authors:Zhixiong Zhuang · Hui-Po Wang · Maria-Irina Nicolae · Mario Fritz
Abstract:
Model stealing poses a significant security risk in machine learning by enabling attackers to replicate a black-box model without access to its training data, thus jeopardizing intellectual property and exposing sensitive information. Recent methods that use pre-trained diffusion models for data synthesis improve efficiency and performance but rely heavily on manually crafted prompts, limiting automation and scalability, especially for attackers with little expertise. To assess the risks posed by open-source pre-trained models, we propose a more realistic threat model that eliminates the need for prompt design skills or knowledge of class names. In this context, we introduce Stealix, the first approach to perform model stealing without predefined prompts. Stealix uses two open-source pre-trained models to infer the victim model’s data distribution, and iteratively refines prompts through a genetic algorithm, progressively improving the precision and diversity of synthetic images. Our experimental results demonstrate that Stealix significantly outperforms other methods, even those with access to class names or fine-grained prompts, while operating under the same query budget. These findings highlight the scalability of our approach and suggest that the risks posed by pre-trained generative models in model stealing may be greater than previously recognized.
Paperid:682
Authors:William English · Dominic Simon · Sumit Jha · Rickard Ewetz
Abstract:
Translating natural language (NL) into a formal language such as temporal logic (TL) is integral for human communication with robots and autonomous systems. State-of-the-art approaches decompose the task into an atomic proposition (AP) grounding phase and a translation phase. However, existing methods struggle with accurate grounding, the existence of co-references, and learning from limited data. In this paper, we propose a framework for NL to TL translation called Grammar Forced Translation (GraFT). The framework is based on the observation that previous work solves both the grounding and translation steps by letting a language model iteratively predict tokens from its full vocabulary. In contrast, GraFT reduces the complexity of both tasks by restricting the set of valid output tokens from the full vocabulary to only a handful in each step. The solution space reduction is obtained by exploiting the unique properties of each problem. We also provide a theoretical justification for why the solution space reduction leads to more efficient learning. We evaluate the effectiveness of GraFT using the CW, GLTL, and Navi benchmarks. Compared with state-of-the-art translation approaches, it can be observed that GraFT improves the end-to-end translation accuracy by 5.49% and out-of-domain translation accuracy by 14.06% on average.
Paperid:683
Authors:Ibrahim Alabdulmohsin · Andreas Steiner
Abstract:
Language exhibits a fractal structure in its information-theoretic complexity (i.e., bits per token), with self-similarity across scales and long-range dependence (LRD). In this work, we investigate whether large language models (LLMs) can replicate such fractal characteristics and identify conditions, such as temperature setting and prompting method, under which they may fail. Moreover, we find that the fractal parameters observed in natural language are contained within a narrow range, whereas those of LLMs' output vary widely, suggesting that fractal parameters might prove helpful in detecting a non-trivial portion of LLM-generated texts. Notably, these findings, and many others reported in this work, are robust to the choice of architecture, e.g., Gemini 1.0 Pro, Mistral-7B and Gemma-2B. We also release a dataset comprising over 240,000 articles generated by various LLMs (both pretrained and instruction-tuned) with different decoding temperatures and prompting methods, along with their corresponding human-generated texts. We hope that this work highlights the complex interplay between fractal properties, prompting, and statistical mimicry in LLMs, offering insights for generating, evaluating and detecting synthetic texts.
Paperid:684
Authors:Patara Trirat · Wonyong Jeong · Sung Ju Hwang
Abstract:
Automated machine learning (AutoML) accelerates AI development by automating tasks in the development pipeline, such as optimal model search and hyperparameter tuning. Existing AutoML systems often require technical expertise to set up complex tools, which is in general time-consuming and requires a large amount of human effort. Therefore, recent works have started exploiting large language models (LLM) to lessen such burden and increase the usability of AutoML frameworks via a natural language interface, allowing non-expert users to build their data-driven solutions. These methods, however, are usually designed only for a particular process in the AI development pipeline and do not efficiently use the inherent capacity of the LLMs. This paper proposes AutoML-Agent, a novel multi-agent framework tailored for full-pipeline AutoML, i.e., from data retrieval to model deployment. AutoML-Agent takes the user's task descriptions, facilitates collaboration between specialized LLM agents, and delivers deployment-ready models. Unlike existing work, instead of devising a single plan, we introduce a retrieval-augmented planning strategy to enhance exploration in the search for more optimal plans. We also decompose each plan into sub-tasks (e.g., data preprocessing and neural network design), each of which is solved by a specialized agent built via prompting and executed in parallel, making the search process more efficient. Moreover, we propose a multi-stage verification to verify executed results and guide the code generation LLM in implementing successful solutions. Extensive experiments on seven downstream tasks using fourteen datasets show that AutoML-Agent achieves a higher success rate in automating the full AutoML process, yielding systems with good performance across diverse domains.
Paperid:685
Authors:Dilfira Kudrat · Zongxia Xie · Yanru Sun · Tianyu Jia · Qinghua Hu
Abstract:
Time-series forecasting has gained significant attention in machine learning due to its crucial role in various domains. However, most existing forecasting models rely heavily on point-wise loss functions like Mean Square Error, which treat each time step independently and neglect the structural dependencies inherent in time series data, making it challenging to capture complex temporal patterns accurately. To address these challenges, we propose a novel Patch-wise Structural (PS) loss, designed to enhance structural alignment by comparing time series at the patch level. Through leveraging local statistical properties, such as correlation, variance, and mean, PS loss captures nuanced structural discrepancies overlooked by traditional point-wise losses. Furthermore, it integrates seamlessly with point-wise loss, simultaneously addressing local structural inconsistencies and individual time-step errors. PS loss establishes a novel benchmark for accurately modeling complex time series data and provides a new perspective on time series loss function design. Extensive experiments demonstrate that PS loss significantly improves the performance of state-of-the-art models across diverse real-world datasets. The data and code are publicly available at: \url{https://anonymous.4open.science/r/PS_Loss/}.
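A minimal sketch in the spirit of the abstract above: a patch-wise structural term comparing per-patch mean, variance, and correlation, added to a point-wise term. The patch length, weighting, and exact combination of terms here are assumptions for illustration, not the paper's formulation (the linked repository holds the actual code).

```python
import numpy as np

def ps_loss(pred, target, patch_len=8, alpha=1.0, eps=1e-8):
    """Illustrative patch-wise structural loss: point-wise MSE plus per-patch
    mean, variance, and correlation discrepancies between pred and target."""
    B, T = pred.shape
    n = T // patch_len  # split each series into non-overlapping patches
    p = pred[:, : n * patch_len].reshape(B, n, patch_len)
    t = target[:, : n * patch_len].reshape(B, n, patch_len)

    mean_term = np.abs(p.mean(-1) - t.mean(-1)).mean()
    var_term = np.abs(p.var(-1) - t.var(-1)).mean()
    # Per-patch Pearson-style correlation between prediction and target.
    pc = p - p.mean(-1, keepdims=True)
    tc = t - t.mean(-1, keepdims=True)
    corr = (pc * tc).sum(-1) / (np.linalg.norm(pc, axis=-1) * np.linalg.norm(tc, axis=-1) + eps)
    corr_term = (1.0 - corr).mean()

    pointwise = ((pred - target) ** 2).mean()
    return pointwise + alpha * (mean_term + var_term + corr_term)

rng = np.random.default_rng(0)
pred = rng.normal(size=(4, 96))
print(ps_loss(pred, pred))  # structural terms vanish for identical series
```

The structural terms penalize predictions that match point-wise on average but miss local level, scale, or shape, which a plain MSE cannot distinguish.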
Paperid:686
Authors:Yue Jiang · Yile Chen · Xiucheng Li · Qin Chao · SHUAI LIU · Gao Cong
Abstract:
Fundamental to time series forecasting methodologies is accurately modeling complex interdependencies and shared patterns within time series data. Recent advancements, such as Spatio-Temporal Graph Neural Networks (STGNNs) and Time Series Foundation Models, have shown promising results by effectively capturing intricate spatial and temporal dependencies across diverse real-world datasets. However, these models typically require substantial training data and often struggle when the available data is limited. In response, we propose a framework named Few-shot Spatio-Temporal Large Language Models (FSTLLM), aimed at enhancing model robustness and predictive performance in scenarios with limited training data. In particular, FSTLLM leverages the contextual knowledge embedded in Large Language Models (LLMs) and provides reasonable and accurate predictions. In addition, existing forecasting models can be seamlessly integrated into FSTLLM to improve their predictive capabilities. Experimental results on two real-world datasets demonstrate the adaptability and consistently superior performance of FSTLLM over major baseline models by a significant margin.
Paperid:687
Authors:Haozheng Luo · Chenghao Qiu · Maojiang Su · Zhihan Zhou · Jerry Yao-Chieh Hu · Zoe Mehta · Guo Ye · Han Liu
Abstract:
To address the challenge of scarce computational resources in genomic modeling, we introduce GERM, a genomic foundation model optimized for accessibility and adaptability. GERM improves upon models like DNABERT-2 by eliminating outliers that hinder low-rank adaptation and post-training quantization, enhancing both efficiency and robustness. We replace the vanilla attention layer with an outlier-free mechanism inspired by associative memory models. By removing outliers during both pre-training and fine-tuning, this approach accelerates adaptation, reduces computational costs, and enhances quantization robustness within acceptable loss margins. Additionally, we propose GERM-T, a strategy that employs small-step continual learning within the outlier-free framework, leveraging original checkpoints to avoid retraining from scratch. Empirically, GERM improves fine-tuning performance by 37.98% and quantization by 64.34% over the baseline model. It also reduces average kurtosis by 92.14% and maximum infinity norm by 82.77%. Compared to leading methods, GERM consistently delivers superior performance, offering a practical solution for genomic modeling in resource-constrained settings.
Paperid:688
Authors:Adrian Rodriguez-Munoz · Manel Baradad · Phillip Isola · Antonio Torralba
Abstract:
We train representation models with procedural data only, and apply them on visual similarity, classification, and semantic segmentation tasks without further training by using visual memory, an explicit database of reference image embeddings. Unlike prior work on visual memory, our approach achieves full compartmentalization with respect to all real-world images while retaining strong performance. Compared to a model trained on Places, our procedural model performs within 1\% on NIGHTS visual similarity, outperforms by 8\% and 15\% on CUB200 and Flowers102 fine-grained classification, and is within 10\% on ImageNet-1K classification. It also demonstrates strong zero-shot segmentation, achieving an $R^2$ on COCO within 10\% of the models trained on real data. Finally, we analyze procedural versus real data models, showing that parts of the same object have dissimilar representations in procedural models, resulting in incorrect searches in memory and explaining the remaining performance gap.
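The visual-memory mechanism described above amounts to retrieval from an explicit database of reference embeddings rather than a learned classifier head. A hedged nearest-neighbor sketch follows; the embeddings, similarity metric, neighbor count, and vote rule are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def classify_with_memory(query_emb, memory_embs, memory_labels, k=5):
    """Label a query embedding by majority vote over its k nearest neighbors
    (cosine similarity) in an explicit memory of reference embeddings."""
    q = query_emb / np.linalg.norm(query_emb)
    M = memory_embs / np.linalg.norm(memory_embs, axis=1, keepdims=True)
    sims = M @ q                       # cosine similarity to every memory entry
    top = np.argsort(sims)[-k:]        # indices of the k most similar entries
    return int(np.bincount(memory_labels[top]).argmax())

# Toy memory: two well-separated classes in a 16-dimensional embedding space.
rng = np.random.default_rng(0)
mem = np.vstack([rng.normal(0, 0.1, (50, 16)) + 1.0,
                 rng.normal(0, 0.1, (50, 16)) - 1.0])
labels = np.array([0] * 50 + [1] * 50)
print(classify_with_memory(np.ones(16), mem, labels))   # 0
print(classify_with_memory(-np.ones(16), mem, labels))  # 1
```

Because the memory is an explicit database, removing a reference image removes its influence entirely, which is the compartmentalization property the abstract emphasizes.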
Paperid:689
Authors:Bohan Lyu · Yadi Cao · Duncan Watson-Parris · Leon Bergen · Taylor Berg-Kirkpatrick · Rose Yu
Abstract:
Large Language Models (LLMs) demonstrate promising capabilities in solving simple scientific problems but, even with domain-specific fine-tuning, often produce hallucinations for complex ones. While integrating LLMs with tools can mitigate this reliability issue, models fine-tuned only on tool usage often over-rely on them, incurring unnecessary costs from resource-intensive scientific tools even for simpler problems. Inspired by how human experts assess the complexity of a problem before choosing a solution, we propose a novel two-component fine-tuning method, Adapting While Learning (AWL). In the first component, World Knowledge Learning (WKL), LLMs internalize scientific knowledge by learning from tool-generated solutions. In the second component, Tool Usage Adaptation (TUA), we classify questions as easy or hard based on the WKL-trained model’s accuracy, and train it to maintain direct reasoning for simple problems while switching to tools for challenging ones. We validate our method on 6 scientific benchmark datasets in climate science, epidemiology, and mathematics. Compared to the base 8B model, our trained models achieve 28.27\% higher answer accuracy and 13.76\% better tool usage accuracy, even surpassing state-of-the-art models including GPT-4o and Claude-3.5 on 4 custom-created datasets.
Paperid:690
Authors:Shangzhe Li · Zhiao Huang · Hao Su
Abstract:
Imitation learning (IL) enables agents to acquire skills directly from expert demonstrations, providing a compelling alternative to reinforcement learning. However, prior online IL approaches struggle with complex tasks characterized by high-dimensional inputs and complex dynamics. In this work, we propose a novel approach to online imitation learning that leverages reward-free world models. Our method learns environmental dynamics entirely in latent spaces without reconstruction, enabling efficient and accurate modeling. We adopt the inverse soft-Q learning objective, reformulating the optimization process in the Q-policy space to mitigate the instability associated with traditional optimization in the reward-policy space. By employing a learned latent dynamics model and planning for control, our approach consistently achieves stable, expert-level performance in tasks with high-dimensional observation or action spaces and intricate dynamics. We evaluate our method on a diverse set of benchmarks, including DMControl, MyoSuite, and ManiSkill2, demonstrating superior empirical performance compared to existing approaches.
Paperid:691
Authors:Ananth Raman · Vinod Raman
Abstract:
We continue to study the learning-theoretic foundations of generation by extending the results from Kleinberg and Mullainathan [2024] and Li et al. [2024] to account for noisy example streams. In the noiseless setting of Kleinberg and Mullainathan [2024] and Li et al. [2024], an adversary picks a hypothesis from a binary hypothesis class and provides a generator with a sequence of its positive examples. The goal of the generator is to eventually output new, unseen positive examples. In the noisy setting, an adversary still picks a hypothesis and a sequence of its positive examples. But, before presenting the stream to the generator, the adversary inserts a finite number of negative examples. Unaware of which examples are noisy, the goal of the generator is to still eventually output new, unseen positive examples. In this paper, we provide necessary and sufficient conditions for when a binary hypothesis class can be noisily generatable. We provide such conditions with respect to various constraints on the number of distinct examples that need to be seen before perfect generation of positive examples. Interestingly, for finite and countable classes we show that generatability is largely unaffected by the presence of a finite number of noisy examples.
Paperid:692
Authors:Weizhong Huang · Yuxin Zhang · Xiawu Zheng · Fei Chao · Rongrong Ji
Abstract:
In this paper, we address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) through a theoretical perspective. Specifically, we identify a critical issue of **"reconstruction error explosion"** in existing LLMs sparsification methods. This refers to the cumulative effect of reconstruction errors throughout the sparsification process, where errors from earlier layers propagate and amplify in subsequent layers. As a result, the overall reconstruction error increases significantly, leading to a substantial degradation in model performance. Through theoretical analysis, we derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue. Our method uses a monotonically increasing arithmetic progression, reducing the process of determining sparsity rates for multiple layers to the determination of a single common difference hyperparameter. Remarkably, this allows for the optimal layer-wise sparsity rates to be identified with just a few trials. Both our theoretical analysis and experimental results demonstrate that this sparsity allocation scheme is near optimal. Extensive experiments show that our method significantly improves the performance of sparse LLMs across various architectures, outperforming existing layer-wise sparsity methods. Furthermore, it enhances the performance of various compression techniques and is applicable to vision and multimodal models. Notably, our method achieves a reduction of 52.10 in perplexity for the 70% sparse LLaMA2-7B model obtained via Wanda, improves average zero-shot accuracy by 10.50%, and delivers speedups of 2.63$\times$ and 2.23$\times$ on CPU and GPU, respectively. Code is available in the supplementary material.
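The arithmetic-progression allocation this abstract describes reduces the whole layer-wise schedule to one common-difference hyperparameter. A minimal sketch of that idea (a hypothetical parameterization for illustration, not the paper's actual code):

```python
def layerwise_sparsity(num_layers, target_sparsity, common_diff):
    """Assign each layer a sparsity rate so the rates form a monotonically
    increasing arithmetic progression whose mean equals target_sparsity
    (earlier layers kept denser, mitigating error propagation)."""
    center = (num_layers - 1) / 2.0
    rates = [target_sparsity + common_diff * (i - center) for i in range(num_layers)]
    if not all(0.0 <= r <= 1.0 for r in rates):
        raise ValueError("common_diff too large for this target sparsity")
    return rates

rates = layerwise_sparsity(num_layers=8, target_sparsity=0.7, common_diff=0.02)
# the mean rate stays at the overall 70% target; later layers are sparser
```

Tuning then amounts to a few trials over `common_diff` alone, rather than a joint search over all per-layer rates.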
Paperid:693
Authors:Farzad Farhadzadeh · Debasmit Das · Shubhankar Borse · Fatih Porikli
Abstract:
We introduce ProLoRA, enabling zero-shot adaptation of parameter-efficient fine-tuning in text-to-image diffusion models. ProLoRA transfers pre-trained low-rank adjustments (e.g., LoRA) from a source to a target model without additional training data. This overcomes the limitations of traditional methods that require retraining when switching base models, often challenging due to data constraints. ProLoRA achieves this via projection of source adjustments into the target model's weight space, leveraging subspace and null space similarities and selectively targeting aligned layers. Evaluations on established text-to-image models demonstrate successful knowledge transfer and comparable performance without retraining.
Paperid:694
Authors:Zhongyang Li · Ziyue Li · Tianyi Zhou
Abstract:
In large multimodal models (LMMs), the perception of non-language modalities (e.g., visual representations) is usually not on par with the large language models (LLMs)' powerful reasoning capabilities, deterring LMMs' performance on challenging downstream tasks. This weakness has been recently mitigated by replacing the vision encoder with a mixture-of-experts (MoE), which provides rich, multi-granularity, and diverse representations required by different downstream tasks. The performance of multimodal MoE largely depends on its router, which reweights and mixes the representations of different experts for each input. However, we find that the end-to-end trained router does not always produce the optimal routing weights for every test sample. To bridge the gap, we propose a novel and efficient method ''Re-Routing in Test-Time (R2-T2)'' that locally optimizes the vector of routing weights in test-time by moving it toward those vectors of the correctly predicted samples in a neighborhood of the test sample. We propose three R2-T2 strategies with different optimization objectives and neighbor-search spaces. R2-T2 consistently and significantly improves state-of-the-art LMMs' performance on challenging multimodal benchmarks of diverse tasks, without training any parameters in the base model. Our code can be accessed here.
Paperid:695
Authors:Zhihai Wang · Jie Wang · Jilai Pan · Xilin Xia · Huiling Zhen · Mingxuan Yuan · Jianye Hao · Feng Wu
Abstract:
Tree-search-based reasoning methods have significantly enhanced the reasoning capability of large language models (LLMs) by facilitating the exploration of multiple intermediate reasoning steps, i.e., thoughts. However, these methods suffer from substantial inference latency, as they have to generate numerous reasoning thoughts, severely limiting LLM applicability. To address this challenge, we propose a novel Speculative Search (SpecSearch) framework that significantly accelerates LLM reasoning by optimizing thought generation. Specifically, SpecSearch utilizes a small model to strategically collaborate with a large model at both thought and token levels, efficiently generating high-quality reasoning thoughts. The major pillar of SpecSearch is a novel quality-preserving rejection mechanism, which effectively filters out thoughts whose quality falls below that of the large model's outputs. Moreover, we show that SpecSearch preserves comparable reasoning quality to the large model. Experiments on both the Qwen and Llama models demonstrate that SpecSearch significantly outperforms state-of-the-art approaches, achieving up to 2.12$\times$ speedup with comparable reasoning quality.
Paperid:696
Authors:Philipp Misof · Pan Kessel · Jan Gerken
Abstract:
Little is known about the training dynamics of equivariant neural networks, in particular how it compares to data augmented training of their non-equivariant counterparts. Recently, neural tangent kernels (NTKs) have emerged as a powerful tool to analytically study the training dynamics of wide neural networks. In this work, we take an important step towards a theoretical understanding of training dynamics of equivariant models by deriving neural tangent kernels for a broad class of equivariant architectures based on group convolutions. As a demonstration of the capabilities of our framework, we show an interesting relationship between data augmentation and group convolutional networks. Specifically, we prove that they share the same expected prediction at all training times and even off-manifold. In this sense, they have the same training dynamics. We demonstrate in numerical experiments that this still holds approximately for finite-width ensembles. By implementing equivariant NTKs for roto-translations in the plane ($G=C_{n}\ltimes\mathbb{R}^{2}$) and 3d rotations ($G=\mathrm{SO}(3)$), we show that equivariant NTKs outperform their non-equivariant counterparts as kernel predictors for histological image classification and quantum mechanical property prediction.
Paperid:697
Authors:Jiawei Tang · Yuheng Jia
Abstract:
Label distribution learning (LDL) is an effective method to predict the relative label description degree (a.k.a. label distribution) of a sample. However, the label distribution is not a complete representation of an instance because it overlooks the absolute intensity of each label. Specifically, it's impossible to obtain the total description degree of hidden labels that are not in the label space, which leads to the loss of information and confusion in instances. To solve the above problem, we come up with a new concept named background concentration to serve as the absolute description degree term of the label distribution and introduce it into the LDL process, forming the improved paradigm of concentration distribution learning. Moreover, we propose a novel model by probabilistic methods and neural networks to learn label distributions and background concentrations from existing LDL datasets. Extensive experiments prove that the proposed approach is able to extract background concentrations from label distributions while producing more accurate prediction results than the state-of-the-art LDL methods. The code is available in the supplementary material.
Paperid:698
Authors:Tingyu Zhu · Haoyu Liu · Ziyu Wang · Zhimin Jiang · Zeyu Zheng
Abstract:
Developing generative models to create or conditionally create symbolic music presents unique challenges due to the combination of limited data availability and the need for high precision in note pitch. To address these challenges, we introduce an efficient Fine-Grained Guidance (FGG) approach within diffusion models. FGG guides the diffusion models to generate music that aligns more closely with the control and intent of expert composers, which is critical to improve the accuracy, listenability, and quality of generated music. This approach empowers diffusion models to excel in advanced applications such as improvisation and interactive music creation. We derive theoretical characterizations for both the challenges in symbolic music generation and the effects of the FGG approach. We provide numerical experiments and subjective evaluation to demonstrate the effectiveness of our approach. We have published a demo page to showcase performances, as one of the first in the symbolic music literature's demo pages that enables real-time interactive generation.
Paperid:699
Authors:Xiaowei Chi · Hengyuan Zhang · Chun-Kai Fan · Xingqun Qi · Rongyu Zhang · Anthony Chen · Chi-Min Chan · Wei Xue · Qifeng Liu · Shanghang Zhang · Yike Guo
Abstract:
Video generation models have made significant progress in simulating future states, showcasing their potential as world simulators in embodied scenarios. However, existing models often lack robust understanding, limiting their ability to perform multi-step predictions or handle Out-of-Distribution (OOD) scenarios. To address this challenge, we propose the Reflection of Generation (RoG), a set of intermediate reasoning strategies designed to enhance video prediction. It leverages the complementary strengths of pre-trained vision-language and video generation models, enabling them to function as a world model in embodied scenarios. To support RoG, we introduce Embodied Video Anticipation Benchmark (EVA-Bench), a comprehensive benchmark that evaluates embodied world models across diverse tasks and scenarios, utilizing both in-domain and OOD datasets. Building on this foundation, we devise a world model, Embodied Video Anticipator (EVA), that follows a multi-stage training paradigm to generate high-fidelity video frames and apply an autoregressive strategy to enable adaptive generalization for longer video sequences. Extensive experiments demonstrate the efficacy of EVA in various downstream tasks like video generation and robotics, thereby paving the way for large-scale pre-trained models in real-world video prediction applications. The video demos are available at \hyperlink{https://sites.google.com/view/icml-eva}{https://sites.google.com/view/icml-eva}.
Paperid:700
Authors:Xinyue Zeng · Haohui Wang · Junhong Lin · Jun Wu · Tyler Cody · Dawei Zhou
Abstract:
The proliferation of open-sourced Large Language Models (LLMs) and diverse downstream tasks necessitates efficient model selection, given the impracticality of fine-tuning all candidates due to computational constraints. Despite the recent advances in LLM selection, a fundamental research question largely remains nascent: how can we model the dynamic behaviors of LLMs during fine-tuning, thereby enhancing our understanding of their generalization performance across diverse downstream tasks? In this work, we propose a novel theoretical framework that provides a proper lens to assess the generalization capabilities of LLMs, thereby enabling accurate and efficient LLM selection for downstream applications. In particular, we first derive a Hessian-based PAC-Bayes generalization bound that unveils fine-tuning dynamics of LLMs and then introduce LensLLM, a Neural Tangent Kernel (NTK)-based Rectified Scaling Model that enables accurate performance predictions across diverse tasks while maintaining computational efficiency. Extensive empirical results on 3 large-scale benchmarks demonstrate that our model achieves up to 91.1% accuracy and reduces up to 88.5% computational cost in LLM selection, outperforming 5 state-of-the-art methods. We open-source our proposed LensLLM model and corresponding results at this anonymous link: https://anonymous.4open.science/r/LENSLLM-3B1E/.
Paperid:701
Authors:Xuelin Shen · Jiayin Xu · Kangsheng Yin · Wenhan Yang
Abstract:
The rapid development of vision-language pretrained (VLP) models has raised significant concerns about user privacy and copyright, as these models can easily identify the semantic content of cloud-stored images or retrieve corresponding images through textual descriptions. Under this circumstance, this paper makes the first attempt to protect users' privacy and copyright from the perspective of image compression. Specifically, we propose a flexible coding paradigm capable of generating bitstreams with polymorphic representations. Under the default setting, the bitstream would be decoded into an encrypted version that maintains satisfactory perceptual quality but cannot be accessed by common VLP models (e.g., CLIP and BLIP). Meanwhile, with provided personalized global keywords as an additional condition, the proposed scheme can reconstruct a full version of the image identical to those decoded by common image codecs. This paper further incorporates an adaptive multi-objective optimization strategy to obtain improved encrypting performance and perceptual quality simultaneously within a single training process. The proposed scheme is plug-and-play and can be seamlessly integrated into most existing learned image compression (LIC) models. Extensive experiments demonstrate that the proposed scheme effectively misleads VLP-based applications, including text–image retrieval and image description, as well as common machine vision tasks such as image classification and object detection, while maintaining the same rate–distortion performance as baseline LIC models.
Paperid:702
Authors:Yaoyang Liu · Junlin Li · Yinjun Wu · Zhen Chen
Abstract:
Although Multi-Vector Retrieval (MVR) has achieved the state of the art on many information retrieval (IR) tasks, its performance highly depends on how to decompose queries into smaller pieces, say phrases or tokens. However, optimizing query decomposition for MVR performance is not end-to-end differentiable. Even worse, jointly solving this problem and training the downstream retrieval-based systems, say RAG systems, could be highly inefficient. To overcome these challenges, we propose Performance-Oriented Query Decomposer (POQD), a novel query decomposition framework for MVR. POQD leverages one LLM for query decomposition and searches the optimal prompt with an LLM-based optimizer. We further propose an end-to-end training algorithm to alternatively optimize the prompt for query decomposition and the downstream models. This algorithm can achieve superior MVR performance at a reasonable training cost as our theoretical analysis suggests. POQD can be integrated seamlessly into arbitrary retrieval-based systems such as Retrieval-Augmented Generation (RAG) systems. Extensive empirical studies on representative RAG-based QA tasks show that POQD outperforms existing query decomposition strategies in both retrieval performance and end-to-end QA accuracy.
Paperid:703
Authors:Yinghao Li · Rithesh Kumar · Zeyu Jin
Abstract:
Diffusion models have demonstrated significant potential in speech synthesis tasks, including text-to-speech (TTS) and voice cloning. However, their iterative denoising processes are computationally intensive, and previous distillation attempts have shown consistent quality degradation. Moreover, existing TTS approaches are limited by non-differentiable components or iterative sampling that prevent true end-to-end optimization with perceptual metrics. We introduce DMOSpeech, a distilled diffusion-based TTS model that uniquely achieves both faster inference and superior performance compared to its teacher model. By enabling direct gradient pathways to all model components, we demonstrate the first successful end-to-end optimization of differentiable metrics in TTS, incorporating Connectionist Temporal Classification (CTC) loss and Speaker Verification (SV) loss. Our comprehensive experiments, validated through extensive human evaluation, show significant improvements in naturalness, intelligibility, and speaker similarity while reducing inference time by orders of magnitude. This work establishes a new framework for aligning speech synthesis with human auditory preferences through direct metric optimization. The audio samples are available at https://dmospeech.github.io/demo
Paperid:704
Authors:Lahav Dabah · Tom Tirer
Abstract:
In many classification applications, the prediction of a deep neural network (DNN) based classifier needs to be accompanied by some confidence indication. Two popular approaches for that aim are: 1) Calibration: modifies the classifier's softmax values such that the maximal value better estimates the correctness probability; and 2) Conformal Prediction (CP): produces a prediction set of candidate labels that contains the true label with a user-specified probability, guaranteeing marginal coverage but not, e.g., per class coverage. In practice, both types of indications are desirable, yet, so far the interplay between them has not been investigated. Focusing on the ubiquitous Temperature Scaling (TS) calibration, we start this paper with an extensive empirical study of its effect on prominent CP methods. We show that while TS calibration improves the class-conditional coverage of adaptive CP methods, surprisingly, it negatively affects their prediction set sizes. Motivated by this behavior, we explore the effect of TS on CP beyond its calibration application and reveal an intriguing trend under which it allows to trade prediction set size and conditional coverage of adaptive CP methods. Then, we establish a mathematical theory that explains the entire non-monotonic trend. Finally, based on our experiments and theory, we offer simple guidelines for practitioners to effectively combine adaptive CP with calibration.
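The interaction this abstract studies can be seen in a toy example: raising the softmax temperature flattens the probabilities, which inflates an adaptive prediction set. The sketch below (an illustrative simplification, not the paper's code; the threshold `tau` and the greedy APS-style set construction are assumptions) shows this for a single example.

```python
import numpy as np

def softmax(logits, T):
    """Temperature-scaled softmax: higher T flattens the distribution."""
    z = logits / T
    z = z - z.max()  # for numerical stability
    p = np.exp(z)
    return p / p.sum()

def aps_set(probs, tau):
    """APS-style adaptive set: add labels in decreasing probability
    until the cumulative mass reaches the threshold tau."""
    order = np.argsort(-probs)
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, tau)) + 1
    return set(order[:k].tolist())

logits = np.array([3.0, 1.5, 0.5, -1.0])
for T in (0.5, 1.0, 2.0):  # flatter probabilities -> larger sets
    print(T, sorted(aps_set(softmax(logits, T), tau=0.9)))
```

This only illustrates the size side of the trade-off; the paper's contribution is the empirical study and the theory explaining the full non-monotonic trend across calibrated models.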
Paperid:705
Authors:Marina Mancoridis · Keyon Vafa · Bec Weeks · Sendhil Mullainathan
Abstract:
This paper documents an overlooked yet pervasive category of failure in large language models (LLMs): potemkin understanding. Potemkins are cases where a model’s misunderstanding of a concept does not align with the way humans would have misconceived it. To quantify this phenomenon, we develop a framework that measures potemkin understanding through the disconnect between a model’s ability to explain and use concepts. Through a series of experiments spanning diverse domains—literary techniques, game theory, and psychological biases—we demonstrate the ubiquity of potemkins. Despite achieving 97.7% accuracy in explaining concepts, LLMs display only 67.9% accuracy in using them, with potemkins consistently appearing across models, tasks, and domains. We develop self-coherence tests to show these failures represent not merely incorrect learning but a deeper absence of concept coherence, while also finding that existing benchmarks do not adequately test for potemkins.
Paperid:706
Authors:Youran Dong · Junfeng Yang · Wei Yao · Jin Zhang
Abstract:
Bilevel optimization is a powerful tool for many machine learning problems, such as hyperparameter optimization and meta-learning. Estimating hypergradients (also known as implicit gradients) is crucial for developing gradient-based methods for bilevel optimization. In this work, we propose a computationally efficient technique for incorporating curvature information into the approximation of hypergradients and present a novel algorithmic framework based on the resulting enhanced hypergradient computation. We provide convergence rate guarantees for the proposed framework in both deterministic and stochastic scenarios, particularly showing improved computational complexity over popular gradient-based methods in the deterministic setting. This improvement in complexity arises from a careful exploitation of the hypergradient structure and the inexact Newton method. In addition to the theoretical speedup, numerical experiments demonstrate the significant practical performance benefits of incorporating curvature information.
Paperid:707
Authors:Jiazhong Cen · Xudong Zhou · Jiemin Fang · Changsong Wen · Lingxi Xie · xiaopeng zhang · Wei Shen · Qi Tian
Abstract:
Recent advancements in 3D Gaussian Splatting (3DGS) enable high-quality 3D scene reconstruction from RGB images. Many studies extend this paradigm for language-driven open-vocabulary scene understanding. However, most of them simply project 2D semantic features onto 3D Gaussians and overlook a fundamental gap between 2D and 3D understanding: a 3D object may exhibit various semantics from different viewpoints—a phenomenon we term view-dependent semantics. To address this challenge, we propose LaGa (Language Gaussians), which establishes cross-view semantic connections by decomposing the 3D scene into objects. Then, it constructs view-aggregated semantic representations by clustering semantic descriptors and reweighting them based on multi-view semantics. Extensive experiments demonstrate that LaGa effectively captures key information from view-dependent semantics, enabling a more comprehensive understanding of 3D scenes. Notably, under the same settings, LaGa achieves a significant improvement of +18.7% mIoU over the previous SOTA on the LERF-OVS dataset. Our code will be released.
Paperid:708
Authors:Hanshi Sun · Li-Wen Chang · Wenlei Bao · Size Zheng · Ningxin Zheng · Xin Liu · Harry Dong · Yuejie Chi · Beidi Chen
Abstract:
With the widespread deployment of long-context large language models (LLMs), there has been a growing demand for efficient support of high-throughput inference. However, as the key-value (KV) cache expands with the sequence length, the increasing memory footprint and the need to access it for decoding both result in low throughput when serving long-context LLMs. While various dynamic sparse attention methods have been proposed to accelerate inference while maintaining generation quality, they either fail to sufficiently reduce GPU memory usage or introduce significant decoding latency by offloading the KV cache to the CPU. We present ShadowKV, a high-throughput long-context LLM inference system that stores the low-rank key cache and offloads the value cache to reduce the memory footprint for larger batch sizes and longer sequences. To minimize decoding latency, ShadowKV employs an accurate KV selection strategy that reconstructs minimal sparse KV pairs on-the-fly. By evaluating ShadowKV on benchmarks like RULER, LongBench, and models such as Llama-3.1-8B and GLM-4-9B-1M, we demonstrate that it achieves up to 6$\times$ larger batch sizes and 3.04$\times$ higher throughput on an A100 GPU without sacrificing accuracy, even surpassing the performance achievable with infinite batch size under the assumption of infinite GPU memory.
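The low-rank key-cache idea at the core of this abstract can be sketched in a few lines: store the key matrix as a truncated-SVD factorization and reconstruct it on demand. This toy example (illustrative only, not ShadowKV's implementation; the shapes and rank are made-up) shows the memory saving when the keys are well approximated at low rank.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, head_dim, rank = 1024, 128, 16

# Synthetic key matrix with exact low rank, standing in for one head's keys.
K = rng.standard_normal((seq_len, rank)) @ rng.standard_normal((rank, head_dim))

# Keep only the rank-r factors instead of the full (seq_len, head_dim) cache.
U, S, Vt = np.linalg.svd(K, full_matrices=False)
A = U[:, :rank] * S[:rank]   # (seq_len, rank) factor
B = Vt[:rank]                # (rank, head_dim) factor

K_hat = A @ B                # reconstructed keys, computed on demand
compression = K.size / (A.size + B.size)
print(f"max reconstruction error: {np.abs(K - K_hat).max():.2e}")
print(f"compression ratio: {compression:.2f}x")
```

Real key caches are only approximately low-rank, so the actual system pairs this with the sparse KV selection the abstract describes rather than reconstructing everything.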
Paperid:709
Authors:Yicheng Xiao · Lin Song · Rui Yang · Cheng Cheng · Yixiao Ge · Xiu Li · Ying Shan
Abstract:
Recent advances have highlighted the benefits of scaling language models to enhance performance across a wide range of NLP tasks. However, these approaches still face limitations in effectiveness and efficiency when applied to domain-specific tasks, particularly for small edge-side models. We propose the LoRA-Gen framework, which utilizes a large cloud-side model to generate LoRA parameters for edge-side models based on task descriptions. By employing the reparameterization technique, we merge the LoRA parameters into the edge-side model to achieve flexible specialization. Our method facilitates knowledge transfer between models while significantly improving the inference efficiency of the specialized model by reducing the input context length. Without specialized training, LoRA-Gen outperforms conventional LoRA fine-tuning, which achieves competitive accuracy and a 2.1x speedup with TinyLLaMA-1.1B in reasoning tasks. Besides, our method delivers a compression ratio of 10.1x with Gemma-2B on intelligent agent tasks.
Paperid:710
Authors:Jihoon Chung · Tyler Zhu · Max Gonzalez Saez-Diez · Juan Carlos Niebles · Honglu Zhou · Olga Russakovsky
Abstract:
Recent advances in vision backbones have yielded powerful and diverse visual and video encoders. Yet, current Video Large Language Models encode visual inputs using an encoder from a single backbone family, limiting the amount and type of visual information they can process. We propose MERV, a Multi-Encoder Video Representation, which utilizes multiple encoders for a comprehensive video representation. To optimize heterogeneous features from a broad spectrum of encoders and ensure efficient and coherent feature integration, MERV first aligns encoder features spatio-temporally, then projects them into a unified structure, and finally fuses them through cross-attention. Under fair comparison, MERV achieves up to 4.62% higher accuracy than its base model, while introducing minimal extra parameters and training faster than equivalent single-encoder methods after parallelizing visual processing. Qualitative analysis shows MERV successfully captures and integrates domain knowledge from each encoder, opening new possibilities for scaling enhanced video understanding.
Paperid:711
Authors:Dong Xiao · Guangyao Chen · Peixi Peng · Yangru Huang · Yifan Zhao · Yongxing Dai · Yonghong Tian
Abstract:
Anomaly detection is essential for the safety and reliability of autonomous driving systems. Current methods often focus on detection accuracy but neglect response time, which is critical in time-sensitive driving scenarios. In this paper, we introduce real-time anomaly detection for autonomous driving, prioritizing both minimal response time and high accuracy. We propose a novel multimodal asynchronous hybrid network that combines event streams from event cameras with image data from RGB cameras. Our network utilizes the high temporal resolution of event cameras through an asynchronous Graph Neural Network and integrates it with spatial features extracted by a CNN from RGB images. This combination effectively captures both the temporal dynamics and spatial details of the driving environment, enabling swift and precise anomaly detection. Extensive experiments on benchmark datasets show that our approach outperforms existing methods in both accuracy and response time, achieving millisecond-level real-time performance.
Paperid:712
Authors:Clément Lalanne · Franck Iutzeler · Loubes Jean-Michel · Julien Chhor
Abstract:
Estimating optimal transport maps between two distributions from respective samples is an important element for many machine learning methods. To do so, rather than extending discrete transport maps, it has been shown that estimating the Brenier potential of the transport problem and obtaining a transport map through its gradient is near minimax optimal for smooth problems. In this paper, we investigate the private estimation of such potentials and transport maps with respect to the distribution samples. We propose a differentially private transport map estimator with $L^2$ error at most $n^{-1} \vee n^{-\frac{2 \alpha}{2 \alpha - 2 + d}} \vee (n\epsilon)^{-\frac{2 \alpha}{2 \alpha + d}}$ up to polylog terms, where $n$ is the sample size, $\epsilon$ is the desired level of privacy, $\alpha$ is the smoothness of the true transport map, and $d$ is the dimension of the feature space. We also provide a lower bound for the problem.
Paperid:713
Authors:Tianren Zhang · Guanyu Chen · Feng Chen
Abstract:
Humans develop world models that capture the underlying generation process of data. Whether neural networks can learn similar world models remains an open problem. In this work, we provide the first theoretical results for this problem, showing that in a multitask setting, models with a low-degree bias provably recover latent data-generating variables under mild assumptions---even if proxy tasks involve complex, non-linear functions of the latents. However, such recovery is also sensitive to model architecture. Our analysis leverages Boolean models of task solutions via the Fourier-Walsh transform and introduces new techniques for analyzing invertible Boolean transforms, which may be of independent interest. We illustrate the algorithmic implications of our results and connect them to related research areas, including self-supervised learning, out-of-distribution generalization, and the linear representation hypothesis in large language models.
Paperid:714
Authors:shuai wang · Zexian Li · Qipeng zhang · Tianhui Song · Xubin Li · Tiezheng Ge · Bo Zheng · Limin Wang
Abstract:
Diffusion models have demonstrated remarkable generation quality but at the cost of numerous function evaluations. Recently, advanced ODE-based solvers have been developed to mitigate the substantial computational demands of reverse-diffusion solving under limited sampling steps. However, these solvers, heavily inspired by Adams-like multistep methods, rely solely on t-related Lagrange interpolation. We show that t-related Lagrange interpolation is suboptimal for diffusion models and reveal a compact search space comprised of time steps and solver coefficients. Building on our analysis, we propose a novel differentiable solver search algorithm to identify a more optimal solver. Equipped with the searched solver, rectified-flow models, e.g., SiT-XL/2 and FlowDCN-XL/2, achieve FID scores of 2.40 and 2.35, respectively, on ImageNet-$256\times256$ with only 10 steps. Meanwhile, DDPM model, DiT-XL/2, reaches a FID score of 2.33 with only 10 steps. Notably, our searched solver outperforms traditional solvers by a significant margin. Moreover, our searched solver demonstrates generality across various model architectures, resolutions, and model sizes.
Paperid:715
Authors:Jacopo Talpini · Marco Savi · Giovanni Neglia
Abstract:
One-Shot Federated Learning (FL) is a recent paradigm that enables multiple clients to cooperatively learn a global model in a single round of communication with a central server. In this paper, we analyze the One-Shot FL problem through the lens of Bayesian inference and propose FedBEns, an algorithm that leverages the inherent multimodality of local loss functions to find better global models. Our algorithm leverages a mixture of Laplace approximations for the clients' local posteriors, which the server then aggregates to infer the global model. We conduct extensive experiments on various datasets, demonstrating that the proposed method outperforms competing baselines that typically rely on unimodal approximations of the local losses.
Paperid:716
Authors:Mirco Mutti · Jeongyeol Kwon · Aviv Tamar · Shie Mannor
Abstract:
Contextual multi-armed bandits are a popular choice to model sequential decision-making. *E.g.*, in a healthcare application we may perform various tests to assess a patient's condition (exploration) and then decide on the best treatment to give (exploitation). When humans design strategies, they aim for the exploration to be *fast*, since the patient's health is at stake, and easy to *interpret* for a physician overseeing the process. However, common bandit algorithms are nothing like that: The regret caused by exploration scales with $\sqrt{H}$ over $H$ rounds and decision strategies are based on opaque statistical considerations. In this paper, we use an original *classification view* to meta learn interpretable and fast exploration plans for a fixed collection of bandits $\mathbb{M}$. The plan is prescribed by an interpretable *decision tree* probing decisions' payoff to classify the test bandit. The test regret of the plan in the *stochastic* and *contextual* setting scales with $O (\lambda^{-2} C_{\lambda} (\mathbb{M}) \log^2 (MH))$, where $M$ is the size of $\mathbb{M}$, $\lambda$ is a separation parameter over the bandits, and $C_\lambda (\mathbb{M})$ is a novel *classification-coefficient* that fundamentally links meta learning bandits with classification. Through a nearly matching lower bound, we show that $C_\lambda (\mathbb{M})$ inherently captures the complexity of the setting.
Paperid:717
Authors:Tianyuan Zou · Yang Liu · Peng Li · Yufei Xiong · Jianqing Zhang · Jingjing Liu · Xiaozhou Ye · Ye Ouyang · Ya-Qin Zhang
Abstract:
Substantial quantity and high quality are the golden rules of making a good training dataset, with sample privacy protection equally important. Generating synthetic samples that resemble high-quality private data while ensuring Differential Privacy (DP), a formal privacy guarantee, promises scalability and practicality. However, existing methods relying on pre-trained models for data synthesis often struggle in data-deficient scenarios, suffering from limited sample size, inevitable generation noise and existing pre-trained model bias. To address these challenges, we propose a novel contrAstive private data Synthesis via Weighted multiple Pre-trained generative models framework, named as WASP. WASP utilizes limited private samples for more accurate private data distribution estimation via a Top-Q voting mechanism, and leverages low-quality synthetic samples for contrastive generation via collaboration among dynamically weighted multiple pre-trained models. Extensive experiments on 6 well-developed datasets with 6 open-source and 3 closed-source PLMs demonstrate the superiority of WASP in improving model performance over diverse downstream tasks. Code is available at https://anonymous.4open.science/r/WASP.
Paperid:718
Authors:Santiago Cortes-Gomez · Naveen Raman · Aarti Singh · Bryan Wilder
Abstract:
Randomized controlled trials (RCTs) generate guarantees for treatment effects. However, RCTs often spend unnecessary resources exploring suboptimal treatments, which can reduce the power of treatment guarantees. To address this, we propose a two-stage RCT design. In the first stage, a data-driven screening procedure prunes low-impact treatments, while the second stage focuses on developing high-probability lower bounds for the best-performing treatment effect. Unlike existing adaptive RCT frameworks, our method is simple enough to be implemented in scenarios with limited adaptivity. We derive optimal designs for two-stage RCTs and demonstrate how such designs can be implemented through sample splitting. Empirically, we demonstrate that two-stage designs improve upon single-stage approaches, especially for scenarios where domain knowledge is available through a prior. Our work is thus a simple yet effective design for RCTs, optimizing for the ability to certify with high probability the largest possible treatment effect for at least one of the arms studied.
Paperid:719
Authors:Alan Amin · Andres Potapczynski · Andrew Wilson
Abstract:
To predict and understand the causes of disease, geneticists build models that predict how a genetic variant impacts phenotype from genomic features. There is a vast amount of data available from the large projects that have sequenced hundreds of thousands of genomes; yet, state-of-the-art models, like LD score regression, cannot leverage this data, as they rely on simplifying assumptions that limit their flexibility. These assumptions let them avoid solving the large linear algebra problems introduced by the genomic correlation matrices. In this paper, we leverage modern fast linear algebra techniques to develop WASP (genome Wide Association Studies with Preconditioned iteration), a method to train large and flexible neural network models. On semi-synthetic and real data we show that WASP better predicts phenotype and better recovers its functional causes compared to LD score regression. Finally, we show that training larger WASP models on larger data leads to better explanations of phenotypes.
Paperid:720
Authors:Zihan Zhang · Yuxin Chen · Jason Lee · Simon Du · Ruosong Wang
Abstract:
In this work, we study reinforcement learning (RL) with trajectory feedback. Compared to the standard RL setting, in RL with trajectory feedback, the agent only observes the accumulative reward along the trajectory, and therefore, this model is particularly suitable for scenarios where querying the reward in each single step incurs prohibitive cost. For a finite-horizon Markov Decision Process (MDP) with $S$ states, $A$ actions and a horizon length of $H$, we develop an algorithm that enjoys an asymptotically nearly optimal regret of $\tilde{O}\left(\sqrt{SAH^3K}\right)$ in $K$ episodes. To achieve this result, our new technical ingredients include (i) constructing a tighter confidence region for the reward function by incorporating the RL with trajectory feedback setting with techniques in linear bandits and (ii) constructing a reference transition model to better guide the exploration process.
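The key observation behind the trajectory-feedback setting is that an episode's return is linear in the per-state rewards, so reward estimation reduces to a linear regression over visit counts. The following is a minimal sketch of that linear view only (the function name and setup are illustrative, not the paper's algorithm, which additionally builds confidence regions and a reference transition model):

```python
import numpy as np

def estimate_rewards(trajectories, returns, n_states):
    # Each episode reveals only its total return, which is linear in the
    # per-state rewards: return_i = sum_s visits_i(s) * r(s).
    # Least squares over visit-count features therefore recovers r
    # once the visit matrix has full column rank.
    X = np.zeros((len(trajectories), n_states))
    for i, traj in enumerate(trajectories):
        for s in traj:
            X[i, s] += 1.0
    r_hat, *_ = np.linalg.lstsq(X, np.asarray(returns, dtype=float), rcond=None)
    return r_hat
```

This is exactly the structure that lets linear-bandit techniques apply: the visit-count vector plays the role of the action feature, and the unknown reward vector is the linear parameter.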
Paperid:721
Authors:Chao Gao · Liren Shan · Vaidehi Srinivas · Aravindan Vijayaraghavan
Abstract:
Conformal Prediction is a widely studied technique to construct prediction sets of future observations. Most conformal prediction methods focus on achieving the necessary coverage guarantees, but do not provide formal guarantees on the size (volume) of the prediction sets. We first prove the impossibility of volume optimality where any distribution-free method can only find a trivial solution. We then introduce a new notion of volume optimality by restricting the prediction sets to belong to a set family (of finite VC-dimension), specifically a union of $k$-intervals. Our main contribution is an efficient distribution-free algorithm based on dynamic programming (DP) to find a union of $k$-intervals that is guaranteed for any distribution to have near-optimal volume among all unions of $k$-intervals satisfying the desired coverage property. By adopting the framework of distributional conformal prediction (Chernozhukov et al., 2021), the new DP-based conformity score can also be applied to achieve approximate conditional coverage and conditional restricted volume optimality, as long as a reasonable estimator of the conditional CDF is available. While the theoretical results already establish volume-optimality guarantees, they are complemented by experiments that demonstrate that our method can significantly outperform existing methods in many settings.
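The combinatorial core of the approach can be illustrated with a simplified dynamic program (a sketch under assumptions, not the paper's algorithm): given sorted calibration points, find $k$ intervals with endpoints at data points whose union covers at least $m$ points with minimal total length.

```python
def min_length_k_intervals(points, k, m):
    # Minimal total length of a union of at most k intervals (endpoints at
    # data points) covering at least m of the n points.
    # dp[i][j][c]: considering the first i sorted points, using j intervals
    # and covering c points (capped at m), the minimal total length.
    xs = sorted(points)
    n = len(xs)
    INF = float("inf")
    dp = [[[INF] * (m + 1) for _ in range(k + 1)] for _ in range(n + 1)]
    dp[0][0][0] = 0.0
    for i in range(n):
        for j in range(k + 1):
            for c in range(m + 1):
                cur = dp[i][j][c]
                if cur == INF:
                    continue
                # Option 1: leave point i uncovered.
                dp[i + 1][j][c] = min(dp[i + 1][j][c], cur)
                # Option 2: open an interval at point i, closing at point e.
                if j < k:
                    for e in range(i, n):
                        cc = min(m, c + e - i + 1)
                        cand = cur + (xs[e] - xs[i])
                        if cand < dp[e + 1][j + 1][cc]:
                            dp[e + 1][j + 1][cc] = cand
    return min(dp[n][j][m] for j in range(k + 1))
```

In a conformal setting, $m$ would be chosen as roughly $\lceil (1-\alpha)(n+1) \rceil$ so that the returned union of intervals achieves the desired coverage; the paper's contribution is turning this idea into a distribution-free, near-volume-optimal conformity score.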
Paperid:722
Authors:Negin Golrezaei · Sourav Sahoo
Abstract:
Repeated uniform price multi-unit auctions are widely prevalent in emissions permit auctions, Treasury auctions, and energy markets, where identical goods are sold at the same per-unit price. We study the bidding problem from the perspective of a single *value-maximizing* buyer who aims to maximize their cumulative value over $T$ rounds while adhering to a return-on-investment (RoI) constraint in each round. Buyers adopt the $m$-*uniform bidding* format, where they submit $m$ bid-quantity pairs $(b_i, q_i)$ to demand $q_i$ units at bid $b_i$. We introduce the notion of *safe* bidding strategies that ensure RoI constraints are satisfied in each auction regardless of the competing bids. Then, we characterize a finite class of safe strategies, determined solely by the bidder’s valuation curve. While the number of strategies in this class is exponential in $m$, we develop a polynomial-time algorithm to learn the optimal safe strategy that achieves sublinear regret in the online setting, where regret is measured against a clairvoyant benchmark that knows the competing bids *a priori* and selects the fixed hindsight optimal safe strategy. We then evaluate the performance of safe strategies against a clairvoyant that selects the hindsight optimal strategy from a richer class of strategies in the online setting. In such scenarios, we compute the optimal *richness ratio*, $\alpha\in(0, 1]$, for the class of strategies chosen by the clairvoyant and show that our algorithm, designed to learn safe strategies, achieves $\alpha$-approximate sublinear regret against these stronger benchmarks. Experiments on semi-synthetic data from real-world auctions show that safe strategies substantially outperform the derived theoretical bounds. The strong empirical performance of safe bidding strategies, combined with their intuitive characterization, makes them quite appealing in practice.
Paperid:723
Authors:Long Zhao · Sanghyun Woo · Ziyu Wan · Yandong li · Han Zhang · Boqing Gong · Hartwig Adam · Xuhui Jia · Ting Liu
Abstract:
In generative modeling, tokenization simplifies complex data into compact, structured representations, creating a more efficient, learnable space. For high-dimensional visual data, it reduces redundancy and emphasizes key features for high-quality generation. Current visual tokenization methods rely on a traditional autoencoder framework, where the encoder compresses data into latent representations, and the decoder reconstructs the original input. In this work, we offer a new perspective by proposing *denoising as decoding*, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image, guided by the latents provided by the encoder. We evaluate our approach by assessing both reconstruction (rFID) and generation quality (FID), comparing it to state-of-the-art autoencoding approaches. By adopting iterative reconstruction through diffusion, our autoencoder, namely $\epsilon$-VAE, achieves high reconstruction quality, which in turn enhances downstream generation quality by 22% and provides 2.3$\times$ inference speedup. We hope this work offers new insights into integrating iterative generation and autoencoding for improved compression and generation.
Paperid:724
Authors:Qiang Chen · Zhongze Wu · Xiu Su · Xi Lin · Zhe Qu · Shan You · Shuo Yang · Chang Xu
Abstract:
Group fairness based on adversarial training has gained significant attention on graph data, typically implemented by masking sensitive attributes to generate fair feature views. However, existing models suffer from training instability due to uncertainty of the generated masks and the trade-off between fairness and utility. In this work, we propose a stable fair Graph Neural Network (SFG) to maintain training stability while preserving accuracy and fairness performance. Specifically, we first theoretically derive a tight upper Lipschitz bound to control the stability of existing adversarial-based models and employ a stochastic projected subgradient algorithm to constrain the bound, which operates in a block-coordinate manner. Additionally, we construct an uncertainty set to train the model, which can prevent unstable training by dropping some overfitting nodes caused by chasing fairness. Extensive experiments conducted on three real-world datasets demonstrate that SFG is stable and outperforms other state-of-the-art methods in terms of both fairness and utility performance. Codes are available at \href{https://anonymous.4open.science/r/SFG-559B}{https://anonymous.4open.science/r/SFG-559B}.
Paperid:725
Authors:Jan Blechschmidt · Tom-Christian Riemer · Martin STOLL · Max Winkler · Jan-Frederik Pietschmann
Abstract:
We develop a novel physics-informed deep learning approach for solving nonlinear drift-diffusion equations on metric graphs. These models represent an important model class with a large number of applications in areas ranging from transport in biological cells to the motion of human crowds. While traditional numerical schemes require a large amount of tailoring, especially in the case of model design or parameter identification problems, physics-informed deep operator networks (DeepONets) have emerged as a versatile tool for the solution of partial differential equations, with the particular advantage that they easily incorporate parameter identification questions. We here present an approach where we first learn three DeepONet models for representative inflow, inner and outflow edges, respectively, and then couple these models for the solution of the drift-diffusion metric graph problem by relying on an edge-based domain decomposition approach. We illustrate that our framework is applicable for the accurate evaluation of graph-coupled physics models and is well suited for solving optimization or inverse problems on these coupled networks.
Paperid:726
Authors:Yuzheng Hu · Fan Wu · Ruicheng Xian · Yuhang Liu · Lydia Zakynthinou · Pritish Kamath · Chiyuan Zhang · David Forsyth
Abstract:
We propose the notion of empirical privacy variance and study it in the context of differentially private fine-tuning of language models. Specifically, we show that models calibrated to the same $(\varepsilon, \delta)$-DP guarantee using DP-SGD with different hyperparameter configurations can exhibit significant variations in empirical privacy, which we quantify through the lens of memorization. We investigate the generality of this phenomenon across multiple dimensions and discuss why it is surprising and relevant. Through regression analysis, we examine how individual and composite hyperparameters influence empirical privacy. The results reveal a no-free-lunch trade-off: existing practices of hyperparameter tuning in DP-SGD, which focus on optimizing utility under a fixed privacy budget, often come at the expense of empirical privacy. To address this, we propose refined heuristics for hyperparameter selection that explicitly account for empirical privacy, showing that they are both precise and practically useful. Finally, we take preliminary steps to understand empirical privacy variance. We propose two hypotheses, identify limitations in existing techniques like privacy auditing, and outline open questions for future research.
Paperid:727
Authors:Jian Bi · Qianliang Wu · Xiang Li · Shuo Chen · Jianjun Qian · lei luo · Jian Yang
Abstract:
Data augmentation has been widely used in machine learning. Its main goal is to transform and expand the original data using various techniques, creating a more diverse and enriched training dataset. However, due to the disorder and irregularity of point clouds, existing methods struggle to enrich geometric diversity and maintain topological consistency, leading to imprecise point cloud understanding. In this paper, we propose SinPoint, a novel method designed to preserve the topological structure of the original point cloud through a homeomorphism. It utilizes the Sine function to generate smooth displacements. This simulates object deformations, thereby producing a rich diversity of samples. In addition, we propose a Markov chain Augmentation Process to further expand the data distribution by combining different basic transformations through a random process. Our extensive experiments demonstrate that our method consistently outperforms existing Mixup and Deformation methods on various benchmark point cloud datasets, improving performance for shape classification and part segmentation tasks. Specifically, when used with PointNet++ and DGCNN, our method achieves a state-of-the-art accuracy of 90.2 in shape classification with the real-world ScanObjectNN dataset.
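The idea of sine-based smooth displacement can be sketched as follows. This is a hypothetical minimal version (function name, frequency sampling, and parameter ranges are our illustrative assumptions, not the paper's exact formulation): each point is displaced by a smooth sinusoidal field of its own coordinates, so that for small amplitudes nearby points move together and the deformation behaves like a homeomorphism in practice.

```python
import numpy as np

def sin_displace(points, amplitude=0.05, freq_range=(0.5, 2.0), seed=None):
    # points: (n, d) array. Displace each point along every axis by a
    # sine of a random linear combination of its coordinates. The field
    # is smooth, so small amplitudes deform the shape without tearing it.
    rng = np.random.default_rng(seed)
    n, d = points.shape
    freqs = rng.uniform(*freq_range, size=(d, d))   # one frequency row per output axis
    phases = rng.uniform(0.0, 2.0 * np.pi, size=d)  # per-axis phase offsets
    disp = amplitude * np.sin(points @ freqs.T + phases)
    return points + disp
```

Each call with a fresh seed yields a different smooth deformation of the same shape, which is how such a transform enriches geometric diversity while bounding every point's displacement by the amplitude.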
Paperid:728
Authors:Yang Zhang · Er Jin · Yanfei Dong · Yixuan Wu · Phil Torr · Ashkan Khakzar · Johannes Stegmaier · Kenji Kawaguchi
Abstract:
Recent advances in generative models have demonstrated remarkable capabilities in producing high-quality images, but their reliance on large-scale unlabeled data has raised significant safety and copyright concerns. Efforts to address these issues by erasing unwanted concepts have shown promise. However, many existing erasure methods involve excessive modifications that compromise the overall utility of the model. In this work, we address these issues by formulating a novel minimalist concept erasure objective based only on the distributional distance of final generation outputs. Building on our formulation, we derive a tractable loss for differentiable optimization that leverages backpropagation through all generation steps in an end-to-end manner. We also conduct extensive analysis to show theoretical connections with other models and methods. To improve the robustness of the erasure, we incorporate neuron masking as an alternative to model fine-tuning. Empirical evaluations on state-of-the-art flow-matching models demonstrate that our method robustly erases concepts without degrading overall model performance, paving the way for safer and more responsible generative models.
Paperid:729
Authors:Xihang Yue · Yi Yang · Linchao Zhu
Abstract:
Recent advances in operator learning have produced two distinct approaches for solving partial differential equations (PDEs): attention-based methods offering point-level adaptability but lacking spectral constraints, and spectral-based methods providing domain-level continuity priors but limited in local flexibility. This dichotomy has hindered the development of PDE solvers with both strong flexibility and generalization capability. This work introduces Holistic Physics Mixer (HPM), a novel framework that bridges this gap by integrating spectral and physical information in a unified space. HPM unifies both approaches as special cases while enabling more powerful spectral-physical interactions beyond either method alone. This enables HPM to inherit both the strong generalization of spectral methods and the flexibility of attention mechanisms while avoiding their respective limitations. Through extensive experiments across diverse PDE problems, we demonstrate that HPM consistently outperforms state-of-the-art methods in both accuracy and computational efficiency, while maintaining strong generalization capabilities with limited training data and excellent zero-shot performance on unseen resolutions.
Paperid:730
Authors:David Pfau · Ian Davies · Diana Borsa · João Madeira Araujo · Brendan Tracey · Hado van Hasselt
Abstract:
We introduce Wasserstein Policy Optimization (WPO), an actor-critic algorithm for reinforcement learning in continuous action spaces. WPO can be derived as an approximation to Wasserstein gradient flow over the space of all policies projected into a finite-dimensional parameter space (e.g., the weights of a neural network), leading to a simple and completely general closed-form update. The resulting algorithm combines many properties of deterministic and classic policy gradient methods. Like deterministic policy gradients, it exploits knowledge of the gradient of the action-value function with respect to the action. Like classic policy gradients, it can be applied to stochastic policies with arbitrary distributions over actions -- without using the reparameterization trick. We show results on the DeepMind Control Suite and a magnetic confinement fusion task which compare favorably with state-of-the-art continuous control methods.
Paperid:731
Authors:Junsu Kim · Jaeyeon Kim · Ernest Ryu
Abstract:
Low-rank adaptation (LoRA) has become a standard approach for fine-tuning large foundation models. However, our theoretical understanding of LoRA remains limited as prior analyses of LoRA's training dynamics either rely on linearization arguments or consider highly simplified setups. In this work, we analyze the LoRA loss landscape without such restrictive assumptions. We define two regimes: a "special regime", which includes idealized setups where linearization arguments hold, and a "generic regime" representing more realistic setups where linearization arguments do not hold. In the generic regime, we show that LoRA training converges to a global minimizer with low rank and small magnitude, or a qualitatively distinct solution with high rank and large magnitude. Finally, we argue that the zero-initialization and weight decay in LoRA training induce an implicit bias toward the low-rank, small-magnitude region of the parameter space—where global minima lie—thus shedding light on why LoRA training usually succeeds in finding global minima.
Paperid:732
Authors:Yongxin Yang · Wenqi Chen · Chao Shu · Timothy Hospedales
Abstract:
We propose HyperIV, a novel approach for real-time implied volatility smoothing that eliminates the need for traditional calibration procedures. Our method employs a hypernetwork to generate parameters for a compact neural network that constructs complete volatility surfaces within 2 milliseconds, using only 9 market observations. Moreover, the generated surfaces guarantee no static arbitrage. Extensive experiments across 8 index options demonstrate that HyperIV achieves superior accuracy compared to existing methods while maintaining computational efficiency. The model also exhibits strong cross-asset generalization capabilities, indicating broader applicability across different market instruments. These key features -- rapid adaptation to market conditions, guaranteed absence of arbitrage, and minimal data requirements -- make HyperIV particularly valuable for real-time trading applications. We make code available at https://anonymous.4open.science/r/3c2f82ff7e26/.
Paperid:733
Authors:Yuhan Helena Liu · Guangyu Robert Yang · Christopher Cueva
Abstract:
In the quest to uncover how the brain's learning capabilities arise from its biological ingredients, investigating biologically plausible learning rules offers a promising avenue. These rules, often approximate gradient descent learning to respect biological constraints such as locality, must meet two critical criteria to be considered an appropriate brain model: (1) good neuroscience task performance and (2) alignment with neural recordings. While extensive research has assessed the first criterion, the second remains underexamined. Employing methods such as Procrustes analysis on well-known neuroscience datasets, this study demonstrates the existence of a biologically plausible learning rule — namely e-prop, which is based on gradient truncation and has demonstrated versatility across a wide range of tasks — that can achieve neural data similarity comparable to Backpropagation Through Time (BPTT) when matched for task accuracy. Our findings also reveal that model architecture and initial conditions can play a more significant role in determining neural similarity than the specific learning rule. Furthermore, we observe that BPTT-trained models and their biologically plausible counterparts exhibit similar dynamical properties at comparable accuracies; to support this further, we provide theoretical results demonstrating that e-prop can converge to the same solution as BPTT, depending on the initialization. These results underscore the substantial progress made in developing biologically plausible learning rules, highlighting their potential to achieve both competitive task performance and neural data similarity.
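Procrustes analysis of the kind mentioned above compares two population-activity matrices up to an orthogonal change of basis. The sketch below shows one standard formulation under our own assumptions (centered, norm-scaled matrices; residual after the best orthogonal alignment), not necessarily the study's exact pipeline:

```python
import numpy as np

def procrustes_distance(X, Y):
    # X, Y: condition-by-unit activity matrices of the same shape.
    # Center and scale each to unit Frobenius norm, then measure the
    # residual of min_R ||X - Y R||_F over orthogonal R, which equals
    # sqrt(2 - 2 * sum of singular values of Y^T X).
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    X = X - X.mean(axis=0)
    X = X / np.linalg.norm(X)
    Y = Y - Y.mean(axis=0)
    Y = Y / np.linalg.norm(Y)
    s = np.linalg.svd(Y.T @ X, compute_uv=False)
    return float(np.sqrt(max(0.0, 2.0 - 2.0 * s.sum())))
```

A distance near zero means the two representations are identical up to rotation and reflection, which is why such metrics are stricter than regression-based similarity scores when comparing models to neural recordings.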
Paperid:734
Authors:Vandermeulen · Wai Ming Tai · Bryon Aragam
Abstract:
We show that deep neural networks can achieve dimension-independent rates of convergence for learning structured densities typical of image, audio, video, and text data. Concretely, for image density estimation, where each pixel becomes independent of the rest of the image when conditioned on pixels at most $t$ steps away, a simple $L^2$-minimizing neural network can attain a rate of $n^{-1/((t+1)^2+4)}$, where $d$ is the total number of dimensions, i.e., the full image dimensionality given by width times height. We provide empirical evidence that, in real-world applications, $t$ is often a small constant, thus effectively circumventing the curse of dimensionality. Moreover, for sequential data (e.g., audio or text) exhibiting a similar local dependence structure, our analysis shows a rate of $n^{-1/(t+5)}$, offering further evidence of dimension independence in practical scenarios.
Paperid:735
Authors:Yufei Gu · Qian-Yuan Tang · Yunfeng Cai · Mingming Sun · Ping Li · zhou Xun · Zeke Xie
Abstract:
It is well-known that the Hessian of the deep loss landscape matters to optimization and generalization of deep learning. Previous studies reported a rough Hessian structure in deep learning, which consists of two components, a small number of large eigenvalues and a large number of nearly-zero eigenvalues. To the best of our knowledge, we are the first to report that a simple but overlooked power-law Hessian structure exists in well-trained deep neural networks, including Convolutional Neural Networks (CNNs) and Large Language Models (LLMs). Moreover, we provide a maximum-entropy theoretical interpretation for the power-law Hessian structure and theoretically demonstrate the existence of robust and low-dimensional subspace of deep neural networks. Our extensive experiments using the proposed power-law spectral method demonstrate that the power-law Hessian spectra critically relate to multiple important behaviors of deep learning, including optimization, generalization, and overparameterization. Notably, we discover that the power-law Hessian structure of a given LLM can effectively predict generalization during training, while conventional sharpness-based generalization measures that often work well on CNNs become nearly useless as generalization predictors for LLMs.
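A power-law spectrum means the sorted eigenvalues decay roughly as $\lambda_k \approx C\, k^{-s}$. One simple, illustrative way to estimate the exponent $s$ (our own sketch; the paper's spectral method may differ) is a least-squares fit in log-log space over the top eigenvalues:

```python
import numpy as np

def powerlaw_exponent(eigvals, top=50):
    # Fit lambda_k ≈ C * k^(-s) to the top eigenvalues: in log-log space
    # this is a line log(lambda_k) = log(C) - s * log(k), so the negated
    # slope of a degree-1 least-squares fit estimates s.
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1][:top]
    k = np.arange(1, lam.size + 1)
    slope, _ = np.polyfit(np.log(k), np.log(lam), 1)
    return -slope
```

Tracking such an exponent during training is the kind of scalar summary that could serve as the generalization predictor the abstract describes.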
Paperid:736
Authors:gen li · Yao Wan · Hongyu Zhang · Zhou Zhao · Wenbin Jiang · Xuanhua Shi · Hai Jin · Zheng Wang
Abstract:
Language Models (LMs) are increasingly used for type inference, aiding in error detection and software development. Some real-life deployments of LMs require the model to run on local machines to safeguard software's intellectual property. This setting often limits the size of the LMs that can be used. We present Nester, a neuro-symbolic approach to enhance LMs for type inference by integrating symbolic learning without increasing model size. Nester breaks type inference into sub-tasks based on the data and control flow of the input code, encoding them as a modular high-level program. This program executes multi-step actions, such as evaluating expressions and analyzing conditional branches of the target program, combining static typing with LMs to infer potential types. Evaluated on the ManyTypes4Py dataset in Python, Nester outperforms two state-of-the-art type inference methods (HiTyper and TypeGen), achieving 70.7\% Top-1 Exact Match, 18.3\% and 3.6\% higher than HiTyper and TypeGen, respectively. For complex type annotations using constructs like ''typing.Optional'' and ''typing.Union'', Nester achieves 53.2\%, surpassing TypeGen by 17.9\%.
Paperid:737
Authors:jiwei zhu · Zhao Yang · Bing Su
Abstract:
Inspired by the success of unsupervised pre-training paradigms, researchers have applied these approaches to DNA pre-training. However, we argue that these approaches alone yield suboptimal results because pure DNA sequences lack sufficient information, since their functions are regulated by genomic profiles like chromatin accessibility. Here, we demonstrate that supervised training for genomic profile prediction serves as a more effective alternative to pure sequence pre-training. Furthermore, considering the multi-species and multi-profile nature of genomic profile prediction, we introduce our Species-Profile Adaptive Collaborative Experts (SPACE) that leverages Mixture of Experts (MoE) to better capture the relationships between DNA sequences across different species and genomic profiles, thereby learning more effective DNA representations. Through extensive experiments across various tasks, our model achieves state-of-the-art performance, establishing that DNA models trained with supervised genomic profiles serve as powerful DNA representation learners.
Paperid:738
Authors:Arturs Berzins · Andreas Radler · Eric Volkmann · Sebastian Sanokowski · Sepp Hochreiter · Johannes Brandstetter
Abstract:
Geometry is a ubiquitous tool in computer graphics, design, and engineering. However, the lack of large shape datasets limits the application of state-of-the-art supervised learning methods and motivates the exploration of alternative learning strategies. To this end, we introduce geometry-informed neural networks (GINNs) -- a framework for training shape-generative neural fields without data by leveraging user-specified design requirements in the form of objectives and constraints. By adding diversity as an explicit constraint, GINNs avoid mode-collapse and can generate multiple diverse solutions, often required in geometry tasks. Experimentally, we apply GINNs to several problems spanning physics, geometry, and engineering design, showing control over geometrical and topological properties, such as surface smoothness or the number of holes. These results demonstrate the potential of training shape-generative models without data, paving the way for new generative design approaches without large datasets.
Paperid:739
Authors:Xinyu Yuan · Zichen Wang · Marcus Collins · Huzefa Rangwala
Abstract:
Recent years have witnessed a surge in the development of protein structural tokenization methods, which chunk protein 3D structures into discrete or continuous representations. Structure tokenization enables the direct application of powerful techniques like language modeling for protein structures, and large multimodal models to integrate structures with protein sequences and functional texts. Despite the progress, the capabilities and limitations of these methods remain poorly understood due to the lack of a unified evaluation framework. We first introduce StructTokenBench, a framework that comprehensively evaluates the quality and efficiency of structure tokenizers, focusing on fine-grained local substructures rather than global structures, as is typical in existing benchmarks. Our evaluations reveal that no single model dominates all benchmarking perspectives. Observations of codebook under-utilization led us to develop AminoAseed, a simple yet effective strategy that enhances codebook gradient updates and optimally balances codebook size and dimension for improved tokenizer utilization and quality. Compared to the leading model ESM3, our method achieves an average of 6.31\% performance improvement across 24 supervised tasks, with sensitivity and utilization rates increased by 12.83\% and 124.03\%, respectively.
Paperid:740
Authors:Renzhe Xu · Kang Wang · Bo Li
Abstract:
Data heterogeneity across multiple sources is common in real-world machine learning (ML) settings. Although many methods focus on enabling a single model to handle diverse data, real-world markets often comprise multiple competing ML providers. In this paper, we propose a game-theoretic framework—the Heterogeneous Data Game—to analyze how such providers compete across heterogeneous data sources. We investigate the resulting pure Nash equilibria (PNE), showing that they can be non-existent, homogeneous (all providers converge on the same model), or heterogeneous (providers specialize in distinct data sources). Our analysis spans monopolistic, duopolistic, and more general markets, illustrating how factors such as the ``temperature'' of data-source choice models and the dominance of certain data sources shape equilibrium outcomes. We offer theoretical insights into both homogeneous and heterogeneous PNEs, guiding regulatory policies and practical strategies for competitive ML marketplaces.
Paperid:741
Authors:Wonkwang Lee · Jongwon Jeong · Taehong Moon · Hyeon-Jong Kim · Jaehyeon Kim · Gunhee Kim · Byeong-Uk Lee
Abstract:
Motion synthesis for diverse object categories holds great potential for 3D content creation but remains underexplored due to two key challenges: (1) the lack of comprehensive motion datasets that include a wide range of high-quality motions and annotations, and (2) the absence of methods capable of handling heterogeneous skeletal templates from diverse objects. To address these challenges, we contribute the following: First, we augment the Truebones Zoo dataset—a high-quality animal motion dataset covering over 70 species—by annotating it with detailed text descriptions, making it suitable for text-based motion synthesis. Second, we introduce rig augmentation techniques that generate diverse motion data while preserving consistent dynamics, enabling models to adapt to various skeletal configurations. Finally, we redesign existing motion diffusion models to dynamically adapt to arbitrary skeletal templates, enabling motion synthesis for a diverse range of objects with varying structures. Experiments show our method generates realistic, coherent motions from textual descriptions for diverse and even unseen objects, setting a strong foundation for motion generation across diverse object categories with heterogeneous skeletal templates. Qualitative results are available at this link.
Paperid:742
Authors:Zitian Li · Wang Chi Cheung
Abstract:
Motivated by an open direction in the existing literature, we study the 1-identification problem, a fundamental pure-exploration formulation of the multi-armed bandit. The goal is to determine whether there exists an arm whose mean reward is at least a known threshold $\mu_0$. The agent needs to output the exact index of an arm whose mean reward is above $\mu_0$, or output $\textsf{None}$ if it believes such an arm does not exist. The agent needs to guarantee its output is correct with probability at least $1-\delta$. Degenne & Koolen (2019) studied the asymptotic nature of the 1-identification problem in depth, but they also note that its non-asymptotic analysis remains unclear. To tackle this, we put forward a new algorithm, Sequential-Exploration-Exploitation (SEE), and conduct a theoretical analysis from the non-asymptotic perspective. To the best of our knowledge, we are the first to achieve near-optimality, in the sense of matching upper and lower bounds on the pulling complexity up to a polylogarithmic factor. Numerical results also indicate the effectiveness of our algorithm compared to existing benchmarks.
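To make the 1-identification setting concrete, the following is a minimal pure-Python sketch of a naive anytime confidence-bound baseline (an illustration of the problem interface, not the SEE algorithm from the paper): it returns an arm index once a lower confidence bound clears the threshold, eliminates arms whose upper confidence bound falls below it, and outputs None when every arm is eliminated. The Hoeffding-style bonus and the toy Gaussian arms are illustrative assumptions.

```python
import math
import random

def one_identification(arms, mu0, delta, max_pulls=200000):
    # Naive anytime confidence-bound baseline for 1-identification
    # (an illustration of the problem, NOT the paper's SEE algorithm).
    # `arms` is a list of zero-argument reward samplers.
    K = len(arms)
    counts = [0] * K
    sums = [0.0] * K
    active = set(range(K))
    pulls = 0
    while active and pulls < max_pulls:
        for i in list(active):
            sums[i] += arms[i]()
            counts[i] += 1
            pulls += 1
            mean = sums[i] / counts[i]
            # Hoeffding-style anytime bonus with a union bound over arms/rounds
            bonus = math.sqrt(math.log(4 * K * counts[i] ** 2 / delta) / (2 * counts[i]))
            if mean - bonus > mu0:   # confidently above the threshold
                return i
            if mean + bonus < mu0:   # confidently below: eliminate this arm
                active.discard(i)
    return None                      # no arm appears to exceed mu0

random.seed(0)
arms = [lambda: random.gauss(0.2, 0.1), lambda: random.gauss(0.9, 0.1)]
print(one_identification(arms, mu0=0.5, delta=0.05))  # identifies arm 1
```

The sample complexity of this naive baseline is far from the matching bounds discussed in the abstract; it only illustrates the correctness requirement that the output be right with probability at least 1-delta.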
Paperid:743
Authors:Michael S Yao · James Gee · Osbert Bastani
Abstract:
The goal of offline model-based optimization (MBO) is to propose new designs that maximize a reward function given only an offline dataset. However, an important desideratum is to also propose a diverse set of final candidates that capture many optimal and near-optimal design configurations. We propose Diversity In Adversarial Model-based Optimization (DynAMO) as a novel method to introduce design diversity as an explicit objective into any MBO problem. Our key insight is to formulate diversity as a distribution matching problem where the distribution of generated designs captures the inherent diversity contained within the offline dataset. Extensive experiments spanning multiple scientific domains show that DynAMO can be used with common optimization methods to significantly improve the diversity of proposed designs while still discovering high-quality candidates.
Paperid:744
Authors:Dong Huang · Pengkun Yang
Abstract:
Correlation analysis is a fundamental step in uncovering meaningful insights from complex datasets. In this paper, we study the problem of detecting correlations between two random graphs following the Gaussian Wigner model with unlabeled vertices. Specifically, the task is formulated as a hypothesis testing problem: under the null hypothesis, the two graphs are independent, while under the alternative hypothesis, they are edge-correlated through a latent vertex permutation, yet maintain the same marginal distributions as under the null. We focus on the scenario where two induced subgraphs, each with a fixed number of vertices, are sampled. We determine the optimal rate for the sample size required for correlation detection, derived through an analysis of the conditional second moment. Additionally, we propose an efficient approximate algorithm that significantly reduces running time.
Paperid:745
Authors:Yangxu Liao · Wenke Huang · Guancheng Wan · Jian Liang · Bin Yang · Mang Ye
Abstract:
Federated learning provides an efficient privacy-preserving distributed training framework for large language models, addressing the growing scarcity of publicly available training data while enabling the utilization of private datasets. While integrating large language model fine-tuning with federated learning emerges as a promising research direction, researchers pay limited attention to non-IID instruction-following scenarios. Our key insight is decomposing client updates into consensus and divergence components, enabling the model to maintain core capabilities while adapting to domain-specific knowledge. We propose a novel federated learning framework called FedICU (Splitting with ImportanCe-aware Updating for Heterogeneous Federated Learning with Large Language Models), which introduces an aggregation mechanism that dynamically balances these components based on their contribution to global model performance, while implementing an importance-aware parameter updating strategy to prevent catastrophic forgetting and domain overfitting. Extensive experiments across diverse domains demonstrate that FedICU significantly outperforms existing federated learning approaches in terms of both generalization performance and domain adaptation. Our code is available at https://anonymous.4open.science/r/FedICU-7EC7.
Paperid:746
Authors:Yunzhen Yao · Lie He · Michael Gastpar
Abstract:
This paper considers the sample efficiency of preference learning, which models and predicts human choices based on comparative judgments. The minimax optimal estimation rate $\Theta(d/n)$ in traditional estimation theory requires that the number of samples $n$ scales linearly with the dimensionality of the feature space $d$. However, the high dimensionality of the feature space and the high cost of collecting human-annotated data challenge the efficiency of traditional estimation methods. To remedy this, we leverage sparsity in the preference model and establish sharp estimation rates. We show that under the sparse random utility model, where the parameter of the reward function is $k$-sparse, the minimax optimal rate can be reduced to $\Theta(k/n \log(d/k))$. Furthermore, we analyze the $\ell_{1}$-regularized estimator and show that it achieves a near-optimal rate under mild assumptions on the Gram matrix. Experiments on synthetic data and LLM alignment data validate our theoretical findings, showing that sparsity-aware methods significantly reduce sample complexity and improve prediction accuracy.
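To convey the flavor of an $\ell_1$-regularized preference estimator, here is a schematic pure-Python sketch (not the authors' code): proximal gradient descent (ISTA) on a logistic preference loss over feature-difference vectors, with soft-thresholding as the $\ell_1$ proximal step. The synthetic sparse model, step size, and regularization weight are illustrative assumptions.

```python
import math
import random

def l1_preference_fit(X, y, lam=0.05, step=0.5, iters=300):
    # ISTA (proximal gradient) for l1-regularized logistic preference
    # learning: a generic sketch, not the paper's exact estimator.
    # X[i] is a feature-difference vector for one comparison,
    # y[i] = 1 if the first item was preferred.
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        # gradient of the average logistic loss
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
            for j in range(d):
                grad[j] += (p - yi) * xi[j] / n
        # gradient step followed by soft-thresholding (the l1 prox)
        for j in range(d):
            z = w[j] - step * grad[j]
            w[j] = math.copysign(max(abs(z) - step * lam, 0.0), z)
    return w

random.seed(1)
w_true = [2.0, 0.0, 0.0, -1.5, 0.0]          # k = 2 sparse reward parameter
X = [[random.gauss(0, 1) for _ in range(5)] for _ in range(200)]
y = [1 if sum(a * b for a, b in zip(w_true, xi)) > 0 else 0 for xi in X]
w_hat = l1_preference_fit(X, y)
print([round(v, 2) for v in w_hat])  # large entries on coordinates 0 and 3
```

The soft-thresholding step drives the coordinates outside the true support toward zero, which is the mechanism behind the reduced sample complexity described above.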
Paperid:747
Authors:Runzhong Wang · Rui-Xi Wang · Mrunali Manjrekar · Connor Coley
Abstract:
Molecular machine learning has gained popularity with the advancements of geometric deep learning. In parallel, retrieval-augmented generation has become a principled approach commonly used with language models. However, the optimal integration of retrieval augmentation into molecular machine learning remains unclear. Graph neural networks stand to benefit from clever matching to understand the structural alignment of retrieved molecules to a query molecule. Neural graph matching offers a compelling solution by explicitly modeling node and edge affinities between two structural graphs while employing a noise-robust, end-to-end neural network to learn affinity metrics. We apply this approach to mass spectrum simulation and introduce MARASON, a novel model that incorporates neural graph matching to enhance a fragmentation-based neural network. Experimental results highlight the effectiveness of our design, with MARASON achieving 27% top-1 accuracy, a substantial improvement over the non-retrieval state-of-the-art accuracy of 19%. Moreover, MARASON outperforms both naive retrieval-augmented generation methods and traditional graph matching approaches.
Paperid:748
Authors:Shiqi Chen · Jinghan Zhang · Tongyao Zhu · Wei Liu · Siyang Gao · Miao Xiong · Manling Li · Junxian He
Abstract:
Vision-Language Models (VLMs) combine visual perception with the general capabilities, such as reasoning, of Large Language Models (LLMs). However, the mechanisms by which these two abilities can be combined and contribute remain poorly understood. In this work, we explore composing perception and reasoning through model merging that connects parameters of different models. Unlike previous works that often focus on merging models of the same kind, we propose merging models across modalities, enabling the incorporation of the reasoning capabilities of LLMs into VLMs. Through extensive experiments, we demonstrate that model merging offers a successful pathway to transfer reasoning abilities from LLMs to VLMs in a training-free manner. Moreover, we utilize the merged models to understand the internal mechanism of perception and reasoning and how merging affects it. We find that perception capabilities are predominantly encoded in the early layers of the model, whereas reasoning is largely facilitated by the middle-to-late layers. After merging, we observe that all layers begin to contribute to reasoning, whereas the distribution of perception abilities across layers remains largely unchanged. These observations shed light on the potential of model merging as a tool for multimodal integration and interpretation.
Paperid:749
Authors:Ziang Zhou · Zhihao DING · Jieming Shi · Qing Li · Shiqi Shen
Abstract:
Graph Neural Networks (GNNs) are pivotal in graph-based learning, particularly excelling in node classification. However, their scalability is hindered by the need for multi-hop data during inference, limiting their application in latency-sensitive scenarios. Recent efforts to distill GNNs into multi-layer perceptrons (MLPs) for faster inference often underutilize the layer-level insights of GNNs. In this paper, we present TINED, a novel approach that distills GNNs to MLPs on a layer-by-layer basis using Teacher Injection and Dirichlet Energy Distillation techniques. We focus on two key operations in GNN layers: feature transformation (FT) and graph propagation (GP). We recognize that FT is computationally equivalent to a fully-connected (FC) layer in MLPs. Thus, we propose directly transferring teacher parameters from an FT in a GNN to an FC layer in the student MLP, enhanced by fine-tuning. In TINED, the FC layers in an MLP replicate the sequence of FTs and GPs in the GNN. We also establish a theoretical bound for GP approximation. Furthermore, we note that FT and GP operations in GNN layers often exhibit opposing smoothing effects: GP is aggressive, while FT is conservative. Using Dirichlet energy, we develop a DE ratio to measure these effects and propose Dirichlet Energy Distillation to convey these characteristics from GNN layers to MLP layers. Extensive experiments show that TINED outperforms GNNs and leading distillation methods across various settings and seven datasets. Source code is available at \url{https://anonymous.4open.science/r/TINED-7527/}.
Paperid:750
Authors:Karthik Sridharan · Seung Won Wilson Yoo
Abstract:
We consider the problem of online learning where the sequence of actions played by the learner must adhere to an unknown safety constraint at every round. The goal is to minimize regret with respect to the best safe action in hindsight while simultaneously satisfying the safety constraint with high probability on each round. We provide a general meta-algorithm that leverages an online regression oracle to estimate the unknown safety constraint, and converts the predictions of an online learning oracle to predictions that adhere to the unknown safety constraint. On the theoretical side, our algorithm's regret can be bounded by the regret of the online regression and online learning oracles, the eluder dimension of the model class containing the unknown safety constraint, and a novel complexity measure that characterizes the difficulty of safe learning. We complement our result with an asymptotic lower bound that shows that the aforementioned complexity measure is necessary. When the constraints are linear, we instantiate our result to provide a concrete algorithm with $\sqrt{T}$ regret using a scaling transformation that balances optimistic exploration with pessimistic constraint satisfaction.
Paperid:751
Authors:Shiqing Gao · Jiaxin Ding · Luoyi Fu · Xinbing Wang
Abstract:
Constrained Reinforcement Learning (CRL) aims to maximize cumulative rewards while satisfying constraints. However, existing CRL algorithms often encounter significant constraint violations during training, limiting their applicability in safety-critical scenarios. In this paper, we identify the underestimation of the cost value function as a key factor contributing to these violations. To address this issue, we propose the Memory-driven Intrinsic Cost Estimation (MICE) method, which introduces intrinsic costs to mitigate underestimation and leverages bias to promote safer exploration. Inspired by flashbulb memory, where humans vividly recall dangerous experiences to avoid risks, MICE constructs a memory module that stores previously explored unsafe states to identify high-cost regions. The intrinsic cost is formulated as the visit count of the current state to these risk regions. Furthermore, we propose an extrinsic-intrinsic cost value function that incorporates intrinsic costs and adopts a bias correction strategy. Using this function, we formulate an optimization objective within the trust region, along with corresponding optimization methods. Theoretically, we provide convergence guarantees for the proposed cost value function and establish the worst-case constraint violation for the MICE update, ensuring fewer constraint violations compared to baselines. Extensive experiments demonstrate that MICE significantly reduces constraint violations while preserving policy performance comparable to baselines.
Paperid:752
Authors:Yunze Tong · Fengda Zhang · Zihao Tang · Kaifeng Gao · Kai Huang · Pengfei Lyu · Jun Xiao · Kun Kuang
Abstract:
Machine learning models often perform well on tabular data by optimizing average prediction accuracy. However, they may underperform on specific subsets due to inherent biases and spurious correlations in the training data, such as associations with non-causal features like demographic information. These biases lead to critical robustness issues as models may inherit or amplify them, resulting in poor performance where such misleading correlations do not hold. Existing mitigation methods have significant limitations: some require prior group labels, which are often unavailable, while others focus solely on the conditional distribution (P(Y|X)), upweighting misclassified samples without effectively balancing the overall data distribution (P(X)). To address these shortcomings, we propose a latent score-based reweighting framework. It leverages score-based models to capture the joint data distribution (P(X, Y)) without relying on additional prior information. By estimating sample density through the similarity of score vectors with neighboring data points, our method identifies underrepresented regions and upweights samples accordingly. This approach directly tackles inherent data imbalances, enhancing robustness by ensuring a more uniform dataset representation. Experiments on various tabular datasets under distribution shifts demonstrate that our method effectively improves performance on imbalanced data.
Paperid:753
Authors:Siavash Ameli · Chris van der Heide · Liam Hodgkinson · Fred Roosta · Michael Mahoney
Abstract:
Calculating or accurately estimating log-determinants of large positive semi-definite matrices is of fundamental importance in many machine learning tasks. While its cubic computational complexity can already be prohibitive, in modern applications even storing the matrices themselves can pose a memory bottleneck. To address this, we derive a novel hierarchical algorithm based on block-wise computation of the LDL decomposition for large-scale log-determinant calculation in memory-constrained settings. In extreme cases where matrices are highly ill-conditioned, accurately computing the full matrix itself may be infeasible. This is particularly relevant when considering kernel matrices at scale, including the empirical Neural Tangent Kernel (NTK) of neural networks trained on large datasets. Under the assumption of neural scaling laws in the test error, we show that the ratio of pseudo-determinants satisfies a power-law relationship, enabling the derivation of corresponding scaling laws. This allows for accurate estimation of NTK log-determinants from a tiny fraction of the full dataset; in our experiments, this results in a $\sim$100,000$\times$ speedup with improved accuracy over other state-of-the-art approaches. Using these techniques, we successfully estimate log-determinants for dense matrices of extreme sizes, which were previously deemed intractable and inaccessible due to their enormous scale and computational requirements.
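The block-wise idea can be illustrated with the classical Schur-complement identity logdet(A) = logdet(A11) + logdet(A22 - A21 A11^{-1} A12). Below is a schematic pure-Python sketch built on that identity; the paper's algorithm is a hierarchical block LDL method engineered for memory-constrained, large-scale settings, while this toy version only shows the recursion on a small symmetric positive-definite matrix.

```python
import math

def block_logdet(A):
    # Log-determinant of a symmetric positive-definite matrix via recursive
    # 2x2 block partitioning and the Schur complement:
    #   logdet(A) = logdet(A11) + logdet(A22 - A21 A11^{-1} A12).
    # A toy illustration of block-wise logdet, not the paper's LDL algorithm.
    n = len(A)
    if n == 1:
        return math.log(A[0][0])
    m = n // 2
    A11 = [row[:m] for row in A[:m]]
    A12 = [row[m:] for row in A[:m]]
    A21 = [row[:m] for row in A[m:]]
    A22 = [row[m:] for row in A[m:]]
    # solve A11 X = A12, then form the Schur complement S = A22 - A21 X
    X = solve(A11, A12)
    S = [[A22[i][j] - sum(A21[i][k] * X[k][j] for k in range(m))
          for j in range(n - m)] for i in range(n - m)]
    return block_logdet(A11) + block_logdet(S)

def solve(M, B):
    # Gauss-Jordan elimination with partial pivoting: returns X with M X = B.
    n = len(M)
    aug = [M[i][:] + B[i][:] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0.0:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

# SPD example: A = I + v v^T has det = 1 + v.v, so logdet = log(1 + 30)
v = [1.0, 2.0, 3.0, 4.0]
A = [[(1.0 if i == j else 0.0) + v[i] * v[j] for j in range(4)] for i in range(4)]
print(block_logdet(A))  # close to log(31)
```

Because each recursive call only touches one block and its Schur complement, this decomposition is the structural reason block-wise schemes can trade memory for extra passes over the matrix.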
Paperid:754
Authors:Gabriel Kasmi · Amandine Brunetto · Thomas Fel · Jayneel Parekh
Abstract:
A major limitation of deep neural networks in safety-critical contexts is their lack of transparency. Feature attribution methods address this by identifying which features influence a model's decisions. For high-dimensional inputs like images, attributions often appear as pixel-based heatmaps. However, extending pixel-based methods to non-image domains leads to suboptimal explanations. This paper introduces the Wavelet Attribution Method (WAM), which replaces pixels with wavelet coefficients as features for attribution. Wavelet coefficients, defined for square-integrable signals across dimensions, provide a mathematically grounded and semantically richer domain. We demonstrate WAM's effectiveness on audio, images, and volumes, where it quantitatively matches or outperforms existing gradient-based methods. Explanations generated by WAM leverage the spatial and scale-localized properties of wavelet coefficients, capturing both the where and what of a model's focus. We illustrate this by showing how WAM relates to model robustness, enables deeper volume interpretation, and identifies key audio components in noisy samples.
Paperid:755
Authors:Canhong Wen · Yihong Zuo · Wenliang Pan
Abstract:
Large Language Models (LLMs) have showcased remarkable performance across a range of tasks but are hindered by their massive parameter sizes, which impose significant computational and storage demands. Pruning has emerged as an effective solution to reduce model size, but traditional methods often involve inefficient retraining or rely on heuristic-based one-shot approaches that lack theoretical guarantees. In this paper, we reformulate the pruning problem as an $\ell_0$-penalized optimization problem and propose a monotone accelerated Iterative Hard Thresholding (mAIHT) method. Our approach combines solid theoretical foundations with practical effectiveness, offering a detailed theoretical analysis that covers convergence, convergence rates, and risk upper bounds. Through extensive experiments, we demonstrate that mAIHT outperforms state-of-the-art pruning techniques by effectively pruning the LLaMA-7B model across various evaluation metrics.
Paperid:756
Authors:Kexin Huang · Junkang Wu · Ziqian Chen · xue wang · Jinyang Gao · Bolin Ding · Jiancan Wu · Xiangnan He · Xiang Wang
Abstract:
Preference learning is critical for aligning large language models (LLMs) with human values, with the quality of preference datasets playing a crucial role in this process. While existing metrics primarily assess data quality based on either explicit or implicit reward margins, they often provide contradictory evaluations for the same data. To address this issue, we introduce the alignment potential metric, which quantifies the gap from the model's current implicit reward margin to the target explicit reward margin, thereby estimating the model's potential to align with the preference data. Empirical results demonstrate that training on data selected by this metric consistently enhances alignment performance, surpassing existing metrics across different base models and optimization objectives. Furthermore, our method extends to self-play data generation frameworks, where the metric is used to identify high-quality data within the self-generated content by LLMs. Under this data generation scenario, our method surpasses current state-of-the-art (SOTA) results across various training settings and demonstrates continuous improvements in alignment performance as dataset size and training iterations increase.
Paperid:757
Authors:Marta Aparicio Rodriguez · Xenia Miscouridou · Anastasia Borovykh
Abstract:
Despite significant advances in the quality and complexity of the generations in text-to-image models, prompting does not always lead to the desired outputs. Controlling model behaviour by directly steering intermediate model activations has emerged as a viable alternative, allowing one to reach concepts in latent space that may otherwise remain inaccessible by prompt. In this work, we introduce a set of experiments to deepen our understanding of concept reachability. We design a training data setup with three key obstacles: scarcity of concepts, underspecification of concepts in the captions, and data biases with tied concepts. Our results show: (i) concept reachability in latent space exhibits a distinct phase transition, with only a small number of samples being sufficient to enable reachability, (ii) where in the latent space the intervention is performed critically impacts reachability, showing that certain concepts are reachable only at certain stages of transformation, and (iii) while prompting ability rapidly diminishes with a decrease in the quality of the dataset, concepts often remain reliably reachable through steering. Model providers can leverage this to bypass costly retraining and dataset curation and instead innovate with user-facing control mechanisms.
Paperid:758
Authors:Chengpiao Huang · Yuhang Wu · Kaizheng Wang
Abstract:
We investigate the reliable use of simulated survey responses from large language models (LLMs) through the lens of uncertainty quantification. Our approach converts synthetic data into confidence sets for population parameters of human responses, addressing the distribution shift between the simulated and real populations. A key innovation lies in determining the optimal number of simulated responses: too many produce overly narrow confidence sets with poor coverage, while too few yield excessively loose estimates. To resolve this, our method adaptively selects the simulation sample size, ensuring valid average-case coverage guarantees. It is broadly applicable to any LLM, irrespective of its fidelity, and any procedure for constructing confidence sets. Additionally, the selected sample size quantifies the degree of misalignment between the LLM and the target human population. We illustrate our method on real datasets and LLMs.
Paperid:759
Authors:Guancheng Wan · Zijie Huang · Wanjia Zhao · Xiao Luo · Yizhou Sun · Wei Wang
Abstract:
Coupled dynamical systems govern essential phenomena across physics, biology, and engineering, where components interact through complex dependencies. While Graph Ordinary Differential Equations (GraphODE) offer a powerful framework to model these systems, their generalization capabilities degrade severely under limited observational training data due to two fundamental flaws: (i) the entanglement of static attributes and dynamic states in the initialization process, and (ii) the reliance on context-specific coupling patterns during training, which hinders performance in unseen scenarios. In this paper, we propose a Generalizable GraphODE with disentanglement and regularization (GREAT) to address these challenges. Through systematic analysis via the Structural Causal Model, we identify backdoor paths that undermine generalization and design two key modules to mitigate their effects. The Dynamic-Static Equilibrium Decoupler (DyStaED) disentangles static and dynamic states via orthogonal subspace projections, ensuring robust initialization. Furthermore, the Causal Mediation for Coupled Dynamics (CMCD) employs variational inference to estimate latent causal factors, reducing spurious correlations and enhancing universal coupling dynamics. Extensive experiments across diverse dynamical systems demonstrate that GREAT outperforms state-of-the-art methods in both in-distribution and out-of-distribution settings.
Paperid:760
Authors:Jiawei Cao · Chaochen Gu · Changsheng Lu · Hao Cheng · Xiaofeng Zhang · Kaijie Wu
Abstract:
Polygon-based object representations efficiently model object boundaries but are limited by high optimization complexity, which hinders their adoption compared to more flexible pixel-based methods. In this paper, we introduce a novel vertex regression loss grounded in Fourier elliptic descriptors, which removes the need for rasterization or heuristic approximations and resolves ambiguities in boundary point assignment through frequency-domain matching. To advance polygon-based instance segmentation, we further propose EFDTR (\textbf{E}lliptical \textbf{F}ourier \textbf{D}escriptor \textbf{Tr}ansformer), an end-to-end learnable framework that leverages the expressiveness of Fourier-based representations. The model achieves precise contour predictions through a two-stage approach: the first stage predicts elliptical Fourier descriptors for global contour modeling, while the second stage refines contours for fine-grained accuracy. Experimental results on the COCO dataset show that EFDTR outperforms existing polygon-based methods, offering a promising alternative to pixel-based approaches.
Paperid:761
Authors:Shahaf Bassan · Yizhak Elboher · Tobias Ladner · Matthias Althoff · Guy Katz
Abstract:
Despite significant advancements in post-hoc explainability techniques for neural networks, many current methods rely on heuristics and do not provide formally provable guarantees over the explanations provided. Recent work has shown that it is possible to obtain explanations with formal guarantees by identifying subsets of input features that are sufficient to determine that predictions remain unchanged, using neural network verification techniques. Despite the appeal of these explanations, their computation faces significant scalability challenges. In this work, we address this gap by proposing a novel abstraction-refinement technique for efficiently computing provably sufficient explanations of neural network predictions. Our method abstracts the original large neural network by constructing a substantially reduced network, where a sufficient explanation of the reduced network is also provably sufficient for the original network, hence significantly speeding up the verification process. If the explanation is insufficient on the reduced network, we iteratively refine the network size by gradually increasing it until convergence. Our experiments demonstrate that our approach enhances the efficiency of obtaining provably sufficient explanations for neural network predictions while additionally providing a fine-grained interpretation of the network's predictions across different abstraction levels.
Paperid:762
Authors:Weihang Ran · Wei Yuan · Yinqiang Zheng
Abstract:
Damage to imaging systems and complex external environments often introduce corruption, which can impair the performance of deep learning models pretrained on high-quality image data. Previous methods have focused on restoring degraded images or fine-tuning models to adapt to out-of-distribution data. However, these approaches struggle with complex, unknown corruptions and often reduce model accuracy on high-quality data. Inspired by the use of warning colors and camouflage in the real world, we propose designing a robust appearance that can enhance model recognition of low-quality image data. Furthermore, we demonstrate that certain universal features in radiance fields can be applied across objects of the same class with different geometries. We also examine the impact of different proxy models on the transferability of robust appearances. Extensive experiments demonstrate the effectiveness of our proposed method, which outperforms existing image restoration and model fine-tuning approaches by 34.97\% to 73.97\% in accuracy across different experimental settings, and retains effectiveness when transferred to models with different architectures.
Paperid:763
Authors:Jiawei Zhang · Xuan Yang · Taiqi Wang · Yu Yao · Aleksandr Petiushko · Bo Li
Abstract:
Traditional autonomous driving systems often struggle to integrate high-level reasoning with low-level control, resulting in suboptimal and sometimes unsafe driving behaviors. The emergence of multimodal large language models (MLLMs), which can process both visual and textual data, presents an opportunity to unify perception and reasoning tasks within a single framework. However, effectively embedding precise safety knowledge into MLLMs for autonomous driving remains a significant challenge. To address this, we propose SafeAuto, a novel framework that enhances MLLM-based autonomous driving systems by incorporating both unstructured and structured knowledge. Specifically, we first introduce the Position-Dependent Cross-Entropy (PDCE) loss function, designed to improve the accuracy of low-level control signal predictions when numerical values are represented as text. Second, to ensure safe autonomous driving by explicitly integrating precise safety knowledge into the MLLM, we develop a reasoning component for SafeAuto. This component translates driving safety regulations into first-order logic rules (e.g., "red light $\implies$ stop") and incorporates these rules into a probabilistic graphical model, such as a Markov Logic Network (MLN). The MLN is trained to verify the predicted next actions using environmental attributes identified by attribute recognition models (e.g., detecting a red light) to form the predicates. Additionally, we construct a Multimodal Retrieval-Augmented Generation (Multimodal RAG) model that leverages video data, control signals, and environmental attributes to learn more effectively from past similar driving experiences. By integrating PDCE, MLN, and Multimodal RAG, SafeAuto significantly outperforms existing baselines across multiple datasets.
This advancement paves the way for more accurate, reliable, and safer autonomous driving systems that effectively learn from experience, adhere to traffic regulations, and execute precise control actions.
Paperid:764
Authors:Puli Wang · Yu Qi · Yueming Wang · Gang Pan
Abstract:
The primary goal of brain-computer interfaces (BCIs) is to establish a direct linkage between neural activities and behavioral actions via neural decoders. Due to the nonstationary property of neural signals, BCIs trained on one day usually obtain degraded performance on other days, hindering the user experience. Existing studies attempted to address this problem by aligning neural signals across different days. However, these neural adaptation methods may exhibit instability and poor performance when only a few trials are available for alignment, limiting their practicality in real-world BCI deployment. To achieve efficient and stable neural adaptation with few trials, we propose Flow-Based Distribution Alignment (FDA), a novel framework that utilizes flow matching to learn flexible neural representations with stable latent dynamics, thereby facilitating source-free domain alignment through likelihood maximization. The latent dynamics of the FDA framework are theoretically proven to be stable using Lyapunov exponents, allowing for robust adaptation. Further experiments across multiple motor cortex datasets demonstrate the superior performance of FDA, achieving reliable results with fewer than five trials. Our FDA approach offers a novel and efficient solution for few-trial neural data adaptation, with significant potential for improving the long-term viability of real-world BCI applications.
Paperid:765
Authors:Tianyi Liang · Jiangqi Liu · Yifei Huang · Shiqi Jiang · Jianshen Shi · Changbo Wang · Chenhui Li
Abstract:
Text-to-image (T2I) generation has made remarkable progress in producing high-quality images, but a fundamental challenge remains: creating backgrounds that naturally accommodate text placement without compromising image quality. This capability is non-trivial for real-world applications like graphic design, where a clear visual hierarchy between content and text is essential. Prior work has primarily focused on arranging layouts within existing static images, leaving unexplored the potential of T2I models for generating text-friendly backgrounds. We present TextCenGen, a training-free approach that actively relocates objects before optimizing text regions, rather than directly reducing cross-attention, which degrades image quality. Our method introduces: (1) a force-directed graph approach that detects conflicting objects and guides their relocation using cross-attention maps, and (2) a spatial attention constraint that ensures smooth background generation in text regions. Our method is plug-and-play, requiring no additional training while balancing both semantic fidelity and visual quality. Evaluated on our proposed text-friendly T2I benchmark of 27,000 images across three seed datasets, TextCenGen outperforms existing methods by achieving 23\% lower saliency overlap in text regions while maintaining 98\% of the original semantic fidelity measured by CLIP score and our proposed Visual-Textual Concordance Metric (VTCM).
Paperid:766
Authors:Akashah Shabbir · Ilmuz Zaman Mohammed Zumri · Mohammed Bennamoun · Fahad Khan · Salman Khan
Abstract:
Recent advances in large multimodal models (LMMs) have recognized fine-grained grounding as an essential factor in visual understanding and dialogue. However, the benefits of such representation in LMMs are limited to the natural image domain, and these models perform poorly for remote sensing (RS). The distinct overhead viewpoint, scale variation, and presence of small objects in high-resolution RS imagery present a unique challenge for region-level comprehension. Moreover, the development of the grounded conversation capability of LMMs within RS is hindered by the lack of granular, RS domain-specific grounded data. Addressing these limitations, we propose GeoPixel, the first end-to-end high-resolution RS-LMM that supports pixel-level grounding. This capability allows fine-grained visual perception by generating interleaved masks in conversation. GeoPixel supports up to 4K HD resolution in any aspect ratio, making it ideal for high-precision RS image analysis. To support grounded conversation generation (GCG) in RS imagery, we curate a visually grounded dataset, GeoPixelD, through a semi-automated pipeline that utilizes set-of-marks prompting and spatial priors tailored for RS data to methodically control the data generation process. GeoPixel demonstrates superior performance in pixel-level comprehension, surpassing existing LMMs in both single-target and multi-target segmentation tasks. Our methodological ablation studies validate the effectiveness of each component in the overall architecture. Our code and data will be publicly released.
Paperid:767
Authors:Alina Ene · Alessandro Epasto · Vahab Mirrokni · Hoai-An Nguyen · Huy Nguyen · David Woodruff · Peilin Zhong
Abstract:
In the maximum coverage problem we are given $d$ subsets from a universe $[n]$, and the goal is to output $k$ subsets such that their union covers the largest possible number of distinct items. We present the first algorithm for maximum coverage in the turnstile streaming model, where updates which insert or delete an item from a subset arrive one by one. Notably, our algorithm uses only $\mathrm{poly}\log n$ update time. We also present turnstile streaming algorithms for targeted and general fingerprinting for risk management, where the goal is to determine which features pose the greatest re-identification risk in a dataset. As part of our work, we give a result of independent interest: an algorithm to compute the complement of the $p^{\text{th}}$ frequency moment of a vector for $p \geq 2$. Empirical evaluation confirms the practicality of our fingerprinting algorithms, demonstrating a speedup of up to $210\times$ over prior work.
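For readers unfamiliar with the objective, the classic offline greedy baseline for maximum coverage is a few lines. This sketch only illustrates the problem being solved; it is not the paper's turnstile streaming algorithm, which must handle insertions and deletions with polylogarithmic update time.

```python
def greedy_max_coverage(subsets, k):
    """Offline greedy for maximum coverage: repeatedly pick the subset
    covering the most still-uncovered items. Achieves the classical
    (1 - 1/e) approximation guarantee."""
    covered, chosen = set(), []
    for _ in range(k):
        gains = [len(s - covered) for s in subsets]
        best = max(range(len(subsets)), key=gains.__getitem__)
        if gains[best] == 0:  # nothing new can be covered
            break
        chosen.append(best)
        covered |= subsets[best]
    return chosen, covered
```

For example, with subsets `[{1,2,3}, {3,4}, {5,6}]` and `k=2`, greedy picks subsets 0 and 2, covering five distinct items.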
Paperid:768
Authors:Minghao Wu · Thuy-Trang Vu · Lizhen Qu · Reza Haffari
Abstract:
The performance of large language models (LLMs) is strongly influenced by the quality and diversity of data used during supervised fine-tuning (SFT). However, current data selection methods often prioritize one aspect over the other, resulting in suboptimal training outcomes. To address this, we formulate data selection as a set cover problem and present GraphFilter, a novel approach that balances both quality and diversity in data selection. GraphFilter models the dataset as a bipartite graph connecting sentences to their constituent n-grams, then employs a priority function that combines quality and diversity metrics multiplicatively. GraphFilter iteratively selects the sentence with the highest priority, removes covered n-grams from the bipartite graph, and recomputes priorities to reflect the changing data landscape. We validate GraphFilter using three model backbones across six widely-used benchmarks, demonstrating that it outperforms nine existing baselines in both model performance and computational efficiency. Further analysis shows that our design choices lead to more effective subset selection, underscoring the value of instruction diversity and providing insights into how quality and diversity interact across different subset sizes.
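The iterative selection loop described above can be sketched in a few lines. This is a toy re-implementation of the abstract's description, not the authors' code: priority is the product of a given quality score and the number of not-yet-covered n-grams, the argmax sentence is selected, and its n-grams are removed from the pool.

```python
def ngrams(sentence, n=2):
    """Set of word n-grams in a sentence (whitespace tokenization)."""
    toks = sentence.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def graph_filter(sentences, quality, budget, n=2):
    """Toy GraphFilter-style loop: priority = quality x coverage gain,
    pick the argmax, mark its n-grams covered, recompute, repeat."""
    grams = {i: ngrams(s, n) for i, s in enumerate(sentences)}
    covered, selected = set(), []
    for _ in range(budget):
        prio = {i: quality[i] * len(g - covered)
                for i, g in grams.items() if i not in selected}
        if not prio:
            break
        best = max(prio, key=prio.get)
        if prio[best] == 0:  # no remaining diversity gain
            break
        selected.append(best)
        covered |= grams[best]
    return selected
```

Note how the multiplicative priority makes a duplicate of an already-selected sentence worthless (its coverage gain is zero) regardless of its quality score.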
Paperid:769
Authors:Jintao Zhang · Haofeng Huang · Pengle Zhang · Jia wei · Jun Zhu · Jianfei Chen
Abstract:
Although quantization for linear layers has been widely used, its application to accelerate the attention process remains limited. To further enhance the efficiency of attention computation compared to SageAttention while maintaining precision, we propose SageAttention2, which utilizes significantly faster 4-bit matrix multiplication (Matmul) alongside additional precision-enhancing techniques. First, we propose to quantize the matrices $(Q, K)$ to INT4 at a hardware-friendly thread-level granularity and quantize the matrices $(\widetilde P, V)$ to FP8. Second, we propose a method to smooth $Q$, enhancing the accuracy of INT4 $QK^\top$. Third, we propose a two-level accumulation strategy for $\widetilde PV$ to enhance the accuracy of FP8 $\widetilde PV$. The operations per second (OPS) of SageAttention2 surpass FlashAttention2 and xformers by about **3x** and **4.5x** on RTX4090, respectively. Moreover, SageAttention2 matches the speed of FlashAttention3 (FP8) on Hopper GPUs, while delivering much higher accuracy. Comprehensive experiments confirm that our approach incurs negligible end-to-end metrics loss across diverse models, including those for language, image, and video generation.
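The basic mechanics of INT4 quantization can be shown with a minimal sketch. This simplifies the paper's hardware-friendly thread-level granularity down to a single scale per row, and says nothing about the smoothing of $Q$ or the two-level FP8 accumulation; it only shows the symmetric scale-round-clip scheme that 4-bit Matmul relies on.

```python
def quantize_int4(row):
    """Symmetric per-row INT4 quantization: map floats to integers in
    [-7, 7] with a single float scale (a simplified stand-in for the
    per-thread-block granularity described in the abstract)."""
    scale = max((abs(v) for v in row), default=0.0) / 7.0 or 1.0
    q = [max(-7, min(7, round(v / scale))) for v in row]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats; error per entry is at most scale/2."""
    return [v * scale for v in q]
```

Because the integer range is only [-7, 7], the reconstruction error grows with the row's dynamic range, which is why the paper needs the additional precision-enhancing techniques (smoothing, two-level accumulation) to keep end-to-end metrics intact.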
Paperid:770
Authors:Yicheng Luo · Bowen Zhang · Zhen Liu · Qianli Ma
Abstract:
Multi-scale information is crucial for multivariate time series modeling. However, most existing time series multi-scale analysis methods treat all variables in the same manner, making them unsuitable for Irregular Multivariate Time Series (IMTS), where variables have distinct native scales/sampling rates. To fill this gap, we propose Hi-Patch, a hierarchical patch graph network. Hi-Patch encodes each observation as a node, captures local temporal and inter-variable dependencies of densely sampled variables through an intra-patch graph layer, and obtains patch-level nodes through aggregation. These nodes are then updated and re-aggregated through a stack of inter-patch graph layers, where several scale-specific graph networks progressively extract more global temporal and inter-variable features of both sparsely and densely sampled variables at specific scales. The output of the last layer is fed into task-specific decoders to adapt to different downstream tasks. Experiments on 8 datasets demonstrate that Hi-Patch outperforms state-of-the-art models in IMTS forecasting and classification tasks. Code is available at: https://anonymous.4open.science/r/Hi-Patch.
Paperid:771
Authors:Haibo Zhao · Dian Wang · Yizhe Zhu · Xupeng Zhu · Owen Howell · Linfeng Zhao · Yaoyao Qian · Robin Walters · Robert Platt
Abstract:
Recent advances in hierarchical policy learning highlight the advantages of decomposing systems into high-level and low-level agents, enabling efficient long-horizon reasoning and precise fine-grained control. However, the interface between these hierarchy levels remains underexplored, and existing hierarchical methods often ignore domain symmetry, resulting in the need for extensive demonstrations to achieve robust performance. To address these issues, we propose the Hierarchical Equivariant Policy (HEP), a novel hierarchical policy framework. We propose a frame transfer interface for hierarchical policy learning, which uses the high-level agent's output as a coordinate frame for the low-level agent, providing a strong inductive bias while retaining flexibility. Additionally, we integrate domain symmetries into both levels and theoretically demonstrate the system's overall equivariance. HEP achieves state-of-the-art performance in complex robotic manipulation tasks, demonstrating significant improvements in both simulation and real-world settings.
Paperid:772
Authors:Jiaru Qian · Guancheng Wan · Wenke Huang · Guibin Zhang · Yuxin Wu · Bo Du · Mang Ye
Abstract:
Federated Graph Learning (FGL) offers an effective approach to collaboratively training Graph Neural Networks (GNNs) while maintaining privacy. Nevertheless, communication efficiency becomes a critical bottleneck in environments with limited resources. In this context, one-shot FGL emerges as a promising solution by restricting communication to a single round. However, prevailing FGL methods face two key challenges in the one-shot setting: 1) They heavily rely on gradual personalized optimization over multiple rounds, undermining the capability of the global model to efficiently generalize across diverse graph structures. 2) They are prone to overfitting to local data distributions due to extreme structural bias, leading to catastrophic forgetting. To address these issues, we introduce GHOST, an innovative one-shot FGL framework. In GHOST, we establish a proxy model for each client to leverage diverse local knowledge and integrate it to train the global model. During training, we compute the Topology-Consistency Criterion to identify and consolidate parameters essential for capturing topological knowledge, thereby mitigating catastrophic forgetting. Extensive experiments on real-world tasks demonstrate the superiority and generalization capability of GHOST. The code is available for anonymous access at https://anonymous.4open.science/r/GHOST-ICML4443.
Paperid:773
Authors:Chengbo Zang · Mehmet Turkcan · Gil Zussman · Zoran Kostic · Javad Ghaderi
Abstract:
We propose a framework for adaptive data collection aimed at robust learning in multi-distribution scenarios under a fixed data collection budget. In each round, the algorithm selects a distribution source to sample from for data collection, incorporates the collected samples into the dataset, updates the model parameters, and repeats the process. The objective is to find the model parameters that minimize the expected loss across all the data sources. Our approach integrates upper-confidence-bound (UCB) sampling with online gradient descent (OGD) to dynamically collect and annotate data from multiple sources. By bridging online optimization and multi-armed bandits, we provide theoretical guarantees for our UCB-OGD approach, demonstrating that it achieves a minimax regret of $O(T^{\frac{1}{2}}(K\ln T)^{\frac{1}{2}})$ over $K$ data sources after $T$ rounds. We further provide a lower bound showing that the result is optimal up to a $\ln T$ factor. Extensive evaluations on standard datasets and a real-world testbed for object detection in smart-city intersections validate the consistent performance improvements of our method compared to baselines such as random sampling and various active learning methods.
Paperid:774
Authors:Amin Nejatbakhsh · Yixin Wang
Abstract:
Neural circuits produce signals that are complex and nonlinear. To facilitate the understanding of neural dynamics, a popular approach is to fit state space models (SSM) to data and analyze the dynamics of the low-dimensional latent variables. Despite the power of SSM in explaining neural circuit dynamics, it has been shown that these models merely capture statistical associations in the data and cannot be causally interpreted. Therefore, an important research problem is to build models that can predict neural dynamics under causal manipulations. Here, we propose interventional state space models (iSSM), a class of causal models that can predict neural responses to novel perturbations. We draw on recent advances in causal dynamical systems and present theoretical results for the identifiability of iSSM. In simulations of the motor cortex, we show that iSSM can recover the true latents and the underlying dynamics. In addition, we illustrate two applications of iSSM in biological datasets. First, we apply iSSM to a dataset of calcium recordings from ALM neurons in mice during photostimulation. Second, we apply iSSM to a dataset of electrophysiological recordings from macaque dlPFC during micro-stimulation. We show that in both cases iSSM outperforms SSM and yields identifiable parameters.
Paperid:775
Authors:Gursimran Singh · Xinglu Wang · Yifan Hu · Timothy Yu · Linzi Xing · Wei Jiang · Zhefeng Wang · Bai Xiaolong · Yi Li · Ying Xiong · Yong Zhang · Zhenan Fan
Abstract:
Large Multimodal Models (LMMs) extend Large Language Models (LLMs) by handling diverse inputs such as images, audio, and video, but at the cost of adding a multimodal encoding stage that increases both computational and memory overhead. This stage negatively impacts key Service Level Objectives (SLOs) like time to first token (TTFT) and end-to-end throughput (E2ETP). We introduce Encode-Prefill-Decode (EPD) Disaggregation, a novel framework that separates the encoding, prefill, and decode stages onto dedicated resources. Unlike current systems, which bundle encoding and prefill together, our approach decouples these stages, unlocking new opportunities and optimizations. These include a new mechanism to cache multimedia tokens for efficient transfer, a novel way to parallelize encoding load within a request, a module to find the optimal resource allocation for disaggregated serving, and a novel role-switching method to handle changing workload characteristics. Experimental evaluations with popular LMMs show substantial gains in memory efficiency (up to 15$\times$ less utilization), batch sizes (up to 22$\times$ larger), 10$\times$ more images per request, and 2.2$\times$ larger KV caches. Further, it leads to significant improvements in latency metrics (TTFT up to 71\% reduction) and end-to-end throughput (up to 57\% reduction), compared to systems that do not disaggregate.
Paperid:776
Authors:Yang Zhou · Hongyi Liu · Zhuoming Chen · Yuandong Tian · Beidi Chen
Abstract:
Recently, long-context large language models (LLMs) have shown strong performance in information retrieval and long-document QA. However, to tackle the most challenging intellectual problems, LLMs must reason effectively in long and complex contexts (e.g., frontier mathematical research). Studying how LLMs handle increasing reasoning complexity and context length is essential, yet existing benchmarks lack a solid basis for quantitative evaluation. Inspired by the abstraction of GSM-8K problems as computational graphs, and the ability to introduce noise by adding unnecessary nodes and edges, we develop a grade-school math problem generator capable of producing arithmetic problems with infinite difficulty and context length under fine-grained control. Using our newly synthesized GSM-$\infty$ benchmark, we comprehensively evaluate existing LLMs. We find a consistent sigmoid decline in reasoning performance as complexity increases, along with a systematic inference scaling trend: exponentially increasing inference computation yields only linear performance gains. These findings underscore the fundamental limitations of current long-context LLMs and the key challenges in scaling reasoning capabilities. Our GSM-$\infty$ benchmark provides a scalable and controllable testbed for systematically studying and advancing LLM reasoning in long and complex contexts.
Paperid:777
Authors:Ziyue Zhang · Mingbao Lin · Shuicheng YAN · Rongrong Ji
Abstract:
This paper introduces EasyInv, an easy yet novel approach that significantly advances the field of DDIM Inversion by addressing the inherent inefficiencies and performance limitations of traditional iterative optimization methods. At the core of EasyInv is a refined strategy for approximating inversion noise, which is pivotal for enhancing the accuracy and reliability of the inversion process. By prioritizing the initial latent state, which encapsulates rich information about the original images, EasyInv steers clear of the iterative refinement of noise items. Instead, we introduce a methodical aggregation of the latent state from the preceding time step with the current state, effectively increasing the influence of the initial latent state and mitigating the impact of noise. We illustrate that EasyInv is capable of delivering results that are on par with or exceed those of the conventional DDIM Inversion approach, especially under conditions where the model's precision is limited or computational resources are scarce. Concurrently, EasyInv offers an approximately threefold improvement in inference efficiency over off-the-shelf iterative optimization techniques. It can be added to almost any existing inversion method with only four lines of code.
Paperid:778
Authors:Hao Li · Qi Lv · Rui Shao · Xiang Deng · Yinchuan Li · Jianye Hao · Liqiang Nie
Abstract:
Transforming complex actions into discrete skill abstractions has demonstrated strong potential for robotic manipulation. Existing approaches mainly leverage latent variable models, e.g., VQ-VAE, to learn skill abstractions through learned vectors (codebooks), but they suffer from codebook collapse and struggle to model the causal relationships between learned skills. To address these limitations, we present Skill Training with Augmented Rotation (STAR), a framework that advances both skill learning and composition to complete complex behaviors. Specifically, to prevent codebook collapse, we devise rotation-augmented residual skill quantization (RaRSQ). It encodes relative angles between encoder outputs into the gradient flow by a rotation-based gradient mechanism. Points within the same skill code are forced to be either pushed apart or pulled closer together depending on gradient directions. Further, to capture the causal relationships between skills, we present the causal skill transformer (CST), which explicitly models dependencies between skill representations through an autoregressive mechanism for coherent action generation. Extensive experiments demonstrate the superiority of STAR on both the LIBERO benchmark and real-world tasks, with around 12% improvement over the baselines.
Paperid:779
Authors:Yiran Wang · Chenshu Liu · Yunfan Li · Sanae Amani · Bolei Zhou · Lin Yang
Abstract:
The exploration \& exploitation dilemma poses significant challenges in reinforcement learning (RL). Recently, curiosity-based exploration methods have achieved great success in tackling hard-exploration problems. However, they necessitate extensive hyperparameter tuning on different environments, which heavily limits the applicability and accessibility of this line of methods. In this paper, we characterize this problem via analysis of the agent behavior, concluding that choosing a proper hyperparameter is fundamentally difficult. We then identify the difficulty and instability of the optimization when the agent learns with curiosity. We propose our method, hyperparameter-robust exploration (\textbf{Hyper}), which extensively mitigates the problem by effectively regularizing the visitation of the exploration and decoupling the exploitation to ensure stable training. We theoretically justify that \textbf{Hyper} is provably efficient in the function approximation setting and empirically demonstrate its appealing performance and robustness in various environments.
Paperid:780
Authors:Hengyuan Hu · Aniket Das · Dorsa Sadigh · Nima Anari
Abstract:
Denoising Diffusion Probabilistic Models (DDPMs) have emerged as powerful tools for generative modeling. However, their sequential computation requirements lead to significant inference-time bottlenecks. In this work, we utilize the connection between DDPMs and Stochastic Localization to prove that, under an appropriate reparametrization, the increments of DDPM satisfy an exchangeability property. This general insight enables near-black-box adaptation of various performance optimization techniques from autoregressive models to the diffusion setting. To demonstrate this, we introduce _Autospeculative Decoding_ (ASD), an extension of the widely used speculative decoding algorithm to DDPMs that does not require any auxiliary draft models. Our theoretical analysis shows that ASD achieves a $\tilde{O}(K^{\frac{1}{3}})$ parallel runtime speedup over the $K$ step sequential DDPM. We also demonstrate that a practical implementation of autospeculative decoding accelerates DDPM inference significantly in various domains.
Paperid:781
Authors:Alexander Tyurin
Abstract:
We study the classical optimization problem $\min_{x \in \mathbb{R}^d} f(x)$ and analyze the gradient descent (GD) method in both nonconvex and convex settings. It is well-known that, under the $L$-smoothness assumption ($\|\nabla^2 f(x)\| \leq L$), the optimal point minimizing the quadratic upper bound $f(x_k) + \langle \nabla f(x_k), x_{k+1} - x_k \rangle + \frac{L}{2} \|x_{k+1} - x_k\|^2$ is $x_{k+1} = x_k - \gamma_k \nabla f(x_k)$ with step size $\gamma_k = \frac{1}{L}$. Surprisingly, a similar result can be derived under the $\ell$-generalized smoothness assumption ($\|\nabla^2 f(x)\| \leq \ell(\|\nabla f(x)\|)$). In this case, we derive the step size $$\gamma_k = \int_{0}^{1} \frac{dv}{\ell(\|\nabla f(x_k)\| + \|\nabla f(x_k)\| v)}.$$ Using this step size rule, we improve upon existing theoretical convergence rates and obtain new results in several previously unexplored setups.
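The step-size integral is easy to evaluate numerically. As an illustrative instance (our choice, not necessarily the paper's), take the $(L_0, L_1)$-smoothness function $\ell(s) = L_0 + L_1 s$, for which the integral has the closed form $\ln\big((L_0 + 2L_1 g)/(L_0 + L_1 g)\big)/(L_1 g)$ with $g = \|\nabla f(x_k)\|$; a midpoint-rule quadrature matches it, and a constant $\ell(s) = L$ recovers the classical $1/L$ step.

```python
import math

def step_size(ell, grad_norm, m=10_000):
    """Midpoint-rule evaluation of the generalized-smoothness step size
    gamma_k = integral over v in [0, 1] of dv / ell(g + g*v),
    where g is the gradient norm at x_k."""
    g = grad_norm
    return sum(1.0 / ell(g + g * ((i + 0.5) / m)) for i in range(m)) / m

# Example instance: ell(s) = L0 + L1*s with illustrative constants.
L0, L1, g = 2.0, 0.5, 3.0
gamma_num = step_size(lambda s: L0 + L1 * s, g)
gamma_closed = math.log((L0 + 2 * L1 * g) / (L0 + L1 * g)) / (L1 * g)
```

The numerical and closed-form values agree to high precision, and `step_size(lambda s: L, g)` returns `1/L` for any gradient norm, consistent with the classical rule.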
Paperid:782
Authors:Sally Zhu · Ahmed Ahmed · Rohith Kuditipudi · Percy Liang
Abstract:
Motivated by liability and intellectual property concerns over open-weight models, we consider the following problem: given the weights of two models, can we test whether they were trained independently---i.e., from independent random initializations? We consider two settings: constrained and unconstrained. In the constrained setting, we make assumptions about model architecture and training and propose a family of statistical tests that yield exact p-values with respect to the null hypothesis that the models are trained from independent random initializations. These p-values are valid regardless of the composition of either model's training data; we compute them by simulating exchangeable copies of each model under our assumptions and comparing various similarity measures of weights and activations between the original two models versus these copies. We report the p-values from these tests on pairs of 21 open-weight models (210 total pairs) and find we correctly identify all pairs of non-independent models. Notably, our tests remain effective even if one of the models was fine-tuned for many tokens; we accurately detect dependence between Llama 2-7B and Llemma-7B, even though the latter was fine-tuned on an additional 750B tokens (37.5% of the original Llama 2-7B training budget). In the unconstrained setting, where we make no assumptions about training procedures, allow changes to model architecture, and permit adversarial evasion attacks, the previous tests no longer work; notably, an adversary can evade detection with simple transformations of model weights (e.g., permuting hidden units) that do not change model output. Instead, we propose a new test which matches hidden activations between two models, and use it to construct a test that is robust to these transformations and to changes in model architecture; the test can also perform localized testing: identifying specific non-independent components of models.
Though we no longer obtain exact p-values from this test, empirically we find that it reliably distinguishes non-independent models, much as a p-value would. Notably, we can use the test to identify specific parts of one model that are derived from another (e.g., how Llama 3.1-8B was pruned to initialize Llama 3.2-3B, or shared layers between Mistral-7B and StripedHyena-7B), and it is even robust to retraining individual layers of either model from scratch.
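The evasion transformation mentioned above, permuting hidden units, is easy to demonstrate. Below is a sketch with a toy two-layer ReLU network and hypothetical weights (not any of the models discussed): permuting the rows of the first layer together with the columns of the second layer changes the weight matrices but leaves the computed function unchanged, which is why naive weight-similarity tests can be fooled.

```python
def mlp(x, W1, b1, W2):
    """Tiny 2-layer MLP: y = W2 @ relu(W1 @ x + b1), pure-Python lists."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    return [sum(w * hi for w, hi in zip(row, h)) for row in W2]

def permute_hidden(W1, b1, W2, perm):
    """Permute hidden units: reorder rows of (W1, b1) and, consistently,
    columns of W2. The resulting network computes the same function."""
    W1p = [W1[p] for p in perm]
    b1p = [b1[p] for p in perm]
    W2p = [[row[p] for p in perm] for row in W2]
    return W1p, b1p, W2p

# Hypothetical weights: 2 inputs, 3 hidden units, 1 output.
W1 = [[1.0, 2.0], [3.0, 4.0], [5.0, -6.0]]
b1 = [0.1, -0.2, 0.3]
W2 = [[1.0, -1.0, 2.0]]
x = [0.5, -1.0]
W1p, b1p, W2p = permute_hidden(W1, b1, W2, perm=[2, 0, 1])
```

The permuted weights no longer match the originals entry-wise, yet the outputs agree exactly; the activation-matching test described in the abstract is designed to see through precisely this kind of transformation.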
Paperid:783
Authors:Yao Shu · Wenyang Hu · See-Kiong Ng · Bryan Kian Hsiang Low · Fei Yu
Abstract:
Large Language Models (LLMs) have become indispensable in numerous real-world applications. However, fine-tuning these models at scale, especially in federated settings where data privacy and communication efficiency are critical, presents significant challenges. Existing approaches often resort to parameter-efficient fine-tuning (PEFT) to mitigate communication overhead, but this typically comes at the cost of model accuracy. To this end, we propose federated full-parameter tuning at scale for LLMs (Ferret), the first first-order method with shared randomness to enable scalable full-parameter tuning of LLMs across decentralized data sources while maintaining competitive model accuracy. Ferret accomplishes this through three aspects: (1) it employs widely used first-order methods for efficient local updates; (2) it projects these updates into a low-dimensional space to considerably reduce communication overhead; and (3) it reconstructs local updates from this low-dimensional space with shared randomness to facilitate effective full-parameter global aggregation, ensuring fast convergence and competitive final performance. Our rigorous theoretical analyses and insights, along with extensive experiments, show that Ferret significantly enhances the scalability of existing federated full-parameter tuning approaches by achieving high computational efficiency, reduced communication overhead, and fast convergence, all while maintaining competitive model accuracy.
Paperid:784
Authors:Matteo Gasparin · Aaditya Ramdas
Abstract:
Vovk (2015) introduced cross-conformal prediction, a modification of split conformal prediction designed to improve the width of prediction sets. The method, when trained with a miscoverage rate equal to $\alpha$ and $n \gg K$, ensures a marginal coverage of at least $1 - 2\alpha - 2(1-\alpha)(K-1)/(n+K)$, where $n$ is the number of observations and $K$ denotes the number of folds. A simple modification of the method achieves coverage of at least $1-2\alpha$. In this work, we propose new variants of both methods that yield smaller prediction sets without compromising the latter theoretical guarantee. The proposed methods are based on recent results deriving more statistically efficient combinations of p-values that leverage exchangeability and randomization. Simulations confirm the theoretical findings and bring out some important tradeoffs.
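The quoted coverage bound is easy to tabulate; a minimal helper (the formula is taken verbatim from the abstract):

```python
def cross_conformal_coverage(alpha, n, K):
    """Marginal coverage lower bound for cross-conformal prediction:
    1 - 2*alpha - 2*(1 - alpha)*(K - 1)/(n + K),
    where n is the number of observations and K the number of folds."""
    return 1 - 2 * alpha - 2 * (1 - alpha) * (K - 1) / (n + K)
```

For example, with `alpha = 0.05`, `n = 1000`, and `K = 10` the bound is about 0.883; as `n` grows (with `K` fixed) the penalty term vanishes and the bound approaches `1 - 2*alpha`, and with `K = 1` it equals `1 - 2*alpha` exactly.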
Paperid:785
Authors:Ahmad Rashid · Ruotian Wu · Rongqi Fan · Hongliang Li · Agustinus Kristiadi · Pascal Poupart
Abstract:
Reward-guided text generation (RGTG) has emerged as a viable alternative to offline reinforcement learning from human feedback (RLHF). RGTG methods can align baseline language models to human preferences without the further training required by RLHF methods such as PPO and DPO. However, they rely on a reward model to score each candidate token generated by the language model at inference, incurring significant overhead. Additionally, the reward model is trained to score full sequences only, which can lead to suboptimal choices for partial sequences. In this work, we present a novel reward model architecture which is trained, using a Bradley-Terry loss, to prefer the optimal expansion of a sequence with just a single call to the reward model. That is, a score for all possible candidate tokens is generated simultaneously, leading to efficient inference. We theoretically analyze RGTG reward models and demonstrate that baseline reward models prefer sub-optimal sequences compared to our method during inference. Empirically, our reward model leads to significantly faster inference than other RGTG methods, with fewer calls to the reward model and competitive performance compared to both RGTG and RLHF.
Paperid:786
Authors:Dongzhe Zheng · Wenjie Mei
Abstract:
Learning unknown dynamics under environmental (or external) constraints is fundamental to many fields (e.g., modern robotics), and is particularly challenging when constraint information is only locally available and uncertain. Existing approaches requiring global constraints or using probabilistic filtering fail to fully exploit the geometric structure inherent in local measurements (obtained, e.g., from sensors) and constraints. This paper presents a geometric framework unifying measurements, constraints, and dynamics learning through a fiber bundle structure over the state space. This naturally induced geometric structure enables measurement-aware Control Barrier Functions that adapt to local sensing (or measurement) conditions. By integrating Neural ODEs, our framework learns continuous-time dynamics while preserving geometric constraints, with theoretical guarantees of learning convergence and constraint satisfaction dependent on sensing quality. The geometric framework not only enables efficient dynamics learning but also suggests promising directions for integration with reinforcement learning approaches. Extensive simulations demonstrate significant improvements in both learning efficiency and constraint satisfaction over traditional methods, especially under limited and uncertain sensing conditions.
Paperid:787
Authors:Kei Sen Fong · Mehul Motani
Abstract:
Symbolic Regression (SR) algorithms select expressions based on prediction performance (i.e., R-squared) while also keeping the expression lengths short to produce more explainable white-box models. In this context, SR algorithms can be evaluated by measuring the extent to which the expressions discovered are Pareto-optimal, in the sense of having the best R-squared score for the expression length. This evaluation is most commonly done based on relative performance, in the sense that an SR algorithm is judged on whether it Pareto-dominates other SR algorithms selected in the analysis, without any indication of efficiency or achievable limits. In this paper, we explore absolute Pareto-optimal solutions instead, which have the optimal tradeoff between the multiple SR objectives (i.e., R-squared and expression size), by conducting an empirical investigation that exhaustively searches expressions of selected sizes. Specifically, we find an absolute Pareto-optimal (APO) front of expressions for 34 real-world datasets from the widely-used SR benchmark, SRBench, by performing exhaustive search. Additionally, we utilize eight different numerical optimization methods and analyze the performance differences to evaluate the choice of numerical optimization methods in state-of-the-art SR algorithms. We extract, for every real-world dataset, an APO front of expressions that can serve as a universal baseline for SR algorithms, informing researchers of the best achievable performance for selected sizes. The APO front we provide serves as a performance limit for SR algorithms and is made publicly available to reduce the computational burden of SR benchmarking.
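Extracting a Pareto front over the two objectives (expression size to minimize, R-squared to maximize) is a standard sweep; a minimal sketch with made-up (size, R-squared) pairs, not data from the paper:

```python
def pareto_front(candidates):
    """Given (expression_size, r_squared) pairs, return the Pareto front:
    candidates for which no other candidate has size <= and R^2 >=
    (with at least one strict). Sort by size ascending, R^2 descending,
    then keep each point whose R^2 beats the best seen so far."""
    front, best_r2 = [], float("-inf")
    for size, r2 in sorted(candidates, key=lambda c: (c[0], -c[1])):
        if r2 > best_r2:
            front.append((size, r2))
            best_r2 = r2
    return front
```

Sorting by descending R-squared within each size ensures that, of several expressions with the same length, only the best-fitting one survives; points like `(3, 0.55)` below are dropped because a shorter expression already fits better.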
Paperid:788
Authors:Qiyu Zhong · Yi Shan · Haobo Wang · Zhen Yang · Gengyu Lyu
Abstract:
In multi-view multi-label classification (MVML), each object has multiple heterogeneous views and is annotated with multiple labels. The key to dealing with such problems lies in capturing cross-view consistent correlations while excavating multi-label semantic relationships. Existing MVML methods usually employ two independent components to address these separately, ignoring their potential interactions. To address this issue, we propose a novel Tensorized MVML method named TMvML, which formulates an MVML tensor classifier to excavate comprehensive cross-view feature correlations while characterizing complete multi-label semantic relationships. Specifically, we first reconstruct the MVML mapping matrices as an MVML tensor classifier. Then, we rotate the tensor classifier and introduce a low-rank tensor constraint to ensure view-level feature consistency and label-level semantic co-occurrence simultaneously. To better characterize the low-rank tensor structure, we design a new Laplace Tensor Rank (LTR), which serves as a tighter surrogate of tensor rank to capture high-order fiber correlations within the tensor space. By conducting the above operations, our method can easily address the two key challenges in MVML via a concise LTR tensor classifier and achieve the extraction of both cross-view consistent correlations and multi-label semantic relationships simultaneously. Extensive experiments demonstrate that TMvML significantly outperforms state-of-the-art methods.
Paperid:789
Authors:Dongchi Huang · Kaige Zhang · Jiaqi WANG · Yang Li · Tianle Zhang · Chunhe Xia
Abstract:
Partial observability presents a significant challenge for safe reinforcement learning, as it impedes the identification of potential risks and rewards. Leveraging specific types of privileged information during training to mitigate the effects of partial observability has yielded notable empirical successes. In this paper, we propose Asymmetric Constrained Partially Observable Markov Decision Processes (ACPOMDPs) to theoretically examine the advantages of incorporating privileged information. Building upon ACPOMDPs, we propose the Privileged Information Guided Dreamer, a model-based safe reinforcement learning approach that leverages privileged information to enhance the agent's safety and performance through privileged representation alignment and an asymmetric actor-critic structure. Our empirical results on the Safety-Gymnasium benchmark demonstrate that our approach significantly outperforms existing methods in terms of safety and task-centric performance. Meanwhile, compared to alternative privileged model-based reinforcement learning methods, our approach exhibits superior performance and ease of training.
Paperid:790
Authors:Yuchen Guan · Runxi Cheng · Kang Liu · Chun Yuan
Abstract:
Knowledge distillation typically employs the Kullback-Leibler (KL) divergence to constrain the output of the student model to precisely match the soft labels provided by the teacher model. However, the optimization process of KL divergence is challenging for the student and prone to suboptimal points. Also, we demonstrate that the gradients provided by KL divergence depend on channel scale and thus tend to overlook low-probability channels. The mismatch in low-probability channels also results in the neglect of inter-class relationship information, making it difficult for the student to further enhance performance. To address this issue, we propose an auxiliary ranking loss based on Kendall’s $\tau$ coefficient, which can be used plug-and-play in any logit-based distillation method, providing inter-class relationship information and balancing the attention to low-probability channels. We show that the proposed ranking loss is less affected by channel scale, and its optimization objective is consistent with that of KL divergence. Extensive experiments on CIFAR-100, ImageNet, and COCO datasets, as well as various CNN and ViT teacher-student architecture combinations, demonstrate that the proposed ranking loss can be applied plug-and-play on various baselines and enhance their performance.
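The ranking loss above is built on Kendall's $\tau$, which scores how consistently two logit vectors order the channels. A minimal sketch of the plain (non-differentiable) coefficient, with assumed function names; a training loss would use a smooth surrogate:

```python
from itertools import combinations

def kendall_tau(teacher_logits, student_logits):
    """Kendall's tau rank correlation between two logit vectors.

    Counts concordant vs. discordant channel pairs: a pair (i, j) is
    concordant when teacher and student order the two channels the
    same way. tau = (concordant - discordant) / total_pairs.
    """
    n = len(teacher_logits)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        t = teacher_logits[i] - teacher_logits[j]
        s = student_logits[i] - student_logits[j]
        if t * s > 0:
            concordant += 1
        elif t * s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A student matching the teacher's channel ordering exactly scores $\tau = 1$; a fully reversed ordering scores $\tau = -1$, regardless of channel scale.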
Paperid:791
Authors:Zexu Sun · Qiyu Han · Hao Yang · Anpeng Wu · Minqin Zhu · Dugang Liu · Chen Ma · Yunpeng Weng · Xing Tang · xiuqiang He
Abstract:
In online platforms, incentives (\textit{e.g.}, discounts, coupons) are used to boost user engagement and revenue. Uplift modeling methods are developed to estimate user responses from observational data, often incorporating distribution balancing to address selection bias. However, these methods are limited to in-distribution testing data, which mirrors the training data distribution. In reality, user features change continuously due to time, geography, and other factors, especially on complex online marketing platforms. Thus, an effective uplift modeling method for out-of-distribution data is crucial. To address this, we propose a novel uplift modeling method, \textbf{I}nvariant \textbf{D}eep \textbf{U}plift \textbf{M}odeling (\textbf{IDUM}), which uses invariant learning to enhance out-of-distribution generalization by identifying causal factors that remain consistent across domains. IDUM further refines these features into necessary and sufficient factors and employs a masking component to reduce computational costs by selecting the most informative invariant features. A balancing discrepancy component is also introduced to mitigate selection bias in observational data. We conduct extensive experiments on public and real-world datasets to demonstrate IDUM's effectiveness in both in-distribution and out-of-distribution scenarios in online marketing. Furthermore, we also provide theoretical analysis and related proofs to support IDUM's generalizability.
Paperid:792
Authors:Lifu Liu · Shiyuan He · Jianhua Guo
Abstract:
Efficiently counting Markov equivalent directed acyclic graphs (DAGs) is crucial in graphical causal analysis. Wienöbst et al. (2023) introduced a polynomial-time algorithm, known as the clique-picking algorithm, to count Markov equivalent DAGs for a given completed partially directed acyclic graph (CPDAG). This algorithm iteratively selects a root clique, determines fixed orientations with outgoing edges from the clique, and generates the unresolved undirected connected components (UCCGs). In this work, we propose a more efficient approach to UCCG generation by utilizing previously computed results for different root cliques. Our method introduces the concept of super cliques within rooted clique trees, enabling their efficient transfer between trees with different root cliques. The proposed algorithm effectively reduces the computational complexity of the clique-picking method, particularly when the number of cliques is substantially smaller than the number of vertices and edges.
Paperid:793
Authors:Xuchuang Wang · Qirun Zeng · Jinhang Zuo · Xutong Liu · Mohammad Hajiesmaili · John C. S. Lui · Adam Wierman
Abstract:
This paper investigates the fusion of absolute (reward) and relative (dueling) feedback in stochastic bandits, where both feedback types are gathered in each decision round. We derive a regret lower bound, demonstrating that an efficient algorithm may incur only the smaller of the reward-based and dueling-based regret for each individual arm. We propose two fusion approaches: (1) a simple elimination fusion algorithm that leverages both feedback types to explore all arms and unifies collected information by sharing a common candidate arm set, and (2) a decomposition fusion algorithm that selects the more effective feedback to explore the corresponding arms and randomly assigns one feedback type for exploration and the other for exploitation in each round. The elimination fusion suffers a suboptimal multiplicative term in the number of arms in its regret due to the intrinsic suboptimality of dueling elimination. In contrast, the decomposition fusion achieves regret matching the lower bound up to a constant under a common assumption. Extensive experiments confirm the efficacy of our algorithms and theoretical results.
Paperid:794
Authors:Jiayu Liu · Zhenya Huang · Wei Dai · Cheng Cheng · Jinze Wu · Jing Sha · Song Li · Qi Liu · Shijin Wang · Enhong Chen
Abstract:
Although large language models (LLMs) show promise in solving complex mathematical tasks, existing evaluation paradigms rely solely on a coarse measure of overall answer accuracy, which is insufficient for assessing their authentic capabilities. In this paper, we propose \textbf{CogMath}, which comprehensively assesses LLMs' mathematical abilities through the lens of human cognition. Specifically, inspired by psychological theories, CogMath formalizes the human reasoning process into 3 stages: \emph{problem comprehension}, \emph{problem solving}, and \emph{solution summarization}. To carry out a scientific evaluation within each stage, we investigate perspectives such as numerical calculation, knowledge, and counterfactuals to design a total of 9 fine-grained evaluation dimensions. In each dimension, we develop an ``\emph{Inquiry}-\emph{Judge}-\emph{Reference}'' multi-agent system to generate inquiries that assess LLMs' mastery from this dimension. An LLM is considered to truly master a problem only when excelling in all inquiries from the 9 dimensions. By applying CogMath to three benchmarks that cover the full K-12 math curriculum, we reveal that the mathematical capabilities of 7 mainstream LLMs are overestimated by 30\%-40\%. Moreover, we locate their strengths and weaknesses across specific stages/dimensions, offering in-depth insights to further enhance their reasoning abilities.
Paperid:795
Authors:Xiaoyan Hu · Ho-fung Leung · Farzan Farnia
Abstract:
Selecting a sample generation scheme from multiple text-based generative models is typically addressed by choosing the model that maximizes an averaged evaluation score. However, this score-based selection overlooks the possibility that different models achieve the best generation performance for different types of text prompts. An online identification of the best generation model for various input prompts can reduce the costs associated with querying sub-optimal models. In this work, we explore the possibility of varying rankings of text-based generative models for different text prompts and propose an online learning framework to predict the best data generation model for a given input prompt. The proposed {\UCBName} algorithm addresses a contextual bandit (CB) setting with shared context variables across the arms, utilizing the generated data to update kernel-based functions that predict the score of each model available for unseen text prompts. Additionally, we leverage random Fourier features (RFF) to accelerate the online learning process of {\UCBName} and establish a $\widetilde{\mathcal{O}}(\sqrt{T})$ regret bound for the proposed RFF-based CB algorithm over $T$ iterations. Our numerical experiments on real and simulated text-to-image and image-to-text generative models show that RFF-UCB performs successfully in identifying the best generation model across different sample types.
Paperid:796
Authors:Steve Azzolin · SAGAR MALHOTRA · Andrea Passerini · Stefano Teso
Abstract:
Self-Explainable Graph Neural Networks (SE-GNNs) are popular explainable-by-design GNNs, but the properties and the limitations of their explanations are not well understood. Our first contribution fills this gap by formalizing the explanations extracted by SE-GNNs, referred to as Trivial Explanations (TEs), and comparing them to established notions of explanations, namely Prime Implicant (PI) and faithful explanations. Our analysis reveals that TEs match PI explanations for a restricted but significant family of tasks. In general, however, they can be less informative than PI explanations and are surprisingly misaligned with widely accepted notions of faithfulness. Although faithful and PI explanations are informative, they are intractable to find, and we show that they can be prohibitively large. Motivated by this, we propose Dual-Channel GNNs that integrate a white-box rule extractor and a standard SE-GNN, adaptively combining both channels when the task benefits. Our experiments show that even a simple instantiation of Dual-Channel GNNs can recover succinct rules and perform on par with or better than widely used SE-GNNs. Our code can be found in the supplementary material.
Paperid:797
Authors:Shan Zhang · Aotian Chen · Yanpeng Sun · Jindong Gu · Yi-Yu Zheng · Piotr Koniusz · Kai Zou · Anton Hengel · Yuan Xue
Abstract:
Mathematical diagrams have a distinctive structure, which standard feature transforms designed for natural images (e.g., CLIP) fail to process effectively, limiting their utility in multimodal large language models (MLLMs). Current efforts to improve MLLMs have primarily focused on scaling mathematical visual instruction datasets and strengthening LLM backbones, yet fine-grained visual recognition errors remain unaddressed. Our systematic evaluation of the visual grounding capabilities of state-of-the-art MLLMs highlights that fine-grained visual understanding remains a crucial bottleneck in visual mathematical reasoning (GPT-4o exhibits a 70\% grounding error rate, and correcting these errors improves reasoning accuracy by 12\%). We thus propose a novel approach featuring a geometrically-grounded vision encoder and a feature router that dynamically selects between hierarchical visual feature maps. Our model accurately recognizes visual primitives and generates precise visual prompts aligned with the language model's reasoning needs. In experiments, PRIMITIVE-Qwen2.5-7B outperforms other 7B models by 12\% on MathVerse and is on par with GPT-4V on MathVista. Our findings highlight the need for better fine-grained visual integration in MLLMs. We are preparing to release the model code and weights.
Paperid:798
Authors:Chiqiang Liu · Dazi Li
Abstract:
Cooperative multi-agent reinforcement learning faces significant challenges in effectively organizing agent relationships and facilitating information exchange, particularly when agents need to adapt their coordination patterns dynamically. This paper presents a novel framework that integrates dynamic spectral clustering with hypergraph neural networks to enable adaptive group formation and efficient information processing in multi-agent systems. The proposed framework dynamically constructs and updates hypergraph structures through spectral clustering on agents' state histories, enabling higher-order relationships to emerge naturally from agent interactions. The hypergraph structure is enhanced with attention mechanisms for selective information processing, providing an expressive and efficient way to model complex agent relationships. This architecture can be implemented in both value-based and policy-based paradigms through a unified objective combining task performance with structural regularization. Extensive experiments on challenging cooperative tasks demonstrate that our method significantly outperforms state-of-the-art approaches in both sample efficiency and final performance.
Paperid:799
Authors:Yucheng Li · Huiqiang Jiang · Chengruidong Zhang · Qianhui Wu · Xufang Luo · Surin Ahn · Amir Abdi · Dongsheng Li · Jianfeng Gao · Yuqing Yang · Lili Qiu
Abstract:
The integration of long-context capabilities with visual understanding opens up new possibilities for Vision Language Models (VLMs). However, the quadratic attention complexity during the pre-filling stage remains a major bottleneck, restricting wide deployment in real-world applications. To address this, we propose MAPSparse (Modality-Aware Permutation Sparse Attention), a dynamic sparse attention method that accelerates the pre-filling stage for long-context multi-modal inputs. First, our analysis reveals that the temporal and spatial locality of video input leads to a unique sparse pattern, the Grid pattern. Simultaneously, VLMs exhibit markedly different sparse distributions across different modalities. We introduce a permutation-based method to leverage the unique Grid pattern and handle the modality-boundary issue. By searching offline for the optimal sparse pattern for each head, MAPSparse constructs the sparse distribution dynamically based on the input. We also provide optimized GPU kernels for efficient sparse computations. Notably, MAPSparse integrates seamlessly into existing VLM pipelines without any model modifications or fine-tuning. Experiments on multi-modal benchmarks—including Video QA, Captioning, Vision-NIAH, and Mix-Modality-NIAH—with state-of-the-art long-context VLMs (LongVila and Llava-Video) show that MAPSparse accelerates the pre-filling stage by up to 8.3x at 1M tokens while maintaining competitive performance.
Paperid:800
Authors:Edward Pearce-Crump
Abstract:
Incorporating permutation equivariance into neural networks has proven to be useful in ensuring that models respect symmetries that exist in data. Symmetric tensors, which naturally appear in statistics, machine learning, and graph theory, are essential for many applications in physics, chemistry, and materials science, amongst others. However, existing research on permutation-equivariant models has not explored symmetric tensors as inputs, and most prior work on learning from these tensors has focused on equivariance to Euclidean groups. In this paper, we present two different characterisations of all linear permutation equivariant functions between symmetric power spaces of $\mathbb{R}^{n}$. We show on two tasks that these functions are highly data efficient compared to standard MLPs and have the potential to generalise well to symmetric tensors of different sizes.
Paperid:801
Authors:Yifang Chen · Xiaoyu Li · Yingyiu Liang · Zhenmei Shi · Zhao Song
Abstract:
We investigate the fundamental limits of transformer-based foundation models, extending our analysis to include Visual Autoregressive (VAR) transformers. Our primary contribution establishes that single-head VAR transformers with a single self-attention layer and a single interpolation layer are universal. From the statistical perspective, we prove that such simple VAR transformers are universal approximators for any image-to-image Lipschitz function. VAR represents a big step toward generating images using a novel, scalable, coarse-to-fine ``next-scale prediction'' framework. These models set a new quality bar, outperforming all previous methods, including Diffusion Transformers, while achieving state-of-the-art performance for image synthesis tasks. Our results provide important design principles for effective and computationally efficient VAR Transformer strategies that can be used to extend their utility to more sophisticated VAR models in image generation and other related areas.
Paperid:802
Authors:Xingyue Huang · Pablo Barcelo · Michael Bronstein · Ismail Ceylan · Mikhail Galkin · Juan Reutter · Miguel Romero Orth
Abstract:
Knowledge Graph Foundation Models (KGFMs) are at the frontier for deep learning on knowledge graphs (KGs), as they can generalize to completely novel knowledge graphs with different relational vocabularies. Despite their empirical success, our theoretical understanding of KGFMs remains very limited. In this paper, we conduct a rigorous study of the expressive power of KGFMs. Specifically, we show that the expressive power of KGFMs directly depends on the motifs that are used to learn the relation representations. We then observe that the most typical motifs used in the existing literature are binary, as the representations are learned based on how pairs of relations interact, which limits the model's expressiveness. As part of our study, we design more expressive KGFMs using richer motifs, which necessitate learning relation representations based on, e.g., how triples of relations interact with each other. Finally, we empirically validate our theoretical findings, showing that the use of richer motifs results in better performance on a wide range of datasets drawn from different domains.
Paperid:803
Authors:Idan Achituve · Idit Diamant · Hai Victor Habi · Amir Rosenfeld · Arnon Netzer · Ethan Fetaya
Abstract:
In image processing, solving inverse problems is the task of finding plausible reconstructions of an image that was corrupted by some (usually known) degradation model. Commonly, this process is done using a generative image model that can guide the reconstruction towards solutions that appear natural. The success of diffusion models over the last few years has made them a leading candidate for this task. However, the sequential nature of diffusion models makes this conditional sampling process challenging. Furthermore, since diffusion models are often defined in the latent space of an autoencoder, the encoder-decoder transformations introduce additional difficulties. Here, we suggest a novel sampling method based on sequential Monte Carlo (SMC) in the latent space of diffusion models. We use the forward process of the diffusion model to add additional auxiliary observations and then perform an SMC sampling as part of the backward process. Empirical evaluations on ImageNet and FFHQ show the benefits of our approach over competing methods on various inverse problem tasks.
Paperid:804
Authors:Siddhant Agarwal · Harshit Sikchi · Peter Stone · Amy Zhang
Abstract:
Having explored an environment, intelligent agents should be able to transfer their knowledge to most downstream tasks within that environment without additional interactions. Referred to as "zero-shot learning," this ability remains elusive for general-purpose reinforcement learning algorithms. While recent works have attempted to produce zero-shot RL agents, they make assumptions about the nature of the tasks or the structure of the MDP. We present Proto Successor Measure: the basis set for all possible behaviors of a Reinforcement Learning Agent in a dynamical system. We prove that any possible behavior (represented using visitation distributions) can be represented using an affine combination of these policy-independent basis functions. Given a reward function at test time, we simply need to find the right set of linear weights to combine these bases corresponding to the optimal policy. We derive a practical algorithm to learn these basis functions using reward-free interaction data from the environment and show that our approach can produce the optimal policy at test time for any given reward function without additional environmental interactions.
Paperid:805
Authors:Jingxiang Qu · Wenhan Gao · Jiaxing Zhang · Xufeng Liu · Hua Wei · Haibin Ling · Yi Liu
Abstract:
3D Geometric Graph Neural Networks (GNNs) have emerged as transformative tools for modeling molecular data. Despite their predictive power, these models often suffer from limited interpretability, raising concerns for scientific applications that require reliable and transparent insights. While existing methods have primarily focused on explaining molecular substructures in 2D GNNs, the transition to 3D GNNs introduces unique challenges, such as handling the implicit dense edge structures created by a cutoff radius. To tackle this, we introduce a novel explanation method specifically designed for 3D GNNs, which localizes the explanation to the immediate neighborhood of each node within the 3D space. Each node is assigned a radius of influence, defining the localized region within which message passing captures spatial and structural interactions crucial for the model's predictions. This method leverages the spatial and geometric characteristics inherent in 3D graphs. By constraining the subgraph to a localized radius of influence, the approach not only enhances interpretability but also aligns with the physical and structural dependencies typical of 3D graph applications, such as molecular learning.
Paperid:806
Authors:Zichen Liu · Guoji Fu · Chao Du · Wee Sun Lee · Min Lin
Abstract:
Continual reinforcement learning (CRL) refers to a naturalistic setting where an agent needs to endlessly evolve, by trial and error, to solve multiple tasks that are presented sequentially. One of the largest obstacles to CRL is that the agent may forget how to solve previous tasks when learning a new task, known as catastrophic forgetting. In this paper, we propose to address this challenge by planning with online world models. Specifically, we learn a Follow-The-Leader shallow model online to capture the world dynamics, in which we plan using model predictive control to solve a set of tasks specified by any reward functions. The online world model is immune to forgetting by construction, with a proven regret bound of $\mathcal{O}(\sqrt{K^2D\log(T)})$ under mild assumptions. The planner searches actions solely based on the latest online model, thus forming a FTL Online Agent (OA) that updates incrementally. To assess OA, we further design Continual Bench, a dedicated environment for CRL, and compare with several strong baselines under the same model-planning algorithmic framework. The empirical results show that OA learns continuously to solve new tasks while not forgetting old skills, outperforming agents built on deep world models with various continual learning techniques.
Paperid:807
Authors:Xiangyu Qu · Guojing Liu · Liang Li
Abstract:
Non-autoregressive Transformers (NATs) have garnered significant attention due to their efficient decoding compared to autoregressive methods. However, existing conditional dependency modeling schemes based on masked language modeling introduce a training-inference gap in NATs. For instance, while NATs sample target words during training to enhance input, this condition cannot be met during inference, and simply annealing the sampling rate to zero during training leads to model performance degradation. We demonstrate that this training-inference gap prevents NATs from fully realizing their potential. To address this, we propose an adaptive end-to-end quantization alignment training framework, which introduces a semantic consistency space to adaptively align NAT training, eliminating the need for target information and thereby bridging the training-inference gap. Experimental results demonstrate that our method achieves state-of-the-art performance among fully NAT models on major WMT benchmarks, delivering performance on par with autoregressive Transformer (AT) while being 17.0 times more efficient in inference.
Paperid:808
Authors:Soumya Basu
Abstract:
We study bandit learning in matching markets with two-sided reward uncertainty, extending prior research primarily focused on single-sided uncertainty. Leveraging the concept of `super-stability' from \citet{irving1994stable}, we demonstrate the advantage of the Extended Gale-Shapley (GS) algorithm over the standard GS algorithm in achieving true stable matchings under incomplete information. By employing the Extended GS algorithm, our centralized algorithm attains a logarithmic pessimal stable regret dependent on an instance-dependent admissible gap parameter. This algorithm is further adapted to a decentralized setting with a constant regret increase. Finally, we establish a novel centralized instance-dependent lower bound for binary stable regret, elucidating the roles of the admissible gap and super-stable matching in characterizing the complexity of stable matching with bandit feedback.
Paperid:809
Authors:Bhavna Gopal · Huanrui Yang · Jingyang Zhang · Mark Horton · Yiran Chen
Abstract:
Adversarial training (AT) enhances neural network robustness. Typically, AT updates all trainable parameters, but can lead to overfitting and increased errors on clean data. Research suggests that fine-tuning specific parameters may be more effective; however, methods for identifying these essential parameters and establishing effective optimization objectives remain inadequately addressed. We present CLAT, an innovative adversarial fine-tuning algorithm that mitigates adversarial overfitting by integrating "criticality" into the training process. Instead of tuning the entire model, CLAT identifies and fine-tunes fewer parameters in robustness-critical layers—those predominantly learning non-robust features—while keeping the rest of the model fixed. Additionally, CLAT employs a dynamic layer selection process that adapts to changes in layer criticality during training. Empirical results demonstrate that CLAT can be seamlessly integrated with existing adversarial training methods, enhancing clean accuracy and adversarial robustness by over 2% compared to baseline approaches.
Paperid:810
Authors:Cheryl Li · Tianyuan Xu · Yiwen Guo
Abstract:
Chain-of-Thought (CoT) prompting has shown promise in enhancing the reasoning capabilities of large language models (LLMs) by generating natural language (NL) rationales that lead to the final answer. However, it struggles with numerical computation, which has led to the development of program-aided techniques. Despite their potential, a persistent challenge remains: inconsistencies between LLM-reported reasoning steps and the logic in generated programs, which we term ``reasoning hallucinations.'' This stems from the inherent ambiguities of NL and the statistical nature of LLMs, which often lack rigorous logical coherence. To address this challenge, we propose a novel test-time scaling framework, Reasoning-as-Logic-Units (RaLU), which constructs a more reliable reasoning path by aligning logical units between the generated program and their corresponding NL descriptions. By decomposing the initially generated program into discrete units using static analysis, RaLU engages in an iterative dialogue with the LLM to judge, refine, and explain each unit. A rewind-and-correct mechanism ensures alignment between code statements and task requirements in each unit, ultimately forming a cohesive reasoning path under the program's logic, from which the model reaches a final solution. Our experiments demonstrate that RaLU significantly outperforms existing baselines in mathematical reasoning (GSM8K, MATH) and algorithmic reasoning (HumanEval+, MBPP+), underscoring its potential to advance LLM reasoning and programming by offering enhanced accuracy and interpretability.
Paperid:811
Authors:Mehrsa Pourya · Erich Kobler · Michael Unser · Sebastian Neumayer
Abstract:
State-of-the-art image reconstruction often relies on complex, highly parameterized deep architectures. We propose an alternative: a data-driven reconstruction method inspired by the classic Tikhonov regularization. Our approach iteratively refines intermediate reconstructions by solving a sequence of quadratic problems. These updates have two key components: (i) learned filters to extract salient image features, and (ii) an attention mechanism that locally adjusts the penalty of filter responses. Our method achieves performance on par with leading plug-and-play and learned regularizer approaches while offering interpretability, robustness, and convergent behavior. In effect, we bridge traditional regularization and deep learning with a principled reconstruction approach.
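Each refinement step above solves a quadratic problem that balances data fidelity against a penalized filter response. A scalar toy sketch of one such step, with all operators reduced to scalars (names and the gradient-descent solver are assumptions for illustration; the actual method uses learned filters and attention-weighted penalties):

```python
def tikhonov_step(x, y, a=1.0, w=1.0, penalty=1.0, step=0.1, n_grad=200):
    """Approximately solve the 1-D quadratic subproblem
        min_x (a*x - y)**2 + penalty * (w*x)**2
    by gradient descent. Here a stands in for a forward operator,
    w for a learned filter, and `penalty` for a locally adjusted
    weight on the filter response.
    """
    for _ in range(n_grad):
        grad = 2 * a * (a * x - y) + 2 * penalty * w * (w * x)
        x -= step * grad
    return x
```

With `a = w = penalty = 1`, the subproblem has the classic Tikhonov closed form `x* = y / 2`, which the iteration converges to.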
Paperid:812
Authors:Hung Manh Pham · Aaqib Saeed · Dong Ma
Abstract:
The accurate interpretation of Electrocardiogram (ECG) signals is pivotal for diagnosing cardiovascular diseases. Integrating ECG signals with their accompanying textual reports holds immense potential to enhance clinical diagnostics through the combination of physiological data and qualitative insights. However, this integration faces significant challenges due to inherent modality disparities and the scarcity of labeled data for robust cross-modal learning. To address these obstacles, we propose D-BETA, a novel framework that pre-trains ECG and text data using a contrastive masked auto-encoder architecture. D-BETA uniquely combines generative strengths with enhanced discriminative capabilities to achieve robust cross-modal representations. This is accomplished through masked modality modeling, specialized loss functions, and an improved negative sampling strategy tailored for cross-modal alignment. Extensive experiments on five public datasets across diverse downstream tasks demonstrate that D-BETA significantly outperforms existing methods, achieving an average AUC improvement of 15\% in linear probing with only one percent of training data and 2\% in zero-shot performance without requiring training data, over state-of-the-art models. These results highlight the effectiveness of D-BETA, underscoring its potential to advance automated clinical diagnostics through multi-modal representations.
Paperid:813
Authors:Pedro Cisneros-Velarde · Bhavesh Shrimali · Arindam Banerjee
Abstract:
Neural Operators that directly learn mappings between function spaces, such as Deep Operator Networks (DONs) and Fourier Neural Operators (FNOs), have received considerable attention. Despite the universal approximation guarantees for DONs and FNOs, there is currently no optimization convergence guarantee for learning such networks using gradient descent (GD). In this paper, we address this open problem by presenting a unified framework for optimization based on GD and applying it to establish convergence guarantees for both DONs and FNOs. In particular, we show that the losses associated with both of these neural operators satisfy two conditions—restricted strong convexity (RSC) and smoothness—that guarantee a decrease in their loss values under GD. Remarkably, these two conditions are satisfied by each neural operator for different reasons, stemming from the architectural differences of the respective models. One takeaway that emerges from the theory is that wider networks should lead to better optimization convergence for both DONs and FNOs. We present empirical results on canonical operator learning problems to support our theoretical results.
Paperid:814
Authors:Lusen Zhao · Zihan Huang · Ding Jianhao · Zhaofei Yu
Abstract:
ANN-to-SNN conversion has emerged as a key approach to training Spiking Neural Networks (SNNs), particularly for Transformer architectures, as it maps pre-trained ANN parameters to SNN equivalents without requiring retraining, thereby preserving ANN accuracy while eliminating training costs. Among the various coding methods used in ANN-to-SNN conversion, time-to-first-spike (TTFS) coding, which allows each neuron to fire at most one spike, offers significantly lower energy consumption. However, while previous TTFS-based SNNs have achieved performance comparable to convolutional ANNs, the attention mechanism and nonlinear layers in Transformer architectures remain a challenge for existing SNNs with TTFS coding. This paper proposes a new neuron structure for TTFS coding that expands its representational range and enhances its capability to process nonlinear functions, along with detailed designs of nonlinear neurons for the different layers in Transformers. Experimental results on different models demonstrate that our proposed method can achieve high accuracy with significantly lower energy consumption. To the best of our knowledge, this is the first work to focus on converting Transformers to SNNs with TTFS coding.
Paperid:815
Authors:Guozheng Ma · Lu Li · Zilin Wang · Li Shen · Pierre-Luc Bacon · Dacheng Tao
Abstract:
Effectively scaling up deep reinforcement learning models has proven notoriously difficult due to network pathologies during training, motivating various targeted interventions such as periodic resets and architectural advances such as layer normalization. Instead of pursuing more complex modifications, we show that introducing static network sparsity alone can unlock further scaling potential beyond that of dense counterparts with state-of-the-art architectures. This is achieved through simple one-shot random pruning, where a predetermined percentage of network weights are randomly removed once before training. Our analysis reveals that, in contrast to naively scaling up dense DRL networks, such sparse networks achieve both higher parameter efficiency for network expressivity and stronger resistance to optimization challenges like plasticity loss and gradient interference. We further extend our evaluation to visual and streaming RL scenarios, demonstrating the consistent benefits of network sparsity.
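The one-shot random pruning step described in this abstract is simple enough to sketch. The following is our illustrative reading of it, not the authors' code; function and variable names are ours, and the toy "network" is just two weight matrices:

```python
import numpy as np

def one_shot_random_prune(weights, sparsity, seed=0):
    """Remove a fixed fraction of weights uniformly at random,
    once, before training begins; the masks stay fixed thereafter."""
    rng = np.random.default_rng(seed)
    masks = {}
    for name, w in weights.items():
        # Keep each weight independently with probability (1 - sparsity).
        mask = rng.random(w.shape) >= sparsity
        masks[name] = mask
        w *= mask  # pruned weights are zeroed and would stay zero during training
    return masks

# Toy "network": two dense layers initialized to ones for visibility.
weights = {"layer1": np.ones((64, 64)), "layer2": np.ones((64, 10))}
masks = one_shot_random_prune(weights, sparsity=0.9)
density = weights["layer1"].mean()  # roughly 0.1 of the weights survive
```

In an actual DRL training loop the mask would be re-applied (or gradients masked) after every optimizer step so the pruned connections never revive.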
Paperid:816
Authors:Dang Nguyen · Zeman Li · MohammadHossein Bateni · Vahab Mirrokni · Meisam Razaviyayn · Baharan Mirzasoleiman
Abstract:
Synthetic data has the potential to improve performance and training efficiency while preserving the privacy of real training examples. Nevertheless, existing approaches for synthetic text generation are mostly heuristic and cannot generate human-readable text without compromising the privacy of real data, or provide performance guarantees for training Large Language Models (LLMs). In this work, we propose the first theoretically rigorous approach for generating synthetic human-readable text that guarantees the convergence and performance of LLMs during fine-tuning on a target task. To do so, we leverage the Alternating Direction Method of Multipliers (ADMM), which iteratively optimizes the embeddings of synthetic examples to match the gradient of the target training or validation data, and maps them to a sequence of text tokens with low perplexity. In doing so, the generated synthetic text can guarantee convergence of the model to a close neighborhood of the solution obtained by fine-tuning on real data. At the same time, the generated text is guaranteed to be different from the real data. Experiments on various classification tasks confirm the effectiveness of our proposed approach.
Paperid:817
Authors:Lianbo Ma · Jianlun Ma · Yuee Zhou · Guoyang Xie · Qiang He · Zhichao Lu
Abstract:
Mixed Precision Quantization (MPQ) has become an essential technique for optimizing neural networks by determining the optimal bitwidth per layer. Existing MPQ methods, however, face a major hurdle: they require a computationally expensive search for quantization strategies on large-scale datasets. To resolve this issue, we introduce a novel approach that first searches for quantization strategies on small datasets and then generalizes them to large-scale datasets. This approach simplifies the process, eliminating the need for large-scale quantization fine-tuning and only necessitating model weight adjustment. Our method is characterized by three key techniques: sharpness-aware minimization for enhanced quantized model generalization, implicit gradient direction alignment to handle gradient conflicts among different optimization objectives, and an adaptive perturbation radius to accelerate optimization. It offers advantages such as no intricate computation of feature maps and high search efficiency. Both theoretical analysis and experimental results validate our approach. Using the CIFAR-10 dataset (just 0.5\% the size of the ImageNet training data) for MPQ policy search, we achieved equivalent accuracy on ImageNet with a significantly lower computational cost, while improving efficiency by up to 150\% over the baselines. Codes are available at https://anonymous.4open.science/r/ASGA-main-2221/.
Paperid:818
Authors:Puning Yang · Qizhou Wang · Zhuo Huang · Tongliang Liu · Chengqi Zhang · Bo Han
Abstract:
Loss reweighting has shown significant benefits for machine unlearning with large language models (LLMs). However, its exact functionality is left unclear and the optimal strategy remains an open question, thus impeding the understanding and improvement of existing methodologies. In this paper, we identify two distinct goals of loss reweighting, namely Saturation and Importance: the former indicates that insufficiently optimized data should be emphasized, while the latter stresses critical data that are most influential for loss minimization. To study their usefulness, we design specific reweighting strategies for each goal and evaluate their respective effects on unlearning. We conduct extensive empirical analyses on well-established benchmarks, and summarize some important observations as follows: (i) Saturation enhances efficacy more than importance-based reweighting, and their combination can yield additional improvements. (ii) Saturation typically allocates lower weights to data with lower likelihoods, whereas importance-based reweighting does the opposite. (iii) The efficacy of unlearning is also largely influenced by the smoothness and granularity of the weight distributions. Based on these findings, we propose SatImp, a simple reweighting method that combines the advantages of both saturation and importance. Empirical results on extensive datasets validate the efficacy of our method, potentially bridging existing research gaps and indicating directions for future research.
Paperid:819
Authors:Clara Fannjiang · Ji Won Park
Abstract:
Algorithms for machine learning-guided design, or design algorithms, use machine learning-based predictions to propose novel objects with desired property values. Given a new design task—for example, to design novel proteins with high binding affinity to a therapeutic target—one must choose a design algorithm and specify any hyperparameters and predictive and/or generative models involved. How can these decisions be made such that the resulting designs are successful? This paper proposes a method for design algorithm selection, which aims to select design algorithms that will produce a distribution of design labels satisfying a user-specified success criterion—for example, that at least ten percent of designs' labels exceed a threshold. It does so by combining designs' predicted property values with held-out labeled data to reliably forecast characteristics of the label distributions produced by different design algorithms, building upon prediction-powered inference techniques (Angelopoulos et al., 2023). The method is guaranteed with high probability to return design algorithms that yield successful label distributions (or the null set if none exist), if the density ratios between the design and labeled data distributions are known. We demonstrate the method's effectiveness in simulated protein and RNA design tasks, in settings with both known and estimated density ratios.
Paperid:820
Authors:Junyi Lu · Lili Jiang · Xiaojia Li · Jianbing Fang · Fengjun Zhang · Li Yang · Chun Zuo
Abstract:
The complexity of code reviews has driven efforts to automate review comments, but prior approaches oversimplify this task by treating it as snippet-level code-to-text generation and relying on text similarity metrics like BLEU for evaluation. These methods overlook repository context, real-world merge request evaluation, and defect detection, limiting their practicality. To address these issues, we explore the full automation pipeline within the online recommendation service of a company with nearly 400 million daily active users, analyzing industry-grade codebases comprising hundreds of thousands of lines of code. We identify four key challenges: 1) capturing relevant context, 2) improving key bug inclusion (KBI), 3) reducing false alarm rates (FAR), and 4) integrating human workflows. To tackle these, we propose 1) code slicing algorithms for context extraction, 2) a multi-agent LLM framework for KBI, 3) a filtering mechanism for FAR reduction, and 4) a novel prompt design for better human interaction. Our approach, validated on real-world merge requests from historical fault reports, achieves a 2× improvement over standard LLMs and a 10× gain over previous baselines.
Paperid:821
Authors:Qilin Liao · Shuo Yang · Bo Zhao · Ping Luo · Hengshuang Zhao
Abstract:
Harnessing the power of diffusion models to synthesize auxiliary training data based on latent space features has proven effective in enhancing out-of-distribution (OOD) detection performance. However, extracting effective features outside the in-distribution (ID) boundary in latent space remains challenging due to the difficulty of identifying decision boundaries between classes. This paper proposes a novel framework called Boundary-based Out-Of-Distribution data generation (BOOD), which synthesizes high-quality OOD features and generates human-compatible outlier images using diffusion models. BOOD first learns a text-conditioned latent feature space from the ID dataset, selects ID features closest to the decision boundary, and perturbs them to cross the decision boundary to form OOD features. These synthetic OOD features are then decoded into images in pixel space by a diffusion model. Compared to previous works, BOOD provides a more efficient strategy for synthesizing informative OOD features, facilitating clearer distinctions between ID and OOD data. Extensive experimental results on common benchmarks demonstrate that BOOD surpasses the state-of-the-art method significantly, achieving a 29.64\% decrease in average FPR95 (40.31\% vs. 10.67\%) and a 7.27\% improvement in average AUROC (90.15\% vs. 97.42\%) on the CIFAR-100 dataset.
Paperid:822
Authors:Zhonghao He · Tianyi Qiu · Tejasveer Chugh · Max Kleiman-Weiner
Abstract:
The training and deployment of large language models (LLMs) induce a feedback loop: models continually learn human beliefs from data, reinforce user beliefs with generated content, reabsorb those reinforced beliefs, and again feed them back to users, creating dynamics resembling an echo chamber. We articulate the hypothesis that this feedback loop with LLMs entrenches the existing values and factual beliefs of users, leading to diversity loss and potentially the lock-in of ideas. Prompted by observations of diversity loss in real-world ChatGPT usage data, we study the lock-in hypothesis through data mining, agent-based LLM simulation, and formal modeling.
Paperid:823
Authors:Wanyun Xie · Francesco Tonin · Volkan Cevher
Abstract:
Training data mixtures greatly impact the generalization performance of large language models. Existing domain reweighting methods often rely on costly weight computations and require retraining when new data is introduced. To this end, we introduce a flexible and efficient data mixing framework, Chameleon, that employs leverage scores to quantify domain importance within a learned embedding space. We first construct a domain affinity matrix over domain embeddings. The induced leverage scores determine a mixture that upweights domains sharing common representations in embedding space. This formulation allows direct transfer to new data by computing the new domain embeddings. In experiments, we demonstrate improvements over three key scenarios: (i) our computed weights improve performance on pretraining domains with a fraction of the compute of existing methods; (ii) Chameleon can adapt to data changes without proxy retraining, boosting few-shot reasoning accuracies when transferred to new data; (iii) our method enables efficient domain reweighting in fine-tuning, consistently improving test perplexity on all fine-tuning domains over a uniform mixture.
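Leverage scores over a set of embeddings are a standard linear-algebra object. As a minimal sketch of the kind of quantity involved (our reading only: the abstract's affinity-matrix construction and learned embedding space are not reproduced here, and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.standard_normal((5, 3))  # one embedding row per training domain

# Leverage score of domain i: the i-th diagonal entry of the hat matrix
# E (E^T E)^{-1} E^T; it measures how hard that row is to express
# as a combination of the others.
H = E @ np.linalg.inv(E.T @ E) @ E.T
scores = np.diag(H)

# Normalize into a sampling mixture over domains.
mixture = scores / scores.sum()
```

The scores lie in [0, 1] and sum to the rank of the embedding matrix, so the normalization above always yields a valid mixture.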
Paperid:824
Authors:Josh Givens · Song Liu · Henry Reeve
Abstract:
Score matching is a vital tool for learning the distribution of data with applications across many areas including diffusion processes, energy-based modelling, and graphical model estimation. Despite all these applications, little work explores its use when data is incomplete. We address this by adapting score matching (and its major extensions) to work with missing data in a flexible setting where data can be partially missing over any subset of the coordinates. We provide two separate score matching variations for general use, an importance weighting (IW) approach, and a variational approach. We provide finite sample bounds for our IW approach in finite domain settings and show it to have especially strong performance in small sample lower dimensional cases. Complementing this, we show our variational approach to be strongest in more complex high-dimensional settings which we demonstrate on graphical model estimation tasks with both real and simulated data.
Paperid:825
Authors:Pedram Pad · Hadi Hammoud · Mohamad Dia · nadim maamari · Liza Dunbar
Abstract:
Feature selection is a critical step in data-driven applications, reducing input dimensionality to enhance learning accuracy, computational efficiency, and interpretability. Existing state-of-the-art methods often require post-selection retraining and extensive hyperparameter tuning, complicating their adoption. We introduce a novel, non-intrusive feature selection layer that, given a target feature count $k$, automatically identifies and selects the $k$ most informative features during neural network training. Our method is uniquely simple, requiring no alterations to the loss function, network architecture, or post-selection retraining. The layer is mathematically elegant and can be fully described by $$\tilde{x}_i = a_i x_i + (1-a_i)z_i,$$ where $x_i$ is the input feature, $\tilde{x}_i$ the output, $z_i$ Gaussian noise, and $a_i$ a trainable gain such that $\sum_i{a_i^2}=k$. This formulation induces an automatic clustering effect, driving $k$ of the $a_i$ gains to $1$ (selecting informative features) and the rest to $0$ (discarding redundant ones) via weighted noise distortion and gain normalization. Despite its extreme simplicity, our method delivers state-of-the-art performance on standard benchmark datasets and a novel real-world dataset, outperforming or matching existing approaches without requiring a hyperparameter search for $k$ or retraining. Theoretical analysis in the context of linear regression further validates its efficacy. Our work demonstrates that simplicity and performance are not mutually exclusive, offering a powerful yet straightforward tool for feature selection in machine learning.
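The layer equation quoted in this abstract admits a short numerical sketch. Everything beyond the formula itself (function names, the renormalization placement, the initialization) is our assumption, not the authors' implementation:

```python
import numpy as np

def noisy_gate_layer(x, a, k, rng):
    """Forward pass of the described layer:
    x_tilde_i = a_i * x_i + (1 - a_i) * z_i, with sum_i a_i^2 = k."""
    # Renormalize the trainable gains so the constraint sum(a_i^2) = k holds.
    a = a * np.sqrt(k / np.sum(a ** 2))
    z = rng.standard_normal(x.shape)  # Gaussian noise replaces gated-out inputs
    return a * x + (1.0 - a) * z, a

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
a = np.full(8, 0.5)  # initial gains, before any training
x_tilde, a_norm = noisy_gate_layer(x, a, k=3, rng=rng)
constraint = np.sum(a_norm ** 2)  # equals k = 3 up to float error
```

During training, gradients through the noise term would push gains toward 1 for informative features and toward 0 for the rest; the sketch only shows the forward computation and the gain normalization.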
Paperid:826
Authors:Zhicong Huang · Zhicong Huang · Xinwen Hou · Cheng Hong
Abstract:
As Diffusion Models (DMs) generate increasingly realistic images, related issues such as copyright and misuse have become a growing concern. Watermarking is one of the promising solutions. Existing methods inject the watermark into the single domain of the initial Gaussian noise for generation, which suffers from unsatisfactory robustness. This paper presents the first dual-domain DM watermarking approach, using a pipelined injector to consistently embed watermarks in both the spatial and frequency domains. To further boost robustness against certain image manipulations and advanced attacks, we introduce a model-independent learnable Gaussian Noise Restorer (GNR) to refine Gaussian noise extracted from manipulated images and enhance detection robustness by integrating the detection scores of both watermarks. GaussMarker efficiently achieves state-of-the-art performance under eight image distortions and four advanced attacks across three versions of Stable Diffusion with better recall and lower false positive rates, as preferred in real applications.
Paperid:827
Authors:Zhaoxuan Kan · Jianan Mu · Husheng Han · shangyi shi · Tenghui Hua · Hang Lu · Xiaowei Li · Xing Hu
Abstract:
Graph Convolutional Neural Networks (GCNs) have gained widespread popularity in various fields like personal healthcare and financial systems, due to their remarkable performance. Despite the growing demand for cloud-based GCN services, privacy concerns over sensitive graph data remain significant. Homomorphic Encryption (HE) facilitates Privacy-Preserving Machine Learning (PPML) by allowing computations to be performed on encrypted data. However, HE introduces substantial computational overhead, particularly for GCN operations that require rotations and multiplications in matrix products. The sparsity of GCNs offers significant performance potential, but their irregularity introduces additional operations that reduce practical gains. In this paper, we propose FicGCN, a HE-based framework specifically designed to harness the sparse characteristics of GCNs and strike a globally optimal balance between aggregation and combination operations. FicGCN employs a latency-aware packing scheme, a Sparse Intra-Ciphertext Aggregation (SpIntra-CA) method to minimize rotation overhead, and a region-based data reordering driven by local adjacency structure. We evaluated FicGCN on several popular datasets, and the results show that FicGCN achieved the best performance across all tested datasets, with up to a $4.10\times$ improvement over the latest design.
Paperid:828
Authors:Tong Wu · Junzhe Shen · Zixia Jia · Yuxuan Wang · Zilong Zheng
Abstract:
Generating ultra-long sequences with large language models (LLMs) has become increasingly crucial but remains a highly time-intensive task, particularly for sequences up to 100K tokens. While traditional speculative decoding methods exist, simply extending their generation limits fails to accelerate the process and can be detrimental. Through an in-depth analysis, we identify three major challenges hindering efficient generation: frequent model reloading, dynamic key-value (KV) management and repetitive generation. To address these issues, we introduce TokenSwift, a novel framework designed to substantially accelerate the generation process of ultra-long sequences while maintaining the target model's inherent quality. Experimental results demonstrate that TokenSwift achieves over $\mathbf{3\times}$ speedup across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time savings for ultra-long sequence generation, establishing TokenSwift as a scalable and effective solution at unprecedented lengths.
Paperid:829
Authors:Rui Ai · Boxiang Lyu · Zhaoran Wang · Zhuoran Yang · Haifeng Xu
Abstract:
We develop a framework for capturing the instrumental value of data production processes, which accounts for two key factors: (a) the context of the agent's decision-making; (b) how much data or information the buyer already possesses. We "micro-found" our data valuation function by establishing its connection to classic notions of signals and information design in economics. When instantiated in Bayesian linear regression, our value naturally corresponds to information gain. Applying our proposed data value for monopoly pricing, we show that if the seller can fully customize data production, she can extract the first-best revenue (i.e., full surplus) from any population of buyers, i.e., achieving first-degree price discrimination. If data can only be constructed from an existing data pool, this limits the seller's ability to customize, and achieving first-best revenue becomes generally impossible. However, we design a mechanism that achieves seller revenue at most $\log (\kappa)$ less than the first-best, where $\kappa$ is the condition number associated with the data matrix. As a corollary, the seller extracts the first-best revenue in the multi-armed bandits special case.
Paperid:830
Authors:Yu Li · Felix Dangel · Derek Tam · Colin Raffel
Abstract:
The diagonal of a model's Fisher Information Matrix (the "Fisher") has frequently been used as a way to measure parameter sensitivity. Typically, the Fisher is estimated by computing the squared gradient of the model's outputs with respect to its parameters, averaged over a few hundred or thousand examples — a process which incurs nontrivial computational costs. At the same time, adaptive gradient methods like the ubiquitous Adam optimizer compute a moving average of the squared gradient over the course of training. This paper therefore explores whether an approximation of the Fisher can be obtained "for free" by recycling the squared-gradient accumulator that has already been computed over the course of training. Through a comprehensive set of experiments covering five applications of the Fisher, we demonstrate that the "Squisher" (Squared-gradient accumulator as an approximation of the Fisher) consistently performs similarly to the Fisher while outperforming baseline methods. Additionally, we clarify the exact differences between the Squisher and the Fisher and provide empirical quantification of their respective impact.
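The two quantities being compared here can be sketched in a few lines, assuming only Adam's standard second-moment bookkeeping (the function names and the toy gradient stream are ours; the paper's actual models and tasks are not reproduced):

```python
import numpy as np

def adam_second_moment(grads, beta2=0.999):
    """Exponential moving average of squared gradients, as Adam
    maintains in its second-moment state, with bias correction."""
    v = np.zeros_like(grads[0])
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g ** 2
    return v / (1.0 - beta2 ** len(grads))

def empirical_fisher_diag(grads):
    """Plain average of per-example squared gradients: the usual
    estimate of the Fisher diagonal."""
    return np.mean(np.stack(grads) ** 2, axis=0)

rng = np.random.default_rng(0)
grads = [rng.standard_normal(5) for _ in range(2000)]
squisher = adam_second_moment(grads)   # "free" byproduct of training
fisher = empirical_fisher_diag(grads)  # dedicated extra pass
```

For a stationary gradient stream like this toy one, the two estimates track each other closely; the paper's point is that the accumulator version costs nothing extra because Adam already maintains it.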
Paperid:831
Authors:Shuyuan Wang · Philip D. Loewen · Michael Forbes · Bhushan Gopaluni · Wei Pan
Abstract:
Differentiable control promises end-to-end differentiability and adaptability, effectively combining the advantages of both model-free and model-based control approaches. However, the iterative Linear Quadratic Regulator (iLQR), despite being a powerful nonlinear controller, has seen limited research efforts dedicated to enhancing its differentiability. The scalability of differentiating through extended iterations and horizons poses significant challenges, hindering iLQR from being an effective differentiable controller. This paper introduces a framework that facilitates differentiation through iLQR, allowing it to serve as a trainable and differentiable module, either as or within a neural network. A novel aspect of this framework is the analytical solution that it provides for the gradient of an iLQR controller through implicit differentiation, which ensures a constant backward cost regardless of iteration, while producing an accurate gradient. We evaluate our framework on imitation tasks on standard control benchmarks. Our analytical method demonstrates superior computational performance, achieving up to \textbf{128x speedup} and a minimum of \textbf{21x speedup} compared to automatic differentiation. Our method also demonstrates superior learning performance (\textbf{$\mathbf{10^6}$x}) compared to traditional neural network policies and better model loss with differentiable controllers that lack exact analytical gradients. Furthermore, we integrate our module into a larger network with visual inputs to demonstrate the capacity of our method for high-dimensional, fully end-to-end tasks.
Paperid:832
Authors:Tuan Dam · Kishan Panaganti · Brahim Driss · Adam Wierman
Abstract:
Monte Carlo Tree Search (MCTS) is a powerful framework for solving complex decision-making problems, yet it often relies on the assumption that the simulator and the real-world dynamics are identical. Although this assumption helps achieve the success of MCTS in games like Chess, Go, and Shogi, real-world scenarios incur ambiguity due to their modeling mismatches in low-fidelity simulators. In this work, we present a new robust variant of MCTS that mitigates dynamical model ambiguities. Our algorithm addresses transition dynamics and reward distribution ambiguities to bridge the gap between simulation-based planning and real-world deployment. We incorporate a robust power mean backup operator and carefully designed exploration bonuses to ensure finite-sample convergence at every node in the search tree. We show that our algorithm achieves a convergence rate of $\mathcal{O}(n^{-1/2})$ for the value estimation at the root node, comparable to that of standard MCTS. Finally, we provide empirical evidence that our method achieves robust performance in planning problems even under significant ambiguity in the underlying dynamics.
Paperid:833
Authors:Bart Bussmann · Noa Nabeshima · Adam Karvonen · Neel Nanda
Abstract:
Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting neural networks by extracting the concepts represented in their activations. However, choosing the size of the SAE dictionary (i.e., the number of learned concepts) creates a tension: as dictionary size increases to capture more relevant concepts, sparsity incentivizes features to be split or absorbed into more specific features, leaving high-level features missing or warped. We introduce Matryoshka SAEs, a novel variant that addresses these issues by simultaneously training multiple nested dictionaries of increasing size, forcing the smaller dictionaries to independently reconstruct the inputs without using the larger dictionaries. This organizes features hierarchically: the smaller dictionaries learn general concepts, while the larger dictionaries learn more specific concepts, without an incentive to absorb the high-level features. We train Matryoshka SAEs on Gemma-2-2B and TinyStories and find superior performance on sparse probing and targeted concept erasure tasks, more disentangled concept representations, and reduced feature absorption. While there is a minor tradeoff with reconstruction performance, we believe Matryoshka SAEs are a superior alternative for practical tasks, as they enable training arbitrarily large SAEs while retaining interpretable features at different levels of abstraction.
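The nested-dictionary objective described above can be sketched as a sum of reconstruction losses, one per dictionary prefix. This is a toy reading with our own names and a random linear encoder/decoder pair, not the authors' SAE training code:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_atoms = 16, 8
D = rng.standard_normal((n_atoms, d))  # decoder dictionary (rows = atoms)
W = rng.standard_normal((n_atoms, d))  # encoder weights

def encode(x):
    return np.maximum(W @ x, 0.0)  # ReLU gives nonnegative (sparse-ish) codes

def matryoshka_loss(x, prefix_sizes):
    """Each nested prefix of the dictionary must reconstruct x on its
    own; the per-prefix reconstruction errors are summed."""
    c = encode(x)
    loss = 0.0
    for m in prefix_sizes:  # e.g. [2, 4, 8]: small dicts nested in larger ones
        x_hat = c[:m] @ D[:m]  # decode using only the first m atoms
        loss += np.mean((x - x_hat) ** 2)
    return loss

x = rng.standard_normal(d)
loss = matryoshka_loss(x, prefix_sizes=[2, 4, 8])
```

Because the loss includes terms for the small prefixes, the first atoms are pressured to carry general structure on their own, which is the hierarchy the abstract describes.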
Paperid:834
Authors:Abdel-Rahim Mezidi · Jordan Patracone · Saverio Salzo · Amaury Habrard · Massimiliano Pontil · Rémi Emonet · Marc Sebban
Abstract:
We present several advances on neural operators by viewing the action of operator layers as the minimizers of Bregman regularized optimization problems over Banach function spaces. The proposed framework allows interpreting the activation operators as Bregman proximity operators from dual to primal space. This novel viewpoint is general enough to recover classical neural operators as well as a new variant, coined Bregman neural operators, which includes the inverse activation operator and features the same expressivity as standard neural operators. Numerical experiments support the added benefits of the Bregman variant of Fourier neural operators for training deeper and more accurate models.
Paperid:835
Authors:Roman Abramov · Felix Steinbauer · Gjergji Kasneci
Abstract:
Transformers have achieved great success in numerous NLP tasks but continue to exhibit notable gaps in multi-step factual reasoning, especially when real-world knowledge is sparse. Recent advances in grokking have demonstrated that neural networks can transition from memorizing to perfectly generalizing once they detect underlying logical patterns -- yet these studies have primarily used small, synthetic tasks. In this paper, for the first time, we extend grokking to real-world factual data and address the challenge of dataset sparsity by augmenting existing knowledge graphs with carefully designed synthetic data to raise the ratio $\phi_r$ of inferred facts to atomic facts above the threshold required for grokking. Surprisingly, we find that even factually incorrect synthetic data can strengthen emergent reasoning circuits rather than degrade accuracy, as it forces the model to rely on relational structure rather than memorization. When evaluated on multi-hop reasoning benchmarks, our approach achieves up to 95--100\% accuracy on 2WikiMultiHopQA -- substantially improving over strong baselines and matching or exceeding current state-of-the-art results. We further provide an in-depth analysis of how increasing $\phi_r$ drives the formation of generalizing circuits inside Transformers. Our findings suggest that grokking-based data augmentation can unlock implicit multi-hop reasoning capabilities, opening the door to more robust and interpretable factual reasoning in large-scale language models.
Paperid:836
Authors:Jiao Huang · Qianli Xing · Jinglong Ji · Bo Yang
Abstract:
Graph neural networks have recently demonstrated remarkable performance in predicting material properties. Crystalline material data is manually encoded into graph representations, which can then be processed by these networks. Existing methods incorporate different attributes into constructing representations to satisfy the constraints arising from symmetries of the material structure. However, existing methods for obtaining graph representations are specific to certain constraints, which is ineffective when facing new constraints. In this work, we propose a code generation framework with multiple large language model agents to obtain representations, named Rep-CodeGen, with three iterative stages simulating an evolutionary algorithm. As the codes generated by large language models are often of low quality when the task is difficult, Rep-CodeGen incorporates the evaluation and evolution of codes, which can significantly improve the quality of generated codes. To the best of our knowledge, Rep-CodeGen is the first framework for automatically generating codes to obtain representations that can be used when facing new constraints. Furthermore, a type of representation from codes generated by our framework satisfies six constraints, with codes satisfying three constraints as bases. Extensive experiments on two real-world material datasets show that a property prediction method based on such graph representations achieves state-of-the-art performance in material property prediction tasks.
Paperid:837
Authors:Vinod Raman · Hilal Asi · Kunal Talwar
Abstract:
We design new differentially private algorithms for the problems of adversarial bandits and bandits with expert advice. For adversarial bandits, we give a simple and efficient conversion of any non-private bandit algorithm to a private bandit algorithm. Instantiating our conversion with existing non-private bandit algorithms gives a regret upper bound of $O\left(\frac{\sqrt{KT}}{\sqrt{\epsilon}}\right)$, improving upon the existing upper bound $O\left(\frac{\sqrt{KT \log(KT)}}{\epsilon}\right)$ for all $\epsilon \leq 1$. In particular, our algorithms allow for sublinear expected regret even when $\epsilon \leq \frac{1}{\sqrt{T}}$, establishing the first known separation between central and local differential privacy for this problem. For bandits with expert advice, we give the first differentially private algorithms, with expected regret $O\left(\frac{\sqrt{NT}}{\sqrt{\epsilon}}\right)$, $O\left(\frac{\sqrt{KT\log(N)}\log(KT)}{\epsilon}\right)$, and $\tilde{O}\left(\frac{N^{1/6}K^{1/2}T^{2/3}\log(NT)}{\epsilon ^{1/3}} + \frac{N^{1/2}\log(NT)}{\epsilon}\right)$, where $K$ and $N$ are the number of actions and experts respectively. These rates allow us to get sublinear regret for different combinations of small and large $K$, $N$, and $\epsilon$.
Paperid:838
Authors:Jianshu Zhang · Xuankun Rong · Kun He · Mang Ye
Abstract:
Generative replay (GR) has been extensively validated in continual learning as a mechanism to synthesize data and replay past knowledge to mitigate forgetting. By leveraging synthetic rather than real data for the replay, GR has been adopted in some federated continual learning (FCL) approaches to ensure the privacy of client-side data. While existing GR-based FCL approaches have introduced improvements, none of their enhancements specifically take into account the unique characteristics of federated learning settings. Beyond privacy constraints, what other fundamental aspects of federated learning should be explored in the context of FCL? In this work, we explore the potential benefits that come from emphasizing the role of clients throughout the process. We begin by highlighting two key observations: (a) Client Expertise Superiority, where clients, rather than the server, act as domain experts, and (b) Client Forgetting Variance, where heterogeneous data distributions across clients lead to varying levels of forgetting. Building on these insights, we propose CAN (Clients As Navigators), highlighting the pivotal role of clients in both data synthesis and data replay. Extensive evaluations demonstrate that this client-centric approach achieves state-of-the-art performance. Notably, it requires a smaller buffer size, reducing storage overhead and enhancing computational efficiency.
Paperid:839
Authors:Binghui Li · Yuanzhi Li
Abstract:
Similar to the surprising generalization performance of standard deep learning, deep networks trained by adversarial training also generalize well to unseen clean data (natural data). However, although adversarial training can achieve low robust training error, a significant robust generalization gap remains. We call this phenomenon Clean Generalization and Robust Overfitting (CGRO). In this work, we study the CGRO phenomenon in adversarial training from two views: representation complexity and training dynamics. Specifically, we consider a binary classification setting with $N$ separated training data points. First, we prove that, under the assumption that there exists a $\operatorname{poly}(D)$-size clean classifier (where $D$ is the data dimension), a ReLU net with only $O(N D)$ extra parameters is able to leverage robust memorization to achieve CGRO, while a robust classifier still requires exponential representation complexity in the worst case. Next, we focus on a structured-data case to analyze training dynamics, where we train a two-layer convolutional network with $O(N D)$ width against adversarial perturbation. We then show that a three-stage phase transition occurs during the learning process and the network provably converges to a robust memorization regime, which thereby results in CGRO. Finally, we empirically verify our theoretical analysis with experiments on real image recognition datasets.
Paperid:840
Authors:Fangchen Yu · Yanzhen Chen · Jiaxing Wei · Jianfeng Mao · Wenye Li · Qiang Sun
Abstract:
The Wasserstein distance is a widely used metric for measuring differences between distributions, but its super-cubic time complexity introduces substantial computational burdens. To mitigate this, the tree-Wasserstein distance (TWD) provides a linear-time approximation by leveraging a tree structure. However, existing TWD methods often compromise accuracy due to suboptimal tree structures and inadequately tuned edge weights. To address these limitations, we propose UltraTWD, a novel framework of unsupervised TWD methods that simultaneously optimize ultrametric trees and edge weights. UltraTWD focuses on constructing a rooted tree with an ultrametric that closely approximates the cost matrix. To achieve this, we develop algorithms based on minimum spanning trees, iterative projection, and gradient descent, ensuring high accuracy and strong performance in downstream tasks. Extensive experiments on document retrieval, ranking, and classification highlight its effectiveness, positioning UltraTWD as a practical solution for Wasserstein-distance-based applications.
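The linear-time evaluation that all TWD methods share is easy to sketch: once a rooted tree and its edge weights are fixed, the distance is a single bottom-up pass accumulating mass differences over subtrees. This is a generic sketch of the standard TWD formula that UltraTWD builds on, not the UltraTWD optimization itself; the parent-array tree encoding is an illustrative assumption.

```python
# Linear-time tree-Wasserstein distance:
#   TWD(mu, nu) = sum over edges e of  w_e * |mu(T_e) - nu(T_e)|,
# where T_e is the subtree hanging below edge e. The tree is given as a
# parent array rooted at node 0; weight[i] is the weight of the edge
# from node i up to its parent. Assumes parent[i] < i (BFS labelling),
# so a reverse sweep visits children before parents.

def tree_wasserstein(parent, weight, mu, nu):
    n = len(parent)
    diff = [mu[i] - nu[i] for i in range(n)]  # net mass at each node
    total = 0.0
    for i in range(n - 1, 0, -1):             # skip the root (node 0)
        total += weight[i] * abs(diff[i])     # imbalance crossing edge i
        diff[parent[i]] += diff[i]            # push subtree mass upward
    return total
```

Each edge contributes its weight times the net mass imbalance of the subtree below it, which is why the whole computation is linear in the number of nodes; the quality of the approximation then rests entirely on how the tree and weights are chosen, which is the part UltraTWD optimizes.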
Paperid:841
Authors:Mike Ranzinger · Greg Heinrich · Pavlo Molchanov · Bryan Catanzaro · Andrew Tao
Abstract:
The feature maps of vision encoders are fundamental to myriad modern AI tasks, ranging from core perception algorithms (e.g. semantic segmentation, object detection, depth perception, etc.) to modern multimodal understanding in vision-language models (VLMs). Currently, in computer vision, the frontier of general-purpose vision backbones is the Vision Transformer (ViT), typically trained using a contrastive loss (e.g. CLIP). A key problem with most off-the-shelf ViTs, particularly CLIP, is that these models operate at inflexibly low resolution. Most run at 224x224px, while the "high resolution" versions are around 378-448px, and remain inflexible. We introduce a novel method to coherently and cheaply upsample the feature maps of low-res vision encoders while picking up on fine-grained details that would otherwise be lost due to resolution. We demonstrate the effectiveness of this approach on core perception tasks as well as within agglomerative model (RADIO) training as a way of providing richer targets for distillation.
Paperid:842
Authors:Li Ding · Hao Zhang · Wenrui Dai · Chenglin Li · Weijia Lu · ZHIFEI YANG · xiaodong Zhang · Xiaofeng Ma · Junni Zou · Hongkai Xiong
Abstract:
Federated learning (FL) is greatly challenged by the communication bottleneck and limited computation on clients. Existing compression methods for FL cannot simultaneously reduce the uplink and downlink communication cost and mitigate the computation burden on clients. To address this problem, in this paper, we propose the first low-bit integerized federated learning (LBI-FL) framework that quantizes the weights, activations, and gradients to lower than INT8 precision to markedly reduce the communication and computational costs. Specifically, we achieve dynamic temporal bit-width allocation for weights, activations, and gradients along the training trajectory via reinforcement learning. An agent is trained to determine bit-width allocation by taking the current bit-width, training stage, and quantization loss as the state. The agent, efficiently trained on small-scale datasets, generalizes well to training varying network architectures on non-independent and identically distributed datasets. Furthermore, we prove that federated learning with gradient quantization achieves an equivalent convergence rate to FedAvg. The proposed LBI-FL can reduce the communication costs by 8 times compared to full-precision FL. Extensive experiments show that the proposed LBI-FL achieves a reduction of more than 50% BitOPs per client on average with less than 2% accuracy loss compared to low-bit training with INT8 precision.
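The quantize/dequantize primitive that such low-bit training builds on can be sketched in a few lines. This is a generic uniform symmetric quantizer; the per-tensor max scaling and the helper name are illustrative assumptions, not the paper's exact scheme, in which the bit-width per tensor and training stage would be chosen by the RL agent.

```python
# Uniform symmetric fake-quantization: map a float tensor to b-bit
# signed integers and back. Transmitting `q` (b bits/entry) plus one
# scale is what saves uplink/downlink bandwidth versus float32.
import numpy as np

def quantize_dequantize(x, bits):
    qmax = 2 ** (bits - 1) - 1                # 127 for INT8, 7 for INT4
    amax = np.max(np.abs(x))
    scale = amax / qmax if amax > 0 else 1.0  # per-tensor scale factor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, q.astype(np.float32) * scale    # integers + reconstruction
```

The round-trip error is at most half a quantization step (scale / 2), which is the quantity a bit-width controller trades off against the communication budget.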
Paperid:843
Authors:Masahiro Negishi · Thomas Gärtner · Pascal Welke
Abstract:
We investigate the distance function learned by message passing neural networks (MPNNs) in specific tasks, aiming to capture the functional distance between prediction targets that MPNNs implicitly learn. This contrasts with previous work, which links MPNN distances on arbitrary tasks to structural distances on graphs that ignore task-specific information. To address this gap, we distill the distance between MPNN embeddings into an interpretable graph distance. Our method uses optimal transport on the Weisfeiler Leman Labeling Tree (WILT), where the edge weights reveal subgraphs that strongly influence the distance between embeddings. This approach generalizes two well-known graph kernels and can be computed in linear time. Through extensive experiments, we demonstrate that MPNNs define the relative position of embeddings by focusing on a small set of subgraphs that are known to be functionally important in the domain.
Paperid:844
Authors:Joel Daniel Andersson · Boel Nelson · Rasmus Pagh · Lukas Retschmeier
Abstract:
Koufogiannis et al. (TPDP '15) showed a *gradual release* result for Laplace noise-based differentially private mechanisms: given an $\varepsilon$-DP release, a new release with privacy parameter $\varepsilon' > \varepsilon$ can be computed such that the combined privacy loss of both releases is $\varepsilon'$ and the distribution of the latter is the same as a single release with parameter $\varepsilon'$. Whitehouse et al. (NeurIPS '22) presented gradual release techniques for Gaussian noise. In this paper, we consider a more general *multiple release* setting in which an analyst is able to create private releases with different privacy parameters corresponding to different access/trust levels. These releases are determined one by one, with privacy parameters in arbitrary order. A release is *lossless* if having access to a subset $S$ of the releases has the same privacy guarantee as the least private release in $S$, and each release has the same distribution as a single release with the same privacy parameter. Our main focus is on lossless multiple release mechanisms based on Gaussian noise. We give a simple method for creating multiple releases with a short, self-contained analysis that does not require knowledge of the mathematics of Brownian motion. Finally, we consider how to efficiently do gradual release of sparse histograms, and present a mechanism with running time independent of the number of dimensions.
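For intuition, the simplest instance of the lossless idea for Gaussian noise, with two trust levels fixed in advance, is that the coarse release can be manufactured from the fine one by adding independent noise. Observing both then reveals no more than the fine release alone, since the coarse release is a post-processing of it, while each marginal matches a standalone release at its own noise level. This toy sketch covers only this two-level special case, not the paper's mechanism for arbitrarily ordered releases; all names are illustrative.

```python
# Two-level lossless Gaussian release of a scalar x: draw the accurate
# ("fine") release first, then derive the coarse one from it by adding
# independent noise so that Var(coarse) = sigma_coarse^2 exactly.
import random

def two_level_release(x, sigma_fine, sigma_coarse, rng=random.Random(0)):
    assert sigma_coarse > sigma_fine
    fine = x + rng.gauss(0.0, sigma_fine)
    # extra^2 + sigma_fine^2 = sigma_coarse^2, so the coarse marginal
    # is exactly a single Gaussian release at noise level sigma_coarse
    extra = (sigma_coarse ** 2 - sigma_fine ** 2) ** 0.5
    coarse = fine + rng.gauss(0.0, extra)
    return coarse, fine
```

A low-trust party receives only `coarse`; upgrading its access later means handing over `fine`, at no additional privacy cost beyond the fine release itself.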
Paperid:845
Authors:Eshika Saxena · Alberto Alfarano · Emily Wenger · Kristin Lauter
Abstract:
Recent work showed that ML-based attacks on Learning with Errors (LWE), a hard problem used in post-quantum cryptography, outperform classical algebraic attacks in certain settings. Although promising, ML attacks struggle to scale to more complex LWE settings. Prior work connected this issue to the difficulty of training ML models to do modular arithmetic, a core feature of the LWE problem. To address this, we develop techniques that significantly boost the performance of ML models on modular arithmetic tasks, enabling the models to sum up to $N=128$ elements modulo $q \le 974269$. Our core innovation is the use of custom training data distributions and a carefully designed loss function that better represents the problem structure. We apply an initial proof of concept of our techniques to LWE specifically and find that they allow recovery of 2x harder secrets than prior work. Our techniques also help ML models learn other well-studied problems better, including copy, associative recall, and parity, motivating further study.
Paperid:846
Authors:Awni Altabaa · John Lafferty
Abstract:
Relational reasoning is a central component of generally intelligent systems, enabling robust and data-efficient inductive generalization. Recent empirical evidence shows that many existing neural architectures, including Transformers, struggle with tasks requiring relational reasoning. In this work, we distinguish between two types of information: sensory information about the properties of individual objects, and relational information about the relationships between objects. While neural attention provides a powerful mechanism for controlling the flow of sensory information between objects, the Transformer lacks an explicit computational mechanism for routing and processing relational information. To address this limitation, we propose an architectural extension of the Transformer framework that we call the Dual Attention Transformer (DAT), featuring two distinct attention mechanisms: sensory attention for directing the flow of sensory information, and a novel relational attention mechanism for directing the flow of relational information. We empirically evaluate DAT on a diverse set of tasks ranging from synthetic relational benchmarks to complex real-world tasks such as language modeling and visual processing. Our results demonstrate that integrating explicit relational computational mechanisms into the Transformer architecture leads to significant performance gains in terms of data efficiency and parameter efficiency.
Paperid:847
Authors:Michael Sun · Orion Foo · Gang Liu · Wojciech Matusik · Jie Chen
Abstract:
Directed acyclic graphs (DAGs) are a class of graphs commonly used in practice, with examples that include electronic circuits, Bayesian networks, and neural architectures. While many effective encoders exist for DAGs, it remains challenging to decode them in a principled manner, because the nodes of a DAG can have many different topological orders. In this work, we propose a grammar-based approach to constructing a principled, compact and equivalent sequential representation of a DAG. Specifically, we view a graph as derivations over an unambiguous grammar, where the DAG corresponds to a unique sequence of production rules. Equivalently, the procedure to construct such a description can be viewed as a lossless compression of the data. Such a representation has many uses, including building a generative model for graph generation, learning a latent space for property prediction, and leveraging the sequence representational continuity for Bayesian Optimization over structured data.
Paperid:848
Authors:Jingfeng Wu · Peter Bartlett · Matus Telgarsky · Bin Yu
Abstract:
In overparameterized logistic regression, gradient descent (GD) iterates diverge in norm while converging in direction to the maximum $\ell_2$-margin solution, a phenomenon known as the implicit bias of GD. This work investigates additional regularization effects induced by early stopping in well-specified high-dimensional logistic regression. We first demonstrate that the excess logistic risk vanishes for early-stopped GD but diverges to infinity for GD iterates at convergence. This suggests that early-stopped GD is well-calibrated, whereas asymptotic GD is statistically inconsistent. Second, we show that to attain a small excess zero-one risk, polynomially many samples are sufficient for early-stopped GD, while exponentially many samples are necessary for any interpolating estimator, including asymptotic GD. This separation underscores the statistical benefits of early stopping in the overparameterized regime. Finally, we establish non-asymptotic bounds on the norm and angular differences between early-stopped GD and the $\ell_2$-regularized empirical risk minimizer, thereby connecting the implicit regularization of GD with explicit $\ell_2$-regularization.
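The norm divergence and directional convergence behind the implicit bias are easy to reproduce numerically: on linearly separable data, GD on the logistic loss never stops growing $\|w\|$, while $w/\|w\|$ stabilizes. The two-point dataset and step size below are arbitrary illustrative choices, not the paper's setting.

```python
# Gradient descent on the logistic loss (1/n) * sum log(1 + exp(-y w.x))
# in 2D, to illustrate norm divergence on separable data.
import math

def gd_logistic(xs, ys, steps, lr=0.5):
    w = [0.0, 0.0]
    for _ in range(steps):
        g = [0.0, 0.0]
        for x, y in zip(xs, ys):
            margin = y * (w[0] * x[0] + w[1] * x[1])
            s = -y / (1.0 + math.exp(margin))  # dloss/d(w.x) for one point
            g[0] += s * x[0]
            g[1] += s * x[1]
        w = [w[0] - lr * g[0] / len(xs), w[1] - lr * g[1] / len(xs)]
    return w
```

With the symmetric dataset used in the test below, the gradient always points along (1, 0.5), so the direction is exact from the first step while the norm keeps creeping upward roughly logarithmically; stopping the iteration early therefore acts as a genuine norm constraint.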
Paperid:849
Authors:Ayushi Mishra · Yang Bai · Priyadarshan Narayanasamy · Nakul Garg · Nirupam Roy
Abstract:
Integrating spatial context into large language models (LLMs) has the potential to revolutionize human-computer interaction, particularly in wearable devices. In this work, we present SING, a novel system architecture that incorporates spatial speech understanding into LLMs, enabling contextually aware and adaptive applications for wearable technologies. Our approach leverages microstructure-based spatial sensing to extract precise Direction of Arrival (DoA) information using a monaural microphone. To address the lack of existing datasets for microstructure-assisted speech recordings, we synthetically create a dataset called OmniTalk from the LibriSpeech dataset. This spatial information is fused with linguistic embeddings from OpenAI’s Whisper model, allowing each modality to learn complementary contextual representations. The fused embeddings are aligned with the input space of the LLaMA-3.2 3B model and fine-tuned with the lightweight adaptation technique LoRA to optimize for on-device processing. SING supports spatially aware automatic speech recognition (ASR), achieving a mean error of 25.72°, a substantial improvement over the 88.52° median error of existing work, with a word error rate (WER) of 5.3. SING also supports soundscaping, for example, inferring how many people are talking and their directions, with up to 5 people and a median DoA error of 16°. Our system demonstrates superior performance in spatial speech understanding while addressing the challenges of power efficiency, privacy, and hardware constraints, paving the way for advanced applications in augmented reality, accessibility, and immersive experiences.
Paperid:850
Authors:Shuo Wang · Shunyang Huang · Jinghui Yuan · Zhixiang Shen · zhao kang
Abstract:
Fusing heterogeneous information remains a persistent challenge in modern data analysis. While significant progress has been made, existing approaches often fail to account for the inherent heterogeneity of object patterns across different semantic spaces. To address this limitation, we propose the Cooperation of Experts (CoE) framework, which encodes multi-typed information into unified heterogeneous multiplex networks. By transcending modality and connection differences, CoE provides a powerful and flexible model for capturing the intricate structures of real-world complex data. In our framework, dedicated encoders act as domain-specific experts, each specializing in learning distinct relational patterns in specific semantic spaces. To enhance robustness and extract complementary knowledge, these experts collaborate through a novel large-margin mechanism supported by a tailored optimization strategy. Rigorous theoretical analyses guarantee the framework’s feasibility and stability, while extensive experiments across diverse benchmarks demonstrate its superior performance and broad applicability.
Paperid:851
Authors:Yuhang Li · Guoxu Zhou · Zhenhao Huang · Xinqi Chen · Yuning Qiu · Qibin Zhao
Abstract:
Class-Incremental Learning (CIL) has gained considerable attention due to its capacity to accommodate new classes during learning. Replay-based methods demonstrate state-of-the-art performance in CIL but suffer from high memory consumption, as a set of old exemplars must be stored for revisiting. To address this challenge, many memory-efficient replay methods have been developed by exploiting image compression techniques. However, the gains are often bittersweet when pixel-level compression methods are used. Here, we present a simple yet efficient approach that employs tensor decomposition to address these limitations. This method fully exploits the low intrinsic dimensionality and pixel correlation of images to achieve high compression efficiency while preserving sufficient discriminative information, significantly enhancing performance. We also introduce a hybrid exemplar selection strategy to improve the representativeness and diversity of stored exemplars. Extensive experiments across datasets with varying resolutions consistently demonstrate that our approach substantially boosts the performance of baseline methods, showcasing strong generalization and robustness.
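The storage saving behind this kind of method is easy to see in isolation: keep only the factors of a truncated decomposition instead of the raw pixels. The paper works with tensor decompositions of image tensors; the plain per-matrix truncated SVD below is a simplified stand-in with illustrative names, not the paper's method.

```python
# Store an exemplar as truncated SVD factors instead of raw pixels.
# For an m x n image kept at rank r, storage drops from m*n values
# to r*(m + n + 1).
import numpy as np

def compress(img, rank):
    U, s, Vt = np.linalg.svd(img, full_matrices=False)
    return U[:, :rank], s[:rank], Vt[:rank, :]   # the stored factors

def decompress(U, s, Vt):
    return (U * s) @ Vt                          # low-rank reconstruction
```

A genuinely low-rank image round-trips exactly; for natural images the retained singular directions keep the coarse, discriminative structure while the budget stays a fraction of the pixel count.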
Paperid:852
Authors:Mustapha Bounoua · Giulio Franzese · Pietro Michiardi
Abstract:
Multimodal data is a precious asset enabling a variety of downstream tasks in machine learning. However, real-world data collected across different modalities is often not paired, which poses a significant challenge to learning a joint distribution. A prominent approach to this modality coupling problem is Minimum Entropy Coupling (MEC), which seeks to minimize the joint entropy while satisfying constraints on the marginals. Existing approaches to the MEC problem focus on finite, discrete distributions, limiting their application to cases involving continuous data. In this work, we propose a novel method to solve the continuous MEC problem, using well-known generative diffusion models that learn to approximate and minimize the joint entropy through a cooperative scheme, while satisfying a relaxed version of the marginal constraints. We empirically demonstrate that our method, DDMEC, is general and can be easily used to address challenging tasks, including unsupervised single-cell multi-omics data alignment and unpaired image translation, outperforming specialized methods.
Paperid:853
Authors:Yipeng Zhang · Longlong Li · Kelin Xia
Abstract:
Graph Neural Networks (GNNs) have proven effective for learning from graph-structured data through their neighborhood-based message passing framework. Many hierarchical graph clustering pooling methods modify this framework by introducing clustering-based strategies, enabling the construction of more expressive and powerful models. However, all of these message passing frameworks rely heavily on the connectivity structure of graphs, limiting their ability to capture the rich geometric features inherent in geometric graphs. To address this, we propose Rhomboid Tiling (RT) clustering, a novel clustering method based on the rhomboid tiling structure, which performs clustering by leveraging the complex geometric information of the data and effectively extracts its higher-order geometric structures. Moreover, we design RTPool, a hierarchical graph clustering pooling model based on RT clustering for graph classification tasks. The proposed model demonstrates superior performance, outperforming 19 state-of-the-art competitors on all 7 benchmark datasets.
Paperid:854
Authors:Yiheng Li · Cunxin Fan · Chongjian GE · Seth Zhao · Chenran Li · Chenfeng Xu · Huaxiu Yao · Masayoshi Tomizuka · Bolei Zhou · Chen Tang · Mingyu Ding · Wei Zhan
Abstract:
Language models uncover unprecedented abilities in analyzing driving scenarios, owing to the vast knowledge accumulated from text-based pre-training. Naturally, they should particularly excel in analyzing rule-based interactions, such as those triggered by traffic laws, which are well documented in texts. However, such interaction analysis remains underexplored due to the lack of dedicated language datasets that address it. Therefore, we propose Waymo Open Motion Dataset-Reasoning (WOMD-Reasoning), a comprehensive large-scale Q&A dataset built on WOMD that focuses on describing and reasoning about traffic-rule-induced interactions in driving scenarios. WOMD-Reasoning is also by far the largest multi-modal Q&A dataset, with 3 million Q&As on real-world driving scenarios, covering a wide range of driving topics from map descriptions and motion status descriptions to narratives and analyses of agents' interactions, behaviors, and intentions. To showcase the applications of WOMD-Reasoning, we design Motion-LLaVA, a motion-language model fine-tuned on WOMD-Reasoning. Quantitative and qualitative evaluations are performed on the WOMD-Reasoning dataset as well as on the outputs of Motion-LLaVA, supporting the data quality and wide applicability of WOMD-Reasoning in interaction prediction, traffic rule compliance planning, etc. The dataset, its vision modal extension, and the codes & prompts to build it are available in the supplementary materials.
Paperid:855
Authors:Ayman Chaouki · Jesse Read · Albert Bifet
Abstract:
Decision Tree (DT) Learning is a fundamental problem in Interpretable Machine Learning, yet it poses a formidable optimisation challenge. Practical algorithms have recently emerged, primarily leveraging Dynamic Programming and Branch \& Bound. However, most of these approaches rely on a Depth-First-Search strategy, which is inefficient when searching for DTs at high depths and requires the definition of a maximum-depth hyperparameter. Best-First-Search was also employed by other methods to circumvent these issues. The downside of this strategy is its higher memory consumption; as such, it has to be designed in a fully efficient manner that takes full advantage of the problem's structure. We formulate the problem as an AND/OR graph search, which we solve with a novel AO*-type algorithm called Branches. We prove both optimality and complexity guarantees for Branches and show that it is more efficient than the state of the art both theoretically and on a variety of experiments. Furthermore, Branches supports non-binary features, unlike the other methods; we show that this property can further induce larger gains in computational efficiency.
Paperid:856
Authors:Junyou Zhu · Langzhou He · Chao Gao · Dongpeng Hou · Zhen Su · Philip Yu · Juergen Kurths · Frank Hellmann
Abstract:
Diffusion probabilistic models (DPMs) have recently demonstrated impressive generative capabilities, and emerging evidence suggests that their sample reconstruction ability can yield meaningful representations for recognition tasks. In this paper, we demonstrate that the objectives underlying generation and representation learning are not perfectly aligned. Through spectral analysis, we find that minimizing the mean squared error (MSE) between the original graph and its reconstructed counterpart does not necessarily optimize representations for downstream tasks. Instead, focusing on reconstructing a small subset of features, specifically those capturing global information, proves more effective for learning powerful representations. Motivated by these insights, we propose a novel framework, the Smooth Diffusion Model for Graphs (SDMG), which introduces a multi-scale smoothing loss and low-frequency information encoders to promote the recovery of global, low-frequency details while suppressing irrelevant high-frequency noise. Extensive experiments validate the effectiveness of our method, suggesting a promising direction for advancing diffusion models in graph representation learning. Our implementation is available at: https://anonymous.4open.science/r/SDMG-7FCD/.
Paperid:857
Authors:Lucy Xiaoyang Shi · brian ichter · Michael Equi · Liyiming Ke · Karl Pertsch · Quan Vuong · James Tanner · Anna Walling · Haohuan Wang · Niccolo Fusai · Adrian Li · Danny Driess · Lachy Groom · Sergey Levine · Chelsea Finn
Abstract:
Generalist robots that can perform a range of different tasks in open-world settings must be able to not only reason about the steps needed to accomplish their goals, but also process complex instructions, prompts, and even feedback during task execution. Intricate instructions (e.g., "Could you make me a vegetarian sandwich?" or "I don't like that one") require not just the ability to physically perform the individual steps, but the ability to situate complex commands and feedback in the physical world. In this work, we describe a system that uses vision-language models in a hierarchical structure, first reasoning over complex prompts and user feedback to deduce the most appropriate next step to fulfill the task, and then performing that step with low-level actions. In contrast to direct instruction-following methods that can fulfill simple commands ("pick up the cup"), our system can reason through complex prompts and incorporate situated feedback during task execution ("that's not trash"). We evaluate our system across three robotic platforms, including single-arm, dual-arm, and dual-arm mobile robots, demonstrating its ability to handle tasks such as cleaning messy tables, making sandwiches, and grocery shopping. Videos are available at https://hi-robot-vla.github.io/
Paperid:858
Authors:Yang Xin · Xingrun Li · Heng Chang · Yang jinze · xihong yang · Shengyu Tao · Maiko Shigeno · Ningkang Chang · Junfeng Wang · Dawei Yin · Erxue Min
Abstract:
Recommender systems are increasingly spreading to different areas like e-commerce or video streaming to alleviate information overload. One of the most fundamental methods for recommendation is Collaborative Filtering (CF), which leverages historical user-item interactions to infer user preferences. In recent years, Graph Neural Networks (GNNs) have been extensively studied to capture graph structures in CF tasks. Despite this remarkable progress, local structure modeling and embedding distortion still remain two notable limitations in the majority of GNN-based CF methods. Therefore, in this paper, we propose a novel Hyperbolic Graph Transformer architecture to tackle the long-tail problem in CF tasks. Specifically, the proposed framework comprises two essential modules: 1) Local Hyperbolic Graph Convolutional Network (LHGCN), which performs graph convolution entirely in the hyperbolic manifold and captures the local structure of each node; 2) Hyperbolic Transformer, which is composed of hyperbolic cross-attention mechanisms to capture global information. Furthermore, to make it feasible on large-scale data, we introduce an unbiased approximation of the cross-attention for linear computational complexity, with a theoretical guarantee on approximation errors. Empirical experiments demonstrate that our proposed model outperforms leading collaborative filtering methods and significantly mitigates the long-tail issue in CF tasks. Our implementations are available in \url{https://anonymous.4open.science/r/Hgformer-2AE0}.
Paperid:859
Authors:Andrea Bertazzi · Tim Johnston · Gareth Roberts · Alain Oliviero Durmus
Abstract:
This paper aims to provide differential privacy (DP) guarantees for Markov chain Monte Carlo (MCMC) algorithms. In the first part, we establish DP guarantees on samples output by MCMC algorithms as well as Monte Carlo estimators associated with these methods under assumptions on the convergence properties of the underlying Markov chain. In particular, our results highlight the critical condition of ensuring the target distribution is differentially private itself. In the second part, we specialise our analysis to the unadjusted Langevin algorithm and stochastic gradient Langevin dynamics and establish guarantees on their (Rényi) DP. To this end, we develop a novel methodology based on Girsanov's theorem combined with a perturbation trick to obtain bounds for an unbounded domain and in a non-convex setting. We establish: (i) uniform-in-$n$ privacy guarantees when the state of the chain after $n$ iterations is released, and (ii) bounds on the privacy of the entire chain trajectory. These findings provide concrete guidelines for privacy-preserving MCMC.
Paperid:860
Authors:Thibaut Boissin · Franck Mamalet · Thomas Fel · Agustin Picard · Thomas Massena · Mathieu Serrurier
Abstract:
Orthogonal convolutional layers are valuable components in multiple areas of machine learning, such as adversarial robustness, normalizing flows, GANs, and Lipschitz-constrained models. Their ability to preserve norms and ensure stable gradient propagation makes them valuable for a large range of problems. Despite their promise, deploying orthogonal convolutions in large-scale applications remains a significant challenge due to computational overhead and limited support for modern features like strides, dilations, group convolutions, and transposed convolutions. In this paper, we introduce \textbf{AOC} (Adaptive Orthogonal Convolution), a scalable method for constructing orthogonal convolutions, effectively overcoming these limitations. This advancement unlocks the construction of architectures that were previously considered impractical. We demonstrate through our experiments that our method produces expressive models that become increasingly efficient as they scale. To foster further advancement, we provide an open-source library implementing this method, available at https://anonymous.4open.science/r/orthogonium-DE07.
Paperid:861
Authors:Shreshth Saini · Ru-Ling Liao · Yan Ye · Alan Bovik
Abstract:
Despite recent advancements in latent diffusion models that generate high-dimensional image data and perform various downstream tasks, there has been little exploration into perceptual consistency within these models on the task of No-Reference Image Quality Assessment (NR-IQA). In this paper, we hypothesize that latent diffusion models implicitly exhibit perceptually consistent local regions within the data manifold. We leverage this insight to guide on-manifold sampling using perceptual features and input measurements. Specifically, we propose \textbf{P}erceptual \textbf{M}anifold \textbf{G}uidance (PMG), an algorithm that utilizes pretrained latent diffusion models and perceptual quality features to obtain perceptually consistent multi-scale and multi-timestep feature maps from the denoising U-Net. We empirically demonstrate that these hyperfeatures exhibit high correlation with human perception in IQA tasks. Our method can be applied to any existing pretrained latent diffusion model and is straightforward to integrate. To the best of our knowledge, this paper is the first work on guiding diffusion models with perceptual features for NR-IQA. Extensive experiments on IQA datasets show that our method, LGDM, achieves state-of-the-art performance, underscoring the superior zero-shot generalization capabilities of diffusion models for NR-IQA tasks. The source code will be made publicly available upon publication.
Paperid:862
Authors:Adrian Müller · Jon Schneider · EFSTRATIOS PANTELEIMON SKOULAKIS · Luca Viano · Volkan Cevher
Abstract:
In this paper, we investigate the existence of online learning algorithms with bandit feedback that simultaneously guarantee $O(1)$ regret compared to a given comparator strategy and $O(\sqrt{T})$ regret compared to the best strategy in hindsight, where $T$ is the number of rounds. We provide the first affirmative answer to this question. In the context of symmetric zero-sum games, both in normal and extensive form, we show that our results allow us to risk at most $O(1)$ loss while being able to gain $\Omega(T)$ from exploitable opponents, thereby combining the benefits of both no-regret algorithms and minimax play.
Paperid:863
Authors:Jisu Han · Jaemin Na · Wonjun Hwang
Abstract:
Test-time adaptation aims to adapt to realistic environments in an online manner by learning during test time. Entropy minimization has emerged as a principal strategy for test-time adaptation due to its efficiency and adaptability. Nevertheless, it remains underexplored in continual test-time adaptation, where stability is more important. We observe that the entropy minimization method often suffers from model collapse, where the model converges to predicting a single class for all images due to a trivial solution. We propose ranked entropy minimization to mitigate the stability problem of the entropy minimization method and extend its applicability to continual scenarios. Our approach explicitly structures the prediction difficulty through a progressive masking strategy. Specifically, it gradually aligns the model's probability distributions across different levels of prediction difficulty while preserving the rank order of entropy. The proposed method is extensively evaluated across various benchmarks, demonstrating its effectiveness through empirical results.
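As background for this abstract, the entropy-minimization objective underlying most test-time adaptation methods can be written in a few lines. The sketch below (plain numpy; the batch values are made up, and this is the generic objective, not the paper's ranked variant) also shows why its unconstrained minimizer invites the single-class collapse the abstract describes.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy_loss(logits):
    # Mean Shannon entropy of the softmax predictions; test-time entropy
    # minimization updates the model to reduce this on unlabeled test batches.
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

peaked = np.array([[8.0, 0.0, 0.0], [0.0, 9.0, 0.0]])  # confident batch
uniform_batch = np.zeros((2, 3))                        # maximally uncertain batch
# Confident predictions have low entropy; the degenerate minimizer is a model
# that always predicts one class, i.e. the collapse the abstract warns about.
assert entropy_loss(peaked) < entropy_loss(uniform_batch)
```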
Paperid:864
Authors:Ke Ren · Angelos Georghiou · Peyman Mohajerin Esfahani
Abstract:
We study inverse optimization (IO) where the goal is to use a parametric optimization program as the hypothesis class to infer relationships between input-decision pairs. Most of the literature focuses on learning only the objective function, as learning the constraint function (i.e., feasible regions) leads to nonconvex training programs. Motivated by this, we focus on learning feasible regions for known linear objectives, and introduce two training losses along with a hypothesis class to parameterize the constraint function. Our hypothesis class surpasses the previous objective-only method by naturally capturing discontinuous behaviors in input-decision pairs. We introduce a customized block coordinate descent algorithm with a smoothing technique to solve the training problems, while for further restricted hypothesis classes, we reformulate the training optimization as a tractable convex program or mixed integer linear program. Synthetic experiments and two power system applications including comparisons with state-of-the-art approaches showcase and validate the proposed approach.
Paperid:865
Authors:Matthew Chen · Josh Engels · Max Tegmark
Abstract:
Sparse autoencoders (SAEs) aim to decompose language model representations into a sparse set of linear latent vectors. Recent works have improved SAEs using language model gradients, but these techniques require many expensive backward passes during training and still cause a significant increase in cross entropy loss when SAE reconstructions are inserted into the model. In this work, we improve on these limitations by taking a fundamentally different approach: we use low-rank adaptation (LoRA) to finetune the *language model itself* around a previously trained SAE. We analyze our method across SAE sparsity, SAE width, language model size, LoRA rank, and model layer on the Gemma Scope family of SAEs. In these settings, our method reduces the cross entropy loss gap by 30% - 55% when SAEs are inserted during the forward pass. We also find that compared to end-to-end (e2e) SAEs, our approach achieves the same downstream cross entropy loss 3$\times$ to 20$\times$ faster on Gemma-2-2B and 2$\times$ to 10$\times$ faster on Llama-3.2-1B. We further show that our technique improves downstream metrics and can adapt multiple SAEs at once. Our results demonstrate that improving model interpretability is not limited to post-hoc SAE training; Pareto improvements can also be achieved by directly optimizing the model itself.
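The LoRA parameterization the abstract relies on is standard and easy to state: the frozen weight $W$ is augmented with a trainable low-rank product $(\alpha/r)BA$. The numpy sketch below (illustrative sizes, not Gemma's; this shows the generic mechanism, not the paper's SAE-specific recipe) verifies its two defining properties: zero-initialization makes it an exact no-op, and the learned update has rank at most $r$.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 16, 12, 4, 8   # illustrative sizes

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection, zero init

def lora_forward(x, W, A, B, alpha, r):
    # y = W x + (alpha / r) * B A x, the standard LoRA parameterization.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B initialized to zero, LoRA starts as an exact no-op around the base model.
assert np.allclose(lora_forward(x, W, A, B, alpha, r), W @ x)

# After training, the effective weight update has rank at most r.
B = rng.normal(size=(d_out, r))
delta = (alpha / r) * B @ A
assert np.linalg.matrix_rank(delta) <= r
```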
Paperid:866
Authors:Li · Qianqian Xu · Zhiyong Yang · Zitai Wang · Linchao Zhang · Xiaochun Cao · Qingming Huang
Abstract:
Real-world datasets often follow a long-tailed distribution, making generalization to tail classes difficult. Recent methods resorted to long-tail variants of Sharpness-Aware Minimization (SAM), such as ImbSAM and CC-SAM, to improve generalization by flattening the loss landscape. However, these attempts face a trade-off between computational efficiency and control over the loss landscape. On the one hand, ImbSAM is efficient but offers only coarse control as it excludes head classes from the SAM process. On the other hand, CC-SAM provides fine-grained control through class-dependent perturbations but at the cost of efficiency due to multiple backpropagations. To resolve this dilemma, we introduce Focal-SAM, which assigns different penalties to class-wise sharpness, achieving fine-grained control without extra backpropagations, thus maintaining efficiency. Furthermore, we theoretically analyze Focal-SAM's generalization ability and derive a sharper generalization bound. Extensive experiments on both traditional and foundation models validate the effectiveness of Focal-SAM.
Paperid:867
Authors:Yusheng Zhao · Qixin Zhang · Xiao Luo · Junyu Luo · Wei Ju · Zhiping Xiao · Ming Zhang
Abstract:
Test-time adaptation aims to adapt a well-trained model using test data only, without accessing training data. While existing works on test-time adaptation primarily focus on Euclidean data, research on non-Euclidean graph data remains scarce. Prevalent graph neural network methods could encounter serious performance degradation in the face of test-time domain shifts. In this work, we propose a novel method named Adaptive Subgraph-based Selection and Regularized Prototype Supervision (ASSESS) for reliable test-time adaptation on graphs. Specifically, to achieve flexible selection of reliable test graphs, ASSESS adopts an adaptive selection strategy based on fine-grained individual-level subgraph mutual information. Moreover, to utilize the information from both training and test graphs, ASSESS constructs semantic prototypes from the well-trained model as prior knowledge from the unknown training graphs and optimizes the posterior given the unlabeled test graphs. We also provide a theoretical analysis of the proposed algorithm. Extensive experiments verify the effectiveness of ASSESS.
Paperid:868
Authors:Zeyu Gan · Yun Liao · Yong Liu
Abstract:
Test-time scaling, often referred to as slow-thinking, has been demonstrated to enhance multi-step reasoning in large language models (LLMs). However, despite its widespread utilization, the mechanisms underlying slow-thinking methods remain poorly understood. This paper explores the mechanisms of external slow-thinking from a theoretical standpoint. We begin by examining the snowball error effect within the LLM reasoning process and connect it to the likelihood of correct reasoning using information theory. Building on this, we show that external slow-thinking methods can be interpreted as strategies to mitigate the error probability. We further provide a comparative analysis of popular external slow-thinking approaches, ranging from simple to complex, highlighting their differences and interrelationships. Our findings suggest that the efficacy of these methods is not primarily determined by the specific framework employed, and that expanding the search scope or the model's internal reasoning capacity may yield more sustained improvements in the long term. We open-source our code through an anonymous repository https://anonymous.4open.science/r/Snowball-Errors-and-Probability.
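The snowball error effect the abstract references has a simple toy form: if each reasoning step errs independently, the chance the whole chain survives decays geometrically with length, and sampling more chains (one reading of "expanding the search scope") mitigates it. This is an illustrative independence model, not the paper's information-theoretic analysis.

```python
def p_correct_chain(per_step_error, steps):
    # Probability an entire chain is correct if each of `steps` steps
    # independently errs with probability `per_step_error`.
    return (1.0 - per_step_error) ** steps

def p_correct_best_of(per_step_error, steps, samples):
    # Best-of-n succeeds if any of `samples` independent chains is correct.
    p = p_correct_chain(per_step_error, steps)
    return 1.0 - (1.0 - p) ** samples

# Error snowballs with chain length...
assert p_correct_chain(0.05, 20) < p_correct_chain(0.05, 5)
# ...and wider search mitigates it.
assert p_correct_best_of(0.05, 20, 8) > p_correct_chain(0.05, 20)
```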
Paperid:869
Authors:Samira Abnar · Harshay Shah · Dan Busbridge · Alaaeldin Ali · Joshua M Susskind · Vimal Thilak
Abstract:
Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity can be primarily defined by two dimensions: the number of model parameters and the compute per example. While scaling typically involves increasing both, the precise interplay between these factors and their combined contribution to overall capacity remains not fully understood. We explore this relationship in the context of sparse Mixture-of-Experts models (MoEs), which allow scaling the number of parameters without proportionally increasing the FLOPs per example. We investigate how varying the sparsity level, i.e., the ratio of non-active to total parameters, affects model performance in terms of both pretraining and downstream performance. We find that under different constraints (e.g. parameter size and total training compute), there is an optimal level of sparsity that improves both training efficiency and model performance. These results provide a better understanding of the impact of sparsity in scaling laws for MoEs and complement existing works in this area, offering insights for designing more efficient architectures.
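The sparsity level the abstract studies, the ratio of non-active to total parameters, reduces to simple accounting for a routed MoE layer. A minimal sketch (all sizes below are made up for illustration):

```python
def moe_stats(n_experts, top_k, expert_params, shared_params):
    # Total parameters of a sparse MoE layer vs. the parameters that are
    # active per token (a rough proxy for FLOPs per example).
    total = shared_params + n_experts * expert_params
    active = shared_params + top_k * expert_params
    sparsity = (total - active) / total  # ratio of non-active to total params
    return total, active, sparsity

# 64 experts, 2 routed per token, 1M params per expert, 0.5M shared.
total, active, sparsity = moe_stats(64, 2, 1_000_000, 500_000)
assert total == 64_500_000 and active == 2_500_000
assert 0.96 < sparsity < 0.97  # parameters scale 26x faster than per-token compute
```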
Paperid:870
Authors:Jinyu Cai · Yunhe Zhang · Jicong Fan
Abstract:
Identifying anomalous graphs is essential in real-world scenarios such as molecular and social network analysis, yet anomalous samples are generally scarce or unavailable. This paper proposes a Self-Discriminative Modeling (SDM) framework that trains a deep neural network only on normal graphs to detect anomalous graphs. The neural network simultaneously learns to construct pseudo-anomalous graphs from normal graphs and learns an anomaly detector to recognize these pseudo-anomalous graphs. As a result, these pseudo-anomalous graphs interpolate between normal graphs and real anomalous graphs, which leads to a reliable decision boundary of anomaly detection. In this framework, we develop three algorithms with different computational efficiencies and stabilities for anomalous graph detection. Extensive experiments on 12 different graph benchmarks demonstrate that the three variants of SDM consistently outperform the state-of-the-art GLAD baselines. The success of our methods stems from the integration of the discriminative classifier and the well-posed pseudo-anomalous graphs, which provides new insights for graph-level anomaly detection.
Paperid:871
Authors:Jiaqi Yan · Changping Wang · De Ma · Huajin Tang · Qian Zheng · Gang Pan
Abstract:
Spiking Neural Networks (SNNs) are considered promising energy-efficient models due to their spatial-temporal information processing capability. Existing work has demonstrated that SNNs exhibit temporal heterogeneity, which leads to diverse outputs of SNNs at different time steps and has the potential to enhance their performance. Although SNNs obtained by direct training methods achieve state-of-the-art performance, current methods introduce limited temporal heterogeneity through the dynamics of spiking neurons and do not improve temporal heterogeneity from the perspective of gradients. In this paper, we first conclude that the diversity of the partial derivative of the loss function with respect to the logits in current methods is limited. This results in insufficient temporal heterogeneity and further limits the performance. Based on the above analysis, we propose a temporal model calibration method, which can be seen as a gradient rescaling mechanism of the cross-entropy loss at each time step. Experimental results show that our method can improve the temporal heterogeneity and performance of SNNs. In particular, our method achieves state-of-the-art accuracy on ImageNet, DVS-CIFAR10, and N-Caltech101, with respective accuracies of 85.83\%, 87.63\%, and 86.04\%.
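The quantity the abstract analyzes, the partial derivative of the cross-entropy loss with respect to the per-time-step logits, is the familiar softmax-minus-onehot term, and "gradient rescaling" amounts to a per-step weight on it. The numpy sketch below illustrates that idea with invented weights; it is not the paper's exact calibration scheme.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ce_grad_per_step(logits_t, onehot, weights):
    # Gradient of cross-entropy w.r.t. the logits at each time step,
    # rescaled by a per-step weight w_t. Uniform weights recover the usual
    # time-averaged loss; non-uniform weights diversify per-step gradients.
    # logits_t: (T, C), onehot: (C,), weights: (T,).
    p = softmax(logits_t)
    return weights[:, None] * (p - onehot[None, :])

T, C = 4, 3
logits = np.tile(np.array([2.0, 0.0, -1.0]), (T, 1))  # same logits every step
y = np.array([1.0, 0.0, 0.0])

g_uniform = ce_grad_per_step(logits, y, np.full(T, 1.0 / T))
# Identical logits at every step give identical (low-diversity) gradients...
assert np.allclose(g_uniform[0], g_uniform[1])
# ...whereas per-step rescaling makes them differ across time steps.
g_skewed = ce_grad_per_step(logits, y, np.array([0.1, 0.2, 0.3, 0.4]))
assert not np.allclose(g_skewed[0], g_skewed[1])
```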
Paperid:872
Authors:Oliver Sieberling · Denis Kuznedelev · Eldar Kurtic · Dan Alistarh
Abstract:
The high computational costs of large language models (LLMs) have led to a flurry of research on LLM compression, via methods such as quantization, sparsification, or structured pruning. A new frontier in this area is given by dynamic, non-uniform compression methods, which adjust the compression levels (e.g., sparsity) per-block or even per-layer in order to minimize accuracy loss, while guaranteeing a global compression threshold. Yet, current methods rely on estimating the "importance" of a given layer, implicitly assuming that layers contribute independently to the overall compression error. We begin from the motivating observation that this independence assumption does not generally hold for LLM compression: pruning a model further may even significantly recover performance. To address this, we propose EvoPress, a novel evolutionary framework for dynamic LLM compression. By formulating dynamic compression as a general optimization problem, EvoPress identifies optimal compression profiles in a highly efficient manner, and generalizes across diverse models and compression techniques. Via EvoPress, we achieve state-of-the-art performance for dynamic compression of Llama, Mistral, and Phi models, setting new benchmarks for structural pruning (block/layer dropping), unstructured sparsity, and quantization with dynamic bitwidths.
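The search problem the abstract describes, choosing a per-layer compression profile under a global budget when layers interact, can be sketched with a tiny (1+1)-style evolutionary loop. Everything below is a toy: the quadratic loss model with a coupling term stands in for real accuracy loss and merely mimics why independent per-layer importance scores mislead; it is not EvoPress itself.

```python
import random

random.seed(0)
LEVELS = [0.0, 0.25, 0.5, 0.75]  # candidate per-layer sparsity levels
N_LAYERS, BUDGET = 6, 0.5        # mean sparsity must reach the budget

def loss(profile):
    # Toy accuracy-loss model with layer *interactions* (the coupling term),
    # so layers do not contribute independently to total error.
    base = sum((i + 1) * s * s for i, s in enumerate(profile))
    coupling = sum(profile[i] * profile[i + 1] for i in range(len(profile) - 1))
    return base + 2.0 * coupling

def feasible(profile):
    return sum(profile) / len(profile) >= BUDGET

def mutate(profile):
    child = list(profile)
    child[random.randrange(N_LAYERS)] = random.choice(LEVELS)
    return child

# Evolutionary search: keep the best feasible profile seen so far.
best = [0.5] * N_LAYERS
for _ in range(2000):
    child = mutate(best)
    if feasible(child) and loss(child) < loss(best):
        best = child

assert feasible(best) and loss(best) <= loss([0.5] * N_LAYERS)
```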
Paperid:873
Authors:Casper Gyurik · Riccardo Molteni · Vedran Dunjko
Abstract:
In recent times, there have been major developments in two distinct yet connected domains of quantum information. On the one hand, substantial progress has been made in so-called randomized measurement protocols. Here, a number of properties of unknown quantum states can be deduced from surprisingly few measurement outcomes, using schemes such as classical shadows. On the other hand, significant progress has been made in quantum machine learning. For example, exponential advantages have been proven when the data consists of quantum states and quantum algorithms can coherently measure multiple copies of input states. In this work, we aim to understand the implications and limitations of combining randomized measurement protocols with quantum machine learning, although the implications are broader. Specifically, we investigate quantum machine learning algorithms that, when dealing with quantum data, can either process it entirely using quantum methods or measure the input data through a fixed measurement scheme and utilize the resulting classical information. We prove limitations for the general class of quantum machine learning algorithms that use fixed measurement schemes on the input quantum states. Our results have several implications. From the perspective of randomized measurement procedures, we show limitations of measure-first protocols in the average case, improving on the state-of-the-art which only focuses on worst-case scenarios. Additionally, previous lower bounds were only known for physically unrealizable states. We improve upon this by employing quantum pseudorandom functions to prove that a learning separation also exists when dealing with physically realizable states, which may be encountered in experiments. From a machine learning perspective, our results are crucial for defining a physically meaningful task that shows fully quantum machine learning processing is not only more efficient but also necessary for solving certain problems.
The tasks at hand are also realistic, as the algorithms and proven separations hold when working with efficiently preparable states and remain robust in the presence of measurement and preparation errors.
Paperid:874
Authors:Zi Liang · Haibo Hu · Qingqing Ye · Yaxin XIAO · RongHua Li
Abstract:
Low-rank adaptation (LoRA) has emerged as a prominent technique for fine-tuning large language models (LLMs) thanks to its superb efficiency gains over previous methods. While extensive studies have examined the performance and structural properties of LoRA, its behavior under training-time attacks remains underexplored, posing significant security risks. In this paper, we theoretically investigate the security implications of LoRA's low-rank structure during fine-tuning, in the context of its robustness against data poisoning and backdoor attacks. We propose an analytical framework that models LoRA’s training dynamics, employs the neural tangent kernel to simplify the analysis of the training process, and applies information theory to establish connections between LoRA's low-rank structure and its vulnerability against training-time attacks. Our analysis indicates that LoRA exhibits better robustness to backdoor attacks than full fine-tuning, while becoming more vulnerable to untargeted data poisoning due to its over-simplified information geometry. Extensive experimental evaluations have corroborated our theoretical findings.
Paperid:875
Authors:Jinho Chang · Jong Chul YE
Abstract:
With the emergence of diffusion models as a frontline generative model, many researchers have proposed molecule generation techniques with conditional diffusion models. However, the unavoidable discreteness of a molecule makes it difficult for a diffusion model to connect raw data with highly complex conditions like natural language. To address this, here we present a novel latent diffusion model dubbed LDMol for text-conditioned molecule generation. By recognizing that the suitable latent space design is the key to the diffusion model performance, we employ a contrastive learning strategy to extract a novel feature space from text data that embeds the unique characteristics of the molecule structure. Experiments show that LDMol outperforms the existing autoregressive baselines on the text-to-molecule generation benchmark, being one of the first diffusion models that outperforms autoregressive models in textual data generation with a better choice of the latent domain. Furthermore, we show that LDMol can be applied to downstream tasks such as molecule-to-text retrieval and text-guided molecule editing, demonstrating its versatility as a diffusion model.
Paperid:876
Authors:Yuhui Ding · Thomas Hofmann
Abstract:
Equivariant diffusion models have achieved impressive performance in 3D molecule generation. These models incorporate the Euclidean symmetry of 3D molecules by utilizing an SE(3)-equivariant denoising network. However, the specialized equivariant architecture limits the scalability and efficiency of diffusion models. In this paper, we propose an approach to relax the equivariance constraint. Specifically, our approach learns a sample-dependent SO(3) transformation for each molecule to construct an aligned latent space. A non-equivariant diffusion model is then trained on the aligned representations. Experimental results demonstrate that our approach performs significantly better than previously reported non-equivariant models. It yields sample quality comparable to state-of-the-art equivariant diffusion models and offers improved training and sampling efficiency.
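One classical way to build a sample-dependent SO(3) alignment of the kind the abstract mentions is the Kabsch algorithm, which finds the rotation best aligning one point set onto another via an SVD. The paper *learns* its transformation, so the numpy sketch below is only an analogy for the alignment idea, with randomly generated points.

```python
import numpy as np

def kabsch_rotation(P, Q):
    # Optimal rotation R (det = +1) aligning centered point set P onto Q,
    # via SVD of the cross-covariance H = P^T Q.
    H = P.T @ Q
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])  # reflection fix keeps R in SO(3)
    return Vt.T @ D @ U.T

rng = np.random.default_rng(1)
Q = rng.normal(size=(10, 3))
Q -= Q.mean(axis=0)            # center the reference point set

# Build a random rotation and check Kabsch recovers the alignment exactly.
R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(R_true) < 0:
    R_true[:, 0] *= -1
P = Q @ R_true.T               # rotated copy of Q
R = kabsch_rotation(P, Q)
assert np.allclose((R @ P.T).T, Q, atol=1e-8)
```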
Paperid:877
Authors:Ilya Kaufman · Omri Azencot
Abstract:
Deep learning models with a large number of parameters, often referred to as overparameterized models, have achieved exceptional performance across various tasks. Despite concerns about overfitting, these models frequently generalize well to unseen data, thanks to effective regularization techniques, with data augmentation being among the most widely used. While data augmentation has shown great success in classification tasks using label-preserving transformations, its application in regression problems has received less attention. Recently, a novel manifold learning approach for generating synthetic data was proposed, utilizing a first-order approximation of the data manifold. Building on this foundation, we present a theoretical framework and practical tools for approximating and sampling general data manifolds. Furthermore, we introduce the Curvature-Enhanced Manifold Sampling (CEMS) method for regression tasks. CEMS leverages a second-order representation of the data manifold to enable efficient sampling and reconstruction of new data points. Extensive evaluations across multiple datasets and comparisons with state-of-the-art methods demonstrate that CEMS delivers superior performance in both in-distribution and out-of-distribution scenarios, while introducing only minimal computational overhead.
Paperid:878
Authors:Thomas Zeng · Shuibai Zhang · Shutong Wu · Christian Classen · Daewon Chae · Ethan Ewer · Minjae Lee · Heeju Kim · Wonjun Kang · Jackson Kunde · Ying Fan · Jungtaek Kim · HYUNG IL KOO · Kannan Ramchandran · Dimitris Papailiopoulos · Kangwook Lee
Abstract:
Process Reward Models (PRMs) have proven effective at enhancing mathematical reasoning for Large Language Models (LLMs) by leveraging increased inference-time computation. However, they are predominantly trained on mathematical data and their generalizability to non-mathematical domains has not been rigorously studied. In response, this work first shows that current PRMs have poor performance in other domains. To address this limitation, we introduce VersaPRM, a multi-domain PRM trained on synthetic reasoning data generated using our novel data generation and annotation method. VersaPRM achieves consistent performance gains across diverse domains. For instance, in the MMLU-Pro category of Law, VersaPRM via weighted majority voting achieves a 7.9% performance gain over the majority voting baseline, surpassing Qwen2.5-Math-PRM's gain of 1.3%. We further contribute to the community by open-sourcing all data, code and models for VersaPRM.
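The weighted majority voting used to apply a PRM at inference time has a one-function core: sum a reward score per sampled answer instead of counting votes. A minimal sketch with made-up answers and scores:

```python
from collections import defaultdict

def weighted_majority_vote(answers, reward_scores):
    # Aggregate sampled answers by summing a reward-model score per answer,
    # rather than counting one vote per sample.
    totals = defaultdict(float)
    for ans, score in zip(answers, reward_scores):
        totals[ans] += score
    return max(totals, key=totals.get)

answers = ["42", "41", "42", "41", "41"]
# Plain majority voting picks "41"...
assert max(set(answers), key=answers.count) == "41"
# ...but a PRM that scores the "42" chains highly flips the decision.
assert weighted_majority_vote(answers, [0.9, 0.2, 0.8, 0.1, 0.2]) == "42"
```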
Paperid:879
Authors:Kaiwen Wang · Dawen Liang · Nathan Kallus · Wen Sun
Abstract:
We study risk-sensitive RL, where the goal is to learn a history-dependent policy that optimizes some risk measure of cumulative rewards. We consider a family of risks called the optimized certainty equivalents (OCE), which captures important risk measures such as conditional value-at-risk (CVaR), entropic risk, and Markowitz's mean-variance. In this setting, we propose two meta-algorithms: one grounded in optimism and another based on policy gradients, both of which can leverage the broad suite of risk-neutral RL algorithms in an augmented Markov Decision Process (MDP). Via a reductions approach, we leverage theory for risk-neutral RL to establish novel OCE bounds in complex, rich-observation MDPs. For the optimism-based algorithm, we prove bounds that generalize prior results in CVaR RL and that provide the first risk-sensitive bounds for exogenous block MDPs. For the gradient-based algorithm, we establish both monotone improvement and global convergence guarantees under a discrete reward assumption. Finally, we empirically show that our algorithms learn the optimal history-dependent policy in a proof-of-concept MDP, where all Markovian policies provably fail.
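CVaR, one member of the OCE family the abstract names, can be computed from samples via the Rockafellar-Uryasev variational form $\sup_b \{ b - \mathbb{E}[(b - X)_+]/\alpha \}$; for empirical data the supremum is attained at one of the samples. A minimal sketch with made-up reward samples:

```python
def cvar(samples, alpha):
    # Empirical CVaR_alpha of rewards via sup_b { b - E[(b - X)+] / alpha }.
    # The objective is piecewise-linear and concave in b, so the optimum
    # is attained at one of the sample values.
    n = len(samples)
    best = -float("inf")
    for b in samples:
        val = b - sum(max(b - x, 0.0) for x in samples) / (alpha * n)
        best = max(best, val)
    return best

rewards = [1.0, 2.0, 3.0, 4.0]
# With alpha = 0.5, CVaR is the mean of the worst half: (1 + 2) / 2 = 1.5.
assert abs(cvar(rewards, 0.5) - 1.5) < 1e-12
# alpha -> 1 recovers the risk-neutral mean.
assert abs(cvar(rewards, 1.0) - 2.5) < 1e-12
```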
Paperid:880
Authors:Elliot Myunghoon Kim · Avi Garg · Kenny Peng · Nikhil Garg
Abstract:
While ecosystem diversity (e.g., in training data or model architecture) is often implicitly assumed to mitigate systemic bias and enable the benefits of multi-model collaboration, we lack empirical evidence about whether different LLMs are meaningfully different. We first conduct a large-scale empirical study of correlation across LLMs, using responses from models across multiple leaderboards. We find substantial correlation in model errors: on one leaderboard dataset, for example, models on average agree on the same wrong answer on 60\% of questions on which both are incorrect. We identify factors driving model correlation, including shared base model architecture and development organization. Crucially, however, independent components are insufficient for independent generations: larger, more accurate models are highly correlated, even when both err. We further examine the downstream impact of model correlation using a dataset of job descriptions and resumes, finding again that larger models and models from the same organization produce more correlated results. Experiments on simulated matching markets illustrate the impact of this ``algorithmic monoculture.''
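The headline statistic, agreement on the same wrong answer among jointly incorrect questions, is straightforward to compute. A minimal sketch with made-up answer sheets:

```python
def joint_error_agreement(answers_a, answers_b, gold):
    # Among questions BOTH models get wrong, the fraction where they give
    # the *same* wrong answer.
    both_wrong = [(a, b) for a, b, g in zip(answers_a, answers_b, gold)
                  if a != g and b != g]
    if not both_wrong:
        return 0.0
    return sum(a == b for a, b in both_wrong) / len(both_wrong)

gold = ["A", "B", "C", "D", "A"]
m1   = ["A", "C", "C", "A", "B"]
m2   = ["A", "C", "C", "B", "B"]
# Both wrong on questions 2, 4, 5; same wrong answer on questions 2 and 5.
assert abs(joint_error_agreement(m1, m2, gold) - 2 / 3) < 1e-12
```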
Paperid:881
Authors:Lexiang Hu · Yikang Li · Zhouchen Lin
Abstract:
Symmetry is widely applied in problems such as the design of equivariant networks and the discovery of governing equations, but in complex scenarios, it is not known in advance. Most previous symmetry discovery methods are limited to linear symmetries, and recent attempts to discover nonlinear symmetries fail to explicitly obtain the Lie algebra subspace. In this paper, we propose LieNLSD, which is, to our knowledge, the first method capable of determining the number of infinitesimal generators with nonlinear terms and their explicit expressions. We specify a function library for the infinitesimal group action and aim to solve for its coefficient matrix, proving that its prolongation formula for differential equations, which govern dynamic data, is also linear with respect to the coefficient matrix. By substituting the central differences of the data and the Jacobian matrix of the trained neural network into the infinitesimal criterion, we get a system of linear equations for the coefficient matrix, which can then be solved using SVD. On top quark tagging and a series of dynamic systems, LieNLSD shows qualitative advantages over existing methods and improves the long rollout accuracy of neural PDE solvers by over $20\%$ while also applying to guide data augmentation.
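The final step the abstract describes, solving a homogeneous linear system for the coefficient matrix via SVD, amounts to extracting a null-space basis: the right singular vectors with (numerically) zero singular values solve $Mc = 0$, and the null-space dimension counts the generators. The matrix below is a generic example, not the paper's actual linearized criterion.

```python
import numpy as np

def nullspace(M, tol=1e-10):
    # Basis for {c : M c = 0} via SVD: right singular vectors whose
    # singular values are below `tol`. Columns of the result span the
    # solution space; its width is the null-space dimension.
    _, s, Vt = np.linalg.svd(M)
    rank = int((s > tol).sum())
    return Vt[rank:].T  # shape (n, n - rank)

# A rank-2 map on R^3 has a one-dimensional null space.
M = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])  # third row = first + second
N = nullspace(M)
assert N.shape[1] == 1
assert np.allclose(M @ N, 0.0, atol=1e-9)
```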
Paperid:882
Authors:Yufei Huang · Changhu Wang · Junjie Tang · Weichi Wu · Ruibin Xi
Abstract:
Traditional network inference methods, such as Gaussian Graphical Models, which are built on continuity and homogeneity, face challenges when modeling discrete data and heterogeneous frameworks. Furthermore, under high-dimensionality, the parameter estimation of such models can be hindered by the notorious intractability of high-dimensional integrals. In this paper, we introduce a new and flexible device for graphical models, which accommodates diverse data types, including Gaussian, Poisson log-normal, and latent Gaussian copula models. The new device is driven by a new marginally recoverable parametric family, which can be effectively estimated without evaluating high-dimensional integrals, thanks to the marginal recoverability. We further introduce a mixture of marginally recoverable models to capture ubiquitous heterogeneous structures. We show the validity of the desirable properties of the models and the effective estimation methods, and demonstrate their advantages over the state-of-the-art network inference methods via extensive simulation studies and a gene regulatory network analysis of real single-cell RNA sequencing data.
Paperid:883
Authors:Anne Wu · Laurent Mazaré · Neil Zeghidour · Alexandre Défossez
Abstract:
We propose a novel preference alignment framework for improving spoken dialogue models on real-time conversations from user interactions. Current preference learning methods primarily focus on text-based language models, and are not directly suited to the complexities of real-time speech interactions, with richer dynamics (e.g. interruption, interjection) and no explicit segmentation between speaker turns. We create a large-scale dataset of more than 150,000 preference pairs from raw multi-turn speech conversations, annotated with AI feedback, to cover preferences over both linguistic content and temporal context variations. We leverage offline alignment methods to finetune a full-duplex autoregressive speech-to-speech model. Extensive experiments demonstrate that feedback on generic conversations can be consistently effective in improving spoken dialogue models to produce more factual, safer and more contextually aligned interactions. We deploy the finetuned model and conduct holistic human evaluations to assess the impact beyond single-turn conversations. Our findings shed light on the importance of a well-calibrated balance among various dynamics, crucial for natural real-time speech dialogue systems.
Paperid:884
Authors:Andi Han · Wei Huang · Zhanpeng Zhou · Gang Niu · Wuyang Chen · Junchi Yan · Akiko Takeda · Taiji Suzuki
Abstract:
Deep learning with noisy labels presents significant challenges. In this work, we theoretically characterize the role of label noise from a feature learning perspective. Specifically, we consider a signal-noise data distribution, where each sample comprises a label-dependent signal and label-independent noise, and rigorously analyze the training dynamics of a two-layer convolutional neural network under this data setup, along with the presence of label noise. Our analysis identifies two key stages. In Stage I, the model perfectly fits all the clean samples (i.e., samples without label noise) while ignoring the noisy ones (i.e., samples with noisy labels). During this stage, the model learns the signal from the clean samples, which generalizes well on unseen data. In Stage II, as the training loss converges, the gradient in the direction of noise surpasses that of the signal, leading to overfitting on noisy samples. Eventually, the model memorizes the noise present in the noisy samples and degrades its generalization ability. Furthermore, our analysis provides a theoretical basis for two widely used techniques for tackling label noise: early stopping and sample selection. Experiments on both synthetic and real-world setups validate our theory.
Paperid:885
Authors:Tzu-Tao (Tommy) Chang · Shivaram Venkataraman
Abstract:
Cross-attention is commonly adopted in multimodal large language models (MLLMs) for integrating visual information into the language backbone. However, in applications with large visual inputs, such as video understanding, processing a large number of visual tokens in cross-attention layers leads to high memory demands and often necessitates distributed computation across multiple GPUs. Existing distributed attention mechanisms face significant communication overheads, making cross-attention layers a critical bottleneck for efficient training and inference of MLLMs. To address this, we propose LV-XAttn, a distributed, exact cross-attention mechanism with minimal communication overhead. We observe that in applications involving large visual inputs the size of the query block is typically much smaller than that of the key-value blocks. Thus, in LV-XAttn we keep the large key-value blocks locally on each GPU and exchange smaller query blocks across GPUs. We also introduce an efficient activation recomputation technique enabling support for longer visual context. We theoretically analyze the communication benefits of LV-XAttn and show that it can achieve speedups for a wide range of models. Our evaluations with mPLUG-Owl3 and OpenFlamingo models find that LV-XAttn achieves up to 5.58$\times$ end-to-end speedup compared to existing approaches.
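Exactness when key-value blocks stay sharded across devices rests on the standard online-softmax merge: each shard returns its partial attention output plus log-sum-exp statistics, and these combine into exactly the full-sequence result. The numpy sketch below demonstrates that merge rule on random data; it illustrates the enabling identity, not LV-XAttn's communication schedule.

```python
import numpy as np

def partial_attn(q, K, V):
    # Attention of one query against a local key-value shard, returning the
    # normalized output plus the statistics (max logit m, softmax mass s)
    # needed to merge shards exactly.
    logits = K @ q
    m = logits.max()
    w = np.exp(logits - m)
    s = w.sum()
    return w @ V / s, m, s

def merge(parts):
    # Exact combination of per-shard partial attentions (online-softmax merge):
    # rescale each shard's mass to a common max logit, then mix the outputs.
    m_star = max(m for _, m, _ in parts)
    scale = [s * np.exp(m - m_star) for _, m, s in parts]
    total = sum(scale)
    return sum(c * o for (o, _, _), c in zip(parts, scale)) / total

rng = np.random.default_rng(2)
q = rng.normal(size=4)
K = rng.normal(size=(12, 4))
V = rng.normal(size=(12, 4))

full, _, _ = partial_attn(q, K, V)               # single-device reference
merged = merge([partial_attn(q, K[:5], V[:5]),   # shard resident on "GPU 0"
                partial_attn(q, K[5:], V[5:])])  # shard resident on "GPU 1"
assert np.allclose(merged, full)
```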
Paperid:886
Authors:Zongqian Wu · Baoduo Xu · Ruochen Cui · Mengmeng Zhan · Xiaofeng Zhu · Lei Feng
Abstract:
Chain-of-thought (CoT) reasoning has emerged as an effective approach for activating latent capabilities in LLMs. Interestingly, we observe that both CoT reasoning and self-training share the same core objective: iteratively leveraging model-generated information to progressively reduce prediction uncertainty. Building on this insight, we propose a novel CoT framework to improve reasoning performance. Our framework integrates two key components: (i) a task-specific prompt module that optimizes initial reasoning processes, and (ii) an adaptive reasoning iteration module that dynamically refines the reasoning process and addresses the limitations of previous CoT approaches, i.e., over-reasoning and high similarity between consecutive reasoning iterations. Extensive experiments demonstrate that the proposed method achieves significant advantages in both performance and computational efficiency.
Paperid:887
Authors:Aryan Gulati · Brando Miranda · Eric Chen · Emily Xia · Kai Fronsdal · Bruno de Moraes Dumont · Sanmi Koyejo
Abstract:
As large language models (LLMs) continue to advance, many existing benchmarks designed to evaluate their reasoning capabilities are becoming saturated. Therefore, we present the Putnam-AXIOM Original benchmark consisting of 522 mathematical problems from the William Lowell Putnam Mathematical Competition. To preserve the Putnam-AXIOM benchmark's validity and mitigate potential data contamination, we created the Putnam-AXIOM Variation benchmark with functional variations of 100 problems. By programmatically altering problem elements like variables and constants, we can generate unlimited novel, equally challenging problems not found online. We see that almost all models have significantly lower accuracy on the variations than on the original problems. Our results reveal that OpenAI's o1-preview, the best performing model, achieves 41.94\% accuracy on Putnam-AXIOM Original but experiences a nearly 20\% reduction in accuracy on the variation dataset compared to the corresponding original problems. Moreover, we explore metrics beyond boxed accuracy to assess models on complex tasks like natural language theorem proving, crucial for evaluating reasoning capabilities in depth, opening the possibility for open-ended evaluation of reasoning strings.
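The idea of programmatically altering problem elements can be sketched as follows; the template, constants, and answer function are hypothetical illustrations, not actual Putnam-AXIOM problems.

```python
import random

def variation(seed):
    # Hypothetical "functional variation": resample the constants of a problem
    # template so the surface form changes while difficulty is preserved, and
    # recompute the ground-truth answer programmatically.
    rng = random.Random(seed)
    n = rng.randint(10, 99)
    problem = f"Compute the sum 1 + 2 + ... + {n}."
    answer = n * (n + 1) // 2
    return problem, answer

for s in range(3):
    print(variation(s))  # three novel, equally difficult instances
```

Because the answer is recomputed from the sampled constants, every variation has a verifiable ground truth even though the exact problem text never appeared online.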
Paperid:888
Authors:Aditya Taparia · Som Sagar · Ransalu Senanayake
Abstract:
Understanding the inner representation of a neural network helps users improve models. Concept-based methods have become a popular choice for explaining deep neural networks post-hoc because, unlike most other explainable AI techniques, they can be used to test high-level visual "concepts" that are not directly related to feature attributes. For instance, the concept of "stripes" is important to classify an image as a zebra. Concept-based explanation methods, however, require practitioners to guess and manually collect multiple candidate concept image sets, making the process labor-intensive and prone to overlooking important concepts. Addressing this limitation, in this paper, we frame concept image set creation as an image generation problem. However, since naively using a standard generative model does not result in meaningful concepts, we devise a reinforcement learning-based preference optimization (RLPO) algorithm that fine-tunes a vision-language generative model from approximate textual descriptions of concepts. Through a series of experiments, we demonstrate our method's ability to efficiently and reliably articulate diverse concepts that are otherwise challenging to craft manually.
Paperid:889
Authors:Fabiola Ricci · Lorenzo Bardone · Sebastian Goldt
Abstract:
Deep neural networks learn structured features from complex, non-Gaussian inputs, but the mechanisms behind this process remain poorly understood. Our work is motivated by the observation that the first-layer filters learnt by deep convolutional neural networks from natural images resemble those learnt by independent component analysis (ICA), a simple unsupervised method that seeks the most non-Gaussian projections of its inputs. This similarity suggests that ICA provides a simple, yet principled model for studying feature learning. Here, we leverage this connection to investigate the interplay between data structure and optimisation in feature learning for the most popular ICA algorithm, FastICA, and stochastic gradient descent (SGD), which is used to train deep networks. We rigorously establish that FastICA requires at least $n\gtrsim d^4$ samples to recover a single non-Gaussian direction from $d$-dimensional inputs on a simple synthetic data model. We show that vanilla online SGD outperforms FastICA, and prove that the optimal sample complexity $n\gtrsim d^2$ can be reached by smoothing the loss, albeit in a data-dependent way. We finally demonstrate the existence of a search phase for FastICA on ImageNet, and discuss how the strong non-Gaussianity of said images compensates for the poor sample complexity of FastICA.
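The one-unit FastICA fixed-point iteration analyzed above can be sketched on synthetic whitened data containing a single non-Gaussian (uniform) direction hidden among Gaussian ones; this toy setup is an illustrative assumption, not the paper's data model or experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20000
# One non-Gaussian (uniform, unit-variance) direction among Gaussian ones.
S = rng.normal(size=(n, d))
S[:, 0] = rng.uniform(-np.sqrt(3), np.sqrt(3), size=n)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random rotation mixes the sources
X = S @ Q.T                                    # whitened by construction
target = Q[:, 0]                               # direction to recover

def fastica_one_unit(X, iters=50):
    # Fixed-point iteration: w <- E[x g(w.x)] - E[g'(w.x)] w, then renormalize,
    # with the common g = tanh nonlinearity.
    g, dg = np.tanh, lambda t: 1 - np.tanh(t) ** 2
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(iters):
        t = X @ w
        w = (X * g(t)[:, None]).mean(axis=0) - dg(t).mean() * w
        w /= np.linalg.norm(w)
    return w

w = fastica_one_unit(X)
print(abs(w @ target))  # close to 1 when the non-Gaussian direction is recovered
```

With far fewer samples the iteration wanders, which is the sample-complexity bottleneck the abstract quantifies.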
Paperid:890
Authors:Hongli Zhan · Muneeza Azmat · Raya Horesh · Junyi Jessy Li · Mikhail Yurochkin
Abstract:
Aligning Large Language Models to integrate and reflect human values, especially for tasks that demand intricate human oversight, is arduous since it is resource-intensive and time-consuming to depend on human expertise for context-specific guidance. Prior work has utilized predefined sets of rules or principles to steer the behavior of models (Bai et al., 2022; Sun et al., 2023). However, these principles tend to be generic, making it challenging to adapt them to each individual input query or context. In this work, we present Situated-PRInciples (SPRI), a framework requiring minimal or no human effort that is designed to automatically generate guiding principles in real-time for each input query and utilize them to align each response. We evaluate SPRI on three tasks, and show that 1) SPRI can derive principles in a complex domain-specific task, leading to performance on par with expert-crafted principles; 2) SPRI-generated principles lead to instance-specific rubrics that outperform prior LLM-as-a-judge frameworks; 3) using SPRI to generate synthetic SFT data leads to substantial improvement on truthfulness.
Paperid:891
Authors:An Vo · Mohammad Reza Taesiri · Anh Nguyen · Daeyoung Kim
Abstract:
Large language models (LLMs) were found to contain strong gender biases (e.g., against females) or numerical biases (e.g., for the number 7). We test whether LLMs are able to output less biased answers when allowed to observe their prior answers to the same question in a multi-turn conversation. For a thorough evaluation of LLM biases across different question types, we propose a set of questions spanning 9 topics across 4 categories: questions that ask for Subjective opinions; Random answers; or objective answers to real-world Easy or Hard questions. Interestingly, LLMs are able to "de-bias" themselves in multi-turn settings in response to Random questions but not the other categories. Furthermore, we propose B-score, a novel metric that is effective in detecting biases to Subjective, Random, Easy, and Hard questions. On MMLU, HLE, and CSQA, leveraging B-score substantially improves the verification accuracy of LLM answers (i.e., accepting LLM correct answers and rejecting incorrect ones) compared to using verbalized confidence scores or single-turn probabilities alone.
Paperid:892
Authors:Lars van der Laan · Ahmed Alaa
Abstract:
Ensuring model calibration is critical for reliable predictions, yet popular distribution-free methods like histogram binning and isotonic regression provide only asymptotic guarantees. We introduce a unified framework for Venn and Venn-Abers calibration, generalizing Vovk's binary classification approach to arbitrary prediction tasks and loss functions. Venn calibration leverages binning calibrators to construct prediction sets that contain at least one marginally perfectly calibrated prediction in finite samples, while these sets generally converge asymptotically to conditionally calibrated point predictions, effectively capturing epistemic uncertainty. Furthermore, we propose Venn multicalibration, a novel methodology for finite-sample calibration across subpopulations. For quantile loss, Mondrian and multicalibrated conformal prediction arise as special cases of Venn calibration, and multicalibration produces intervals achieving quantile-conditional coverage.
Paperid:893
Authors:Xiuyi Jia · Jinchi Li · Yunan Lu · Weiwei Li
Abstract:
Label distribution learning (LDL) is a novel machine learning paradigm that can handle label ambiguity. This paper focuses on the interpretability issue of label distribution learning. Existing local interpretability models are mainly designed for single-label learning problems and are difficult to directly apply to interpreting label distribution learning models. In response to this situation, we propose an improved LIME algorithm that can effectively interpret any black-box model in label distribution learning. To address the label dependency problem, we introduce the feature attribution distribution matrix and derive the solution formula for explanations under the label distribution form. Meanwhile, to enhance the transparency and trustworthiness of the explanation algorithm, we provide an analytical solution and derive the boundary conditions for explanation convergence and stability. In addition, we design a feature selection scoring function and a fidelity metric for the explanation task of label distribution learning. A series of numerical experiments and human experiments were conducted to validate the performance of the proposed algorithm in practical applications. The experimental results demonstrate that the proposed algorithm achieves high fidelity, consistency, and trustworthiness in explaining LDL models.
Paperid:894
Authors:Andres Guzman Cordero · Floor Eijkelboom · Jan-Willem van de Meent
Abstract:
While denoising diffusion and flow matching have driven major advances in generative modeling, their application to tabular data remains limited, despite its ubiquity in real-world applications. To this end, we develop TabbyFlow, a variational Flow Matching (VFM) method for tabular data generation. To apply VFM to data with mixed continuous and discrete features, we introduce Exponential Family Variational Flow Matching (EF-VFM), which represents heterogeneous data types using a general exponential family distribution. We hereby obtain an efficient, data-driven objective based on moment matching, enabling principled learning of probability paths over mixed continuous and discrete variables. We also establish a connection between variational flow matching and generalized flow matching objectives based on Bregman divergences. Evaluation on tabular data benchmarks demonstrates state-of-the-art performance compared to baselines.
Paperid:895
Authors:CAN CHEN · Jun-Kun Wang
Abstract:
Developing algorithms to differentiate between machine-generated texts and human-written texts has garnered substantial attention in recent years. Existing methods in this direction typically concern an offline setting where a dataset containing a mix of real and machine-generated texts is given upfront, and the task is to determine whether each sample in the dataset is from a large language model (LLM) or a human. However, in many practical scenarios, sources such as news websites, social media accounts, or other forums publish content in a streaming fashion. Therefore, in this online scenario, how to quickly and accurately determine whether the source is an LLM with strong statistical guarantees is crucial for these media or platforms to function effectively and prevent the spread of misinformation and other potential misuse of LLMs. To tackle the problem of online detection, we develop an algorithm based on the techniques of sequential hypothesis testing by betting that not only builds upon and complements existing offline detection techniques but also enjoys statistical guarantees, which include a controlled false positive rate and the expected time to correctly identify a source as an LLM. Experiments were conducted to demonstrate the effectiveness of our method.
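Sequential testing by betting can be illustrated with a minimal sketch: a wealth process that multiplies by a bet on each incoming observation and stops once it exceeds $1/\alpha$, which controls the false positive rate via Ville's inequality. The detector scores and fixed betting fraction below are illustrative assumptions, not the paper's method.

```python
import numpy as np

def betting_test(xs, alpha=0.05, lam=0.5):
    # Test H0: E[X] <= 0.5 for observations in [0, 1]. Under H0 the wealth
    # process is a nonnegative supermartingale, so by Ville's inequality
    # P(wealth ever reaches 1/alpha) <= alpha: a controlled false positive rate.
    wealth = 1.0
    for t, x in enumerate(xs, 1):
        wealth *= 1.0 + lam * (x - 0.5)  # bet a fixed fraction each round
        if wealth >= 1.0 / alpha:
            return t                     # stopping time: declare the source an LLM
    return None                          # never enough evidence to reject

rng = np.random.default_rng(0)
llm_scores = rng.uniform(0.3, 1.0, size=2000)    # hypothetical detector scores, mean 0.65
human_scores = rng.uniform(0.0, 1.0, size=2000)  # hypothetical detector scores, mean 0.5

print(betting_test(llm_scores))    # rejects after a modest number of observations
print(betting_test(human_scores))  # usually None: H0 is not rejected
```

Practical schemes adapt the betting fraction `lam` to the data stream rather than fixing it, which shortens the expected time to detection.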
Paperid:896
Authors:Hojoon Lee · Youngdo Lee · Takuma Seno · Donghu Kim · Peter Stone · Jaegul Choo
Abstract:
Scaling up the model size and computation has brought consistent performance improvements in supervised learning. However, this lesson often fails to apply to reinforcement learning (RL) because training the model on non-stationary data easily leads to overfitting and unstable optimization. In response, we introduce SimbaV2, a novel RL architecture designed to stabilize optimization by (i) constraining the growth of weight and feature norms through hyperspherical normalization; and (ii) using distributional value estimation with reward scaling to maintain stable gradients under varying reward magnitudes. Using the soft actor-critic as a base algorithm, SimbaV2 scales up effectively with larger models and greater compute, achieving state-of-the-art performance on 57 continuous control tasks across 4 domains.
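Component (i), hyperspherical normalization, can be sketched as a projection applied after each parameter update; the toy loop below is an assumption-laden illustration of the norm constraint, not the SimbaV2 architecture.

```python
import numpy as np

def hypersphere_project(W, scale=1.0):
    # Constrain each weight vector (row) to a fixed-radius hypersphere,
    # preventing the unbounded norm growth that destabilizes optimization
    # on non-stationary RL data.
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return scale * W / np.maximum(norms, 1e-8)

# Toy update loop: gradient steps would grow the norms; projection keeps them fixed.
rng = np.random.default_rng(0)
W = hypersphere_project(rng.normal(size=(4, 16)))
for _ in range(100):
    grad = rng.normal(size=W.shape)
    W = hypersphere_project(W - 0.1 * grad)

print(np.linalg.norm(W, axis=1))  # every row stays at norm 1.0
```

Because the norms are pinned, the effective learning rate no longer drifts as weights grow, one plausible reading of why the constraint stabilizes training.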
Paperid:897
Authors:Cheng Zhang · Hanna Foerster · Robert Mullins · Yiren Zhao · Ilia Shumailov
Abstract:
It is now a common business practice to buy access to large language model (LLM) inference rather than self-host, because of significant upfront hardware infrastructure and energy costs. However, as a buyer, there is no mechanism to verify the authenticity of the advertised service, including the serving hardware platform, e.g. that it is actually being served using an NVIDIA H100. Furthermore, there are reports suggesting that model providers may deliver models that differ slightly from the advertised ones, often to make them run on less expensive hardware. That way, a client pays a premium for access to a capable model on more expensive hardware, yet ends up being served by a (potentially less capable) cheaper model on cheaper hardware. In this paper we introduce hardware and software platform inference (HSPI), a method for identifying the underlying GPU architecture and software stack of a (black-box) machine learning model solely based on its input-output behavior. Our method leverages the inherent differences of various GPU architectures and compilers to distinguish between different GPU types and software stacks. By analyzing the numerical patterns in the model's outputs, we propose a classification framework capable of accurately identifying the GPU used for model inference as well as the underlying software configuration. Our findings demonstrate the feasibility of inferring GPU type from black-box models. We evaluate HSPI against models served on different real hardware and find that in a white-box setting we can distinguish between different GPUs with between 83.9% and 100% accuracy. Even in a black-box setting we are able to achieve results that are up to three times higher than random guess accuracy.
Paperid:898
Authors:RISHI JINKA · Venkata Sai Mothish Gonugunta · Deepak N. Subramani
Abstract:
Time-series forecasting finds application across domains such as finance, climate science, and energy systems. We introduce the Conditional Diffusion with Nonlinear Data Transformation Model (CN-Diff), a generative framework that employs novel nonlinear transformations and learnable conditions in the forward process for time series forecasting. A new loss formulation for training is proposed, along with a detailed derivation of both the forward and reverse processes. The new additions improve the diffusion model's capacity to capture complex time series patterns, thus simplifying the reverse process. Our novel condition facilitates learning an efficient prior distribution. This also reduces the gap between the true negative log-likelihood and its variational approximation. CN-Diff is shown to perform better than other leading time series models on nine real-world datasets. Ablation studies are conducted to elucidate the role of each component of CN-Diff.
Paperid:899
Authors:Kosuke Sugiyama · Masato Uchida
Abstract:
In many real-world applications, predictive tasks inevitably involve low-quality input features (Weak Features; WFs) which arise due to factors such as misobservations, missingness, or partial observations. While several methods have been proposed to estimate the true values of specific types of WFs and to solve a downstream task, a unified theoretical framework that comprehensively addresses these methods remains underdeveloped. In this paper, we propose a unified framework called Weak Features Learning (WFL), which accommodates arbitrary discrete WFs and a broad range of learning algorithms, and we demonstrate its validity. Furthermore, we introduce a class of algorithms that learn both the estimation model for WFs and the predictive model for a downstream task and perform a generalization error analysis under finite-sample conditions. Our results elucidate the interdependencies between the estimation errors of WFs and the prediction error of a downstream task, as well as the theoretical conditions necessary for the learning approach to achieve consistency. This work establishes a unified theoretical foundation, providing generalization error analysis and performance guarantees, even in scenarios where WFs manifest in diverse forms.
Paperid:900
Authors:Maria Despoina Siampou · Jialiang Li · John Krumm · Cyrus Shahabi · Hua Lu
Abstract:
Encoding geospatial objects is fundamental for geospatial artificial intelligence (GeoAI) applications, which leverage machine learning (ML) models to analyze spatial information. Common approaches transform each object into known formats, like image and text, for compatibility with ML models. However, this process often discards crucial spatial information, such as the object's position relative to the entire space, reducing downstream task effectiveness. Alternative encoding methods that preserve some spatial properties are often devised for specific data objects (e.g., point encoders), making them unsuitable for tasks that involve different data types (i.e., points, polylines, and polygons). To this end, we propose Poly2Vec, a polymorphic Fourier-based encoding approach that unifies the representation of geospatial objects, while preserving the essential spatial properties. Poly2Vec incorporates a learned fusion module that adaptively integrates the magnitude and phase of the Fourier transform for different tasks and geometries. We evaluate Poly2Vec on five diverse tasks, organized into two categories. The first empirically demonstrates that Poly2Vec consistently outperforms object-specific baselines in preserving three key spatial relationships: topology, direction, and distance. The second shows that integrating Poly2Vec into a state-of-the-art GeoAI workflow improves the performance in two popular tasks: population prediction and land use inference.
Paperid:901
Authors:Minh Hieu Nong · Antoine Ledent
Abstract:
Contrastive Representation Learning (CRL) has achieved impressive success in various domains in recent years. Nevertheless, the theoretical understanding of the generalization behavior of CRL is limited. Moreover, to the best of our knowledge, the current literature only analyzes generalization bounds under the assumption that the data tuples used for contrastive learning are independently and identically distributed. However, in practice, we are often limited to a fixed pool of reusable labeled data points, making it inevitable to recycle data across tuples to create sufficiently large datasets. Therefore, the tuple-wise independence condition imposed by previous works is invalidated. In this paper, we provide a generalization analysis for the CRL framework under non-i.i.d. settings that adheres to practice more realistically. Drawing inspiration from the literature on U-statistics, we derive generalization bounds which indicate that the required number of samples in each class scales as the logarithm of the covering number of the class of learnable feature representations associated with each class. Next, we apply our main results to derive excess risk bounds for common function classes such as linear functions and neural networks.
Paperid:902
Authors:Yiwei Wu · Atticus Geiger · Raphaël Millière
Abstract:
Variable binding -- the ability to associate variables with values -- is fundamental to symbolic computation and cognition. While typically implemented via addressable memory in classical architectures, we are only beginning to uncover how modern neural networks without built-in binding operations can acquire this capacity. We investigate further by training a Transformer on the task of dereferencing queried variables in synthetic programs where variables are assigned either numerical constants or other variables. Each program requires following chains of variable assignments up to four steps deep to find the queried value, and also contains irrelevant chains of assignments acting as distractors. Our analysis reveals a developmental trajectory with three distinct phases during training: (1) random prediction of numerical constants, (2) a shallow heuristic prioritizing early variable assignments, and (3) the emergence of a systematic mechanism for dereferencing. Using causal interventions, we find that the model learns to exploit the residual stream as an addressable memory space, with specialized attention heads routing information across token positions. This mechanism allows the model to dynamically track variable bindings across layers, culminating in accurate dereferencing. Notably, the network does not store the full program state, suggesting the development of an efficient strategy optimized for the task. Our results demonstrate that Transformer models can learn to implement key aspects of symbolic computation, including systematic variable binding, without explicit architectural support. This work bridges connectionist and symbolic approaches, offering new insights into how neural networks might acquire structured reasoning capabilities.
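The synthetic dereferencing task described above can be sketched as a small generator plus a reference resolver; the program format, variable names, and chain depths below are illustrative assumptions, not the paper's exact setup.

```python
import random

def make_program(depth=3, n_distractors=2, seed=0):
    # Build a chain of assignments ending at a constant, plus irrelevant
    # distractor assignments, then shuffle line order.
    rng = random.Random(seed)
    names = iter("abcdefghijklmnopqrstuvwxyz")
    lines, prev = [], None
    for _ in range(depth + 1):
        v = next(names)
        lines.append(f"{v}={rng.randint(0, 9)}" if prev is None else f"{v}={prev}")
        prev = v
    for _ in range(n_distractors):  # distractor chains of length one
        v = next(names)
        lines.append(f"{v}={rng.randint(0, 9)}")
    rng.shuffle(lines)
    return lines, prev              # program lines + the queried variable

def dereference(lines, query):
    # Ground-truth resolver: follow the assignment chain to a constant.
    env = dict(line.split("=") for line in lines)
    while query in env:
        query = env[query]
    return int(query)

prog, q = make_program()
print(prog, "query:", q)
print(dereference(prog, q))
```

A model trained on such programs must implement exactly the chain-following behavior that `dereference` spells out explicitly.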
Paperid:903
Authors:Konstantinos Oikonomidis · Jan Quan · Emanuel Laude · Panagiotis Patrinos
Abstract:
We analyze nonlinearly preconditioned gradient methods for solving smooth minimization problems. We introduce a generalized smoothness property, based on the notion of abstract convexity, that is broader than Lipschitz smoothness, and provide sufficient first- and second-order conditions. Notably, our framework encapsulates algorithms associated with the clipping gradient method and brings out novel insights for the class of $(L_0,L_1)$-smooth functions that has received widespread interest recently, thus allowing us to go beyond already established methods. We investigate the convergence of the proposed method in both the convex and nonconvex setting.
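Clipping-style step sizes for $(L_0,L_1)$-smooth objectives can be illustrated with a minimal sketch; the particular rule $\eta = 1/(L_0 + L_1\|g\|)$ and the toy objective below are standard illustrations from this literature, not necessarily the preconditioners the paper analyzes.

```python
import numpy as np

def clipped_gd(grad_f, x0, L0, L1, iters=500):
    # Step size eta = 1/(L0 + L1*||g||): behaves like 1/L0 gradient descent
    # when gradients are small and like normalized descent when they are
    # large, matching the (L0, L1)-smoothness model ||Hess f|| <= L0 + L1*||grad f||.
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad_f(x)
        x = x - g / (L0 + L1 * np.linalg.norm(g))
    return x

# Toy objective f(x) = exp(x[0]) + x[1]**2: (L0, L1)-smooth but not
# globally Lipschitz-smooth, since the Hessian grows with exp(x[0]).
grad = lambda x: np.array([np.exp(x[0]), 2 * x[1]])
x = clipped_gd(grad, [5.0, 3.0], L0=2.0, L1=1.0)
print(np.linalg.norm(grad(x)))  # gradient norm driven toward zero
```

With a fixed step size $1/L$ this objective would require $L$ to exceed the largest curvature along the trajectory; the clipped rule adapts automatically.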
Paperid:904
Authors:Lixin Yuan · Yirui Wu · WENXIAO ZHANG · Minglei Yuan · Jun Liu
Abstract:
For data with intra-class Asymmetric instance Distribution or Multiple High-density Clusters (ADMHC), outliers are real and have specific patterns for data classification, where the class body is necessary and difficult to identify. Previous Feature Selection (FS) methods score features based on all training instances or rarely target intra-class ADMHC. In this paper, we propose a supervised FS method, Stray Intrusive Outliers-based FS (SIOFS), for data classification with intra-class ADMHC. By focusing on Stray Intrusive Outliers (SIOs), SIOFS modifies the skewness coefficient and fuses the threshold in the 3$\sigma$ principle to identify the class body, scoring features based on the intrusion degree of SIOs. In addition, the refined density-mean center is proposed to represent the general characteristics of the class body reasonably. Mathematical formulations, proofs, and logical exposition ensure the rationality and universality of the settings in the proposed SIOFS method. Extensive experiments on 15 diverse benchmark datasets demonstrate the superiority of SIOFS over 12 state-of-the-art FS methods in terms of classification accuracy, normalized mutual information, and confusion matrix.
Paperid:905
Authors:Felix Hofstätter · Teun van der Weij · Jayden Teoh · Henning Bartsch · Rada Djoneva · Francis Rhys Ward
Abstract:
Capability evaluations are required to understand and regulate AI systems that may be deployed or further developed. Therefore, it is important that evaluations provide an accurate estimation of an AI system’s capabilities. However, in numerous cases, previously latent capabilities have been elicited from models, sometimes long after initial release. Accordingly, substantial efforts have been made to develop methods for eliciting latent capabilities from models. In this paper, we evaluate the effectiveness of capability elicitation techniques by intentionally training model organisms – language models with hidden capabilities that are revealed by a password. We introduce a novel method for training model organisms, based on circuit-breaking, which is more robust to elicitation techniques than standard password-locked models. We focus on elicitation techniques based on prompting and activation steering, and compare these to fine-tuning methods. Prompting techniques can elicit the actual capability of both password-locked and circuit-broken model organisms in an MCQA setting, while steering fails to do so. For a code-generation task, only fine-tuning can elicit the hidden capabilities of our novel model organism. Additionally, our results suggest that combining techniques improves elicitation. Still, if possible, fine-tuning should be the method of choice to improve the trustworthiness of capability evaluations.
Paperid:906
Authors:Laines Schmalwasser · Niklas Penzel · Joachim Denzler · Julia Niebling
Abstract:
Concepts such as objects, patterns, and shapes are how humans understand the world. Building on this intuition, concept-based explainability methods aim to study representations learned by deep neural networks in relation to human-understandable concepts. Here, Concept Activation Vectors (CAVs) are an important tool and can identify whether a model learned a concept or not. However, the computational cost and time requirements of existing CAV computation pose a significant challenge, particularly in large-scale, high-dimensional architectures. To address this limitation, we introduce FastCAV, a novel approach that accelerates the extraction of CAVs by up to 63.6× (on average 46.4×). We provide a theoretical foundation for our approach and give concrete assumptions under which it is equivalent to established SVM-based methods. Our empirical results demonstrate that CAVs calculated with FastCAV maintain similar performance while being more efficient and stable. In downstream applications, i.e., concept-based explanation methods, we show that FastCAV can act as a replacement leading to equivalent insights. Hence, our approach enables previously infeasible investigations of deep models, which we demonstrate by tracking the evolution of concepts during model training.
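A CAV is a direction in activation space that separates concept examples from random ones. The sketch below uses a difference-of-means direction as a simple stand-in; FastCAV's exact procedure, and the assumptions under which it matches SVM-based CAVs, are given in the paper, and the synthetic "activations" here are illustrative.

```python
import numpy as np

def mean_difference_cav(concept_acts, random_acts):
    # A cheap CAV estimate: the normalized difference of class means in
    # activation space (the classic approach instead fits a linear SVM and
    # takes the hyperplane normal, which is far more expensive at scale).
    v = concept_acts.mean(axis=0) - random_acts.mean(axis=0)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
true_dir = np.array([1.0, 0.0, 0.0, 0.0])     # planted "concept" direction
concept = rng.normal(size=(200, 4)) + 3.0 * true_dir
random_acts = rng.normal(size=(200, 4))

cav = mean_difference_cav(concept, random_acts)
print(cav @ true_dir)  # close to 1: the concept direction is recovered
```

The appeal of such closed-form directions is that they avoid an iterative SVM solve per concept, which is the kind of speedup the abstract reports.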
Paperid:907
Authors:Benjamin Eyre · David Madras
Abstract:
The availability of machine learning systems that can effectively perform arbitrary tasks has led to synthetic labels from these systems being used in applications of statistical inference, such as data analysis or model evaluation. The Prediction Powered Inference (PPI) framework provides a way of leveraging both a large pool of pseudo-labelled data and a small sample with real, high-quality labels to produce a low-variance, unbiased estimate of the quantity being evaluated. Most work on PPI considers a relatively sizable set of labelled samples, which can be resource intensive to obtain. However, we find that when labelled data is scarce, the PPI++ method can perform even worse than classical inference. We analyze this phenomenon by relating PPI++ to ordinary least squares regression, which also experiences high variance with small sample sizes, and use this regression framework to better understand the efficacy of PPI. Motivated by this, we present two new PPI-based techniques that leverage robust regressors to produce even lower variance estimators in the few-label regime.
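The PPI mean estimator combines predictions on a large pseudo-labelled pool with a bias correction from the small labelled sample; a minimal sketch with synthetic data (the `lam` parameter is the knob PPI++ tunes, and the predictor here is an illustrative assumption):

```python
import numpy as np

def ppi_mean(fx_unlab, y_lab, fx_lab, lam=1.0):
    # Prediction-powered estimate of E[Y]: average predictions on the big
    # unlabelled pool, corrected by the labelled sample's prediction error.
    # lam=1 is classic PPI; lam=0 recovers classical inference on labels only.
    return lam * fx_unlab.mean() + (y_lab - lam * fx_lab).mean()

rng = np.random.default_rng(0)
N, n = 100000, 50                              # many pseudo-labels, few real labels
x_unlab, x_lab = rng.normal(size=N), rng.normal(size=n)
f = lambda x: x + 0.2                          # systematically biased predictor
y_lab = x_lab + rng.normal(scale=0.1, size=n)  # true E[Y] = 0

est = ppi_mean(f(x_unlab), y_lab, f(x_lab))
print(est)  # near 0: the predictor's +0.2 bias is cancelled by the correction
```

With only `n = 50` labels, the variance of the correction term dominates, which is exactly the few-label regime where the abstract argues PPI++ can underperform classical inference.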
Paperid:908
Authors:Shahaf Bassan · Guy Amir · Meirav Zehavi · Guy Katz
Abstract:
Ensemble models are widely recognized in the ML community for their limited interpretability. For instance, while a single decision tree is considered interpretable, ensembles of trees (e.g., boosted trees) are often treated as black boxes. Despite this folklore recognition, there remains a lack of rigorous mathematical understanding of what particularly makes an ensemble (un)-interpretable, including how fundamental factors like the (1) *number*, (2) *size*, and (3) *type* of base models influence its interpretability. In this work, we seek to bridge this gap by applying concepts from computational complexity theory to study the challenges of generating explanations for various ensemble configurations. Our analysis uncovers nuanced complexity patterns influenced by various factors. For example, we demonstrate that under standard complexity assumptions like P$\neq$NP, interpreting ensembles remains intractable even when base models are of constant size. Surprisingly, the complexity changes drastically with the number of base models: small ensembles of decision trees are efficiently interpretable, whereas ensembles of linear models remain intractable, even with a constant number of models. We believe that our findings provide a more robust foundation for understanding the interpretability of ensembles, emphasizing the benefits of examining it through a computational complexity lens.
Paperid:909
Authors:Sumeet Batra · Gaurav Sukhatme
Abstract:
Generalizing vision-based reinforcement learning (RL) agents to novel environments remains a difficult and open challenge. Current trends are to collect large-scale datasets or use data augmentation techniques to prevent overfitting and improve downstream generalization. However, the computational and data collection costs increase exponentially with the number of task variations and can destabilize the already difficult task of training RL agents. In this work, we take inspiration from recent advances in computational neuroscience and propose a model, Associative Latent DisentAnglement (ALDA), that builds on standard off-policy RL towards zero-shot generalization. Specifically, we revisit the role of latent disentanglement in RL and show how combining it with a model of associative memory achieves zero-shot generalization on difficult task variations without relying on data augmentation. Finally, we formally show that data augmentation techniques are a form of weak disentanglement and discuss the implications of this insight.
Paperid:910
Authors:Zuchao Li · Yonghua Hei · Qiwei Li · Lefei Zhang · Ping Wang · hai zhao · qi baoyuan · Liu Guoming
Abstract:
Large Language Models excel in generation tasks. Recent studies show that causal attention mechanisms limit their performance on embedding tasks and that bidirectional modeling may enhance it. However, merely fine-tuning unidirectional models bidirectionally leads to a sharp decline in generative performance. To investigate the reasons for this performance drop, we treat attention weights as dependence and observe that after bidirectional fine-tuning, the model's dependence on subsequent tokens increases, impairing its unidirectional generative capabilities. We tested various modules of the Transformer architecture, and the results show that the FNN layer is least affected by subsequent dependence under bidirectional fine-tuning. Based on this finding, we propose UBMoE-LLM, a novel uni-bi-directional mixture-of-experts Large Language Model, to solve this problem. It combines the original FNN layer of the unidirectional model with a bidirectional FNN layer trained by unsupervised contrastive learning through a mixture-of-experts design, which improves the embedding performance of instruction-tuned models while maintaining robust generative performance. We conducted experiments on diverse datasets and on models of different scales and types. The experimental results verify the soundness of our proposed attention dependence indicator and demonstrate UBMoE-LLM's excellent generative ability and resistance to hallucination.
Paperid:911
Authors:Yoann Boget
Abstract:
Discrete Diffusion and Flow Matching models have significantly advanced generative modeling for discrete structures, including graphs. However, the time dependencies in the noising process of these models lead to error accumulation and propagation during the backward process. This issue, particularly pronounced in mask diffusion, is a known limitation in sequence modeling and, as we demonstrate, also impacts discrete diffusion models for graphs. To address this problem, we propose a novel framework called Iterative Denoising, which simplifies discrete diffusion and circumvents the issue by assuming conditional independence across time. Additionally, we enhance our model by incorporating a Critic, which during generation selectively retains or corrupts elements in an instance based on their likelihood under the data distribution. Our empirical evaluations demonstrate that the proposed method significantly outperforms existing discrete diffusion baselines in graph generation tasks.
Paperid:912
Authors:Sheyang Tang · xiaoyu xu · Jiayan Qiu · Zhou Wang
Abstract:
Implicit Neural Representations (INRs) represent data as continuous functions using the parameters of a neural network, where data information is encoded in the parameter space. Therefore, modeling the distribution of such parameters is crucial for building generalizable INRs. Existing approaches learn a joint distribution of these parameters via a latent vector to generate new data, but such a flat latent often fails to capture the inherent hierarchical structure of the parameter space, leading to entangled data semantics and limited control over the generation process. Here, we propose a Controllable Hierarchical Implicit Neural Representation (CHINR) framework, which explicitly models conditional dependencies across layers in the parameter space. Our method consists of two stages: In Stage-1, we construct a Layers-of-Experts (LoE) network, where each layer modulates distinct semantics through a unique latent vector, enabling disentangled and expressive representations. In Stage-2, we introduce a Hierarchical Conditional Diffusion Model (HCDM) to capture conditional dependencies across layers, allowing for controllable and hierarchical data generation at various semantic granularities. Extensive experiments across different modalities demonstrate that CHINR improves generalizability and offers flexible hierarchical control over the generated content.
Paperid:913
Authors:Ziqiao Wang · Cheng Long · Yongyi Mao
Abstract:
Federated Learning (FL) is a widely adopted privacy-preserving distributed learning framework, yet its generalization performance remains less explored compared to centralized learning. In FL, the generalization error consists of two components: the out-of-sample gap, which measures the gap between the empirical and true risk for participating clients, and the participation gap, which quantifies the risk difference between participating and non-participating clients. In this work, we apply an information-theoretic analysis via the conditional mutual information (CMI) framework to study FL's two-level generalization. Beyond the traditional supersample-based CMI framework, we introduce a superclient construction to accommodate the two-level generalization setting in FL. We derive multiple CMI-based bounds, including hypothesis-based CMI bounds, illustrating how privacy constraints in FL can imply generalization guarantees. Furthermore, we propose fast-rate evaluated CMI bounds that recover the best-known convergence rate for two-level FL generalization in the small empirical risk regime. For specific FL model aggregation strategies and structured loss functions, we refine our bounds to achieve improved convergence rates with respect to the number of participating clients. Empirical evaluations confirm that our evaluated CMI bounds are non-vacuous and accurately capture the generalization behavior of FL algorithms.
Paperid:914
Authors:Rickard K.A. Karlsson · Jesse H. Krijthe
Abstract:
A major challenge in estimating treatment effects in observational studies is the reliance on untestable conditions such as the assumption of no unmeasured confounding. In this work, we propose an algorithm that can falsify the assumption of no unmeasured confounding in a setting with observational data from multiple heterogeneous sources, which we refer to as environments. Our proposed falsification strategy leverages a key observation that unmeasured confounding can cause observed causal mechanisms to appear dependent. Building on this observation, we develop a novel two-stage procedure that detects these dependencies with high statistical power while controlling false positives. The algorithm does not require access to randomized data and, in contrast to other falsification approaches, functions even under transportability violations when the environment has a direct effect on the outcome of interest. To showcase the practical relevance of our approach, we show that our method is able to efficiently detect confounding on both simulated and real-world data.
Paperid:915
Authors:Ronak Mehta · Zaid Harchaoui
Abstract:
A clever, modern approach to machine learning and AI takes a peculiar yet effective learning path involving two stages: from an upstream pretraining task using unlabeled multimodal data (foundation modeling), to a downstream task using prompting in natural language as a replacement for training data (zero-shot prediction). We cast this approach in a theoretical framework that allows us to identify the key quantities driving both its success and its pitfalls. We obtain risk bounds identifying the residual dependence lost between modalities, the number and nature of prompts necessary for zero-shot prediction, and the discrepancy of this approach with classical single-stage machine learning.
Paperid:916
Authors:Mozhi Zhang · Howe Tissue · Lu Wang · Xipeng Qiu
Abstract:
In this paper, we introduce *Domain2Vec*, a novel approach that decomposes any dataset into a linear combination of several "Meta-Domains", a new concept designed to capture the key underlying features of datasets. *Domain2Vec* maintains a vocabulary of Meta-Domains and uses a Meta-Domain Classifier to decompose any given dataset into a domain vector that corresponds to a distribution over this vocabulary. These domain vectors enable the identification of optimal data mixture ratios for LM pretraining in a training-free manner under the ***D**istribution **A**lignment **A**ssumption* (DA$^{2}$), which suggests that when the data distributions of the training set and the validation set are more aligned, a lower validation loss is achieved. Moreover, previous work can use *Domain2Vec* to model the relationship between domain vectors and LM performance, greatly enhancing the scalability of previous methods without retraining as new datasets are introduced. Extensive experiments demonstrate that *Domain2Vec* finds data mixture ratios that enhance downstream task performance with minimal computational overhead. Specifically, *Domain2Vec* achieves the same validation loss on Pile-CC using only $51.5$\% of the compute required when training on the original mixture of The Pile dataset. Under an equivalent compute budget, *Domain2Vec* improves downstream performance by an average of $2.83$\%.
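The Distribution Alignment Assumption lends itself to a small illustration: if each candidate dataset has a domain vector (a distribution over Meta-Domains), mixture weights can be chosen so that the weighted combination matches the validation set's vector. The sketch below is a hypothetical toy version (random search over the simplex, L1 distance); the paper's actual procedure is not specified in the abstract.

```python
import numpy as np

# Toy sketch of the Distribution Alignment idea (DA^2): choose mixture weights w
# over candidate datasets so that the mixed domain vector is close to the
# validation set's domain vector. Random simplex search and L1 distance are
# illustrative choices here, not Domain2Vec's actual algorithm.
def mixture_by_alignment(dataset_vecs, val_vec, n_trials=200, seed=0):
    rng = np.random.default_rng(seed)
    best_w, best_d = None, np.inf
    for _ in range(n_trials):
        w = rng.dirichlet(np.ones(len(dataset_vecs)))  # random point on the simplex
        mixed = w @ dataset_vecs                       # linear combination of domain vectors
        d = np.abs(mixed - val_vec).sum()              # L1 misalignment with validation
        if d < best_d:
            best_w, best_d = w, d
    return best_w, best_d

# Two hypothetical datasets concentrated on different Meta-Domains.
vecs = np.array([[0.9, 0.1], [0.1, 0.9]])
w, d = mixture_by_alignment(vecs, np.array([0.3, 0.7]))
```

Under the assumption above, the returned weights approximately reproduce the validation distribution; a real system would replace the random search with a proper optimizer.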
Paperid:917
Authors:Yu Chen · Nathalia Céspedes · Payam Barnaghi
Abstract:
Time-series forecasting is crucial across various domains, including finance, healthcare, and energy. Transformer models, originally developed for natural language processing, have demonstrated significant potential in addressing challenges associated with time-series data. These models utilize different tokenization strategies, point-wise, patch-wise, and variate-wise, to represent time-series data, each resulting in a different scope of attention. Despite the emergence of sophisticated architectures, simpler transformers consistently outperform their more complex counterparts in widely used benchmarks. This study examines why point-wise transformers are generally less effective, why intra- and inter-variate attention mechanisms yield similar outcomes, and which architectural components drive the success of simpler models. By analyzing mutual information and evaluating models on synthetic datasets, we demonstrate that intra-variate dependencies are the primary contributors to prediction performance on benchmarks, while inter-variate dependencies have a minor impact. Additionally, techniques such as Z-score normalization and skip connections are also crucial. However, these results are largely influenced by the self-dependent and stationary nature of benchmark datasets. By validating our findings on real-world healthcare data, we provide insights for designing more effective transformers for practical applications.
Paperid:918
Authors:Boyan Li · Jiayi Zhang · Ju Fan · Yanwei XU · Chong Chen · Nan Tang · Yuyu Luo
Abstract:
Text-to-SQL, which enables natural language interaction with databases, serves as a pivotal method across diverse industries. With new, more powerful large language models (LLMs) emerging every few months, fine-tuning has become incredibly costly, labor-intensive, and error-prone. As an alternative, zero-shot Text-to-SQL, which leverages the growing knowledge and reasoning capabilities encoded in LLMs without task-specific fine-tuning, presents a promising and more challenging direction. To address this challenge, we propose Alpha-SQL, a novel approach that leverages a Monte Carlo Tree Search (MCTS) framework to iteratively infer SQL construction actions based on partial SQL query states. To enhance the framework’s reasoning capabilities, we introduce LLM-as-Action-Model to dynamically generate SQL construction actions during the MCTS process, steering the search toward more promising SQL queries. Moreover, Alpha-SQL employs a self-supervised reward function to evaluate the quality of candidate SQL queries, ensuring more accurate and efficient query generation. Experimental results show that Alpha-SQL achieves 69.7% execution accuracy on the BIRD development set, using a 32B open-source LLM without fine-tuning. Alpha-SQL outperforms the best previous zero-shot approach based on GPT-4o by 2.5% on the BIRD development set.
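For readers unfamiliar with MCTS, the selection step typically uses the UCB1 rule to trade off exploiting high-reward actions against exploring rarely tried ones. The sketch below is a textbook UCT selection function for context, not Alpha-SQL's actual action model or reward function; the action names are illustrative.

```python
import math

# Generic UCT (UCB1) selection used in the MCTS selection phase: prefer actions
# with high average reward, plus an exploration bonus for rarely visited ones.
# A textbook sketch, not Alpha-SQL's implementation.
def uct_select(children, c=1.4):
    total = sum(ch["visits"] for ch in children)
    def score(ch):
        if ch["visits"] == 0:
            return float("inf")                # always expand unvisited actions first
        exploit = ch["value"] / ch["visits"]   # mean reward of this construction action
        explore = c * math.sqrt(math.log(total) / ch["visits"])
        return exploit + explore
    return max(children, key=score)

# Hypothetical partial-SQL construction actions at one tree node.
children = [
    {"action": "ADD SELECT clause", "visits": 10, "value": 7.0},
    {"action": "ADD WHERE clause",  "visits": 2,  "value": 1.9},
    {"action": "ADD JOIN clause",   "visits": 0,  "value": 0.0},
]
best = uct_select(children)  # the unvisited action is chosen first
```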
Paperid:919
Authors:Jessica Dai · Nika Haghtalab · Eric Zhao
Abstract:
A canonical desideratum for prediction problems is that performance guarantees should hold not just on average over the population, but also for meaningful subpopulations within the overall population. But what constitutes a meaningful subpopulation? In this work, we take the perspective that relevant subpopulations should be defined with respect to the clusters that naturally emerge from the distribution of individuals for which predictions are being made. In this view, a population refers to a mixture model whose components constitute the relevant subpopulations. We suggest two formalisms for capturing per-subgroup guarantees: first, by attributing each individual to the component from which they were most likely drawn, given their features; and second, by attributing each individual to all components in proportion to their relative likelihood of having been drawn from each component. Using online calibration as a case study, we study a multi-objective algorithm that provides guarantees for each of these formalisms by handling all plausible underlying subpopulation structures simultaneously, and achieves an $O(T^{1/2})$ rate even when the subpopulations are not well-separated. In comparison, the more natural cluster-then-predict approach, which first recovers the structure of the subpopulations and then makes predictions, suffers from an $O(T^{2/3})$ rate and requires the subpopulations to be separable. Along the way, we prove that providing per-subgroup calibration guarantees for underlying clusters can be easier than learning the clusters: separation between median subgroup features is required for the latter but not the former.
Paperid:920
Authors:Shiyuan Feng · Ying Feng · George Li · Zhao Song · David Woodruff · Lichen Zhang
Abstract:
Recently, differential privacy has been used for a number of streaming, data structure, and dynamic graph problems as a means of hiding the internal randomness of the data structure, so that multiple possibly adaptive queries can be made without sacrificing the correctness of the responses. Although these works use differential privacy to show that for some problems it is possible to tolerate $T$ queries using $\widetilde{O}(\sqrt{T})$ copies of a data structure, such results only apply to {\it numerical estimation problems}, and only return the {\it cost} of an optimization problem rather than the solution itself. In this paper we investigate the use of differential privacy for adaptive queries to {\it search} problems, which are significantly more challenging since the responses to queries can reveal much more about the internal randomness than a single numerical query. We focus on two classical search problems: nearest neighbor queries and regression with arbitrary turnstile updates. We identify key parameters of these problems, such as the number of $c$-approximate near neighbors and the matrix condition number, and use different differential privacy techniques to design algorithms returning the solution point or solution vector with memory and time depending on these parameters. We give algorithms for each of these problems that achieve similar tradeoffs.
Paperid:921
Authors:Gauthier Thurin · Kimia Nadjahi · Claire Boyer
Abstract:
Conformal Prediction (CP) is a principled framework for quantifying uncertainty in black-box learning models by constructing prediction sets with finite-sample coverage guarantees. Traditional approaches rely on scalar nonconformity scores, which fail to fully exploit the geometric structure of multivariate outputs, such as in multi-output regression or multiclass classification. Recent methods addressing this limitation impose predefined convex shapes for the prediction sets, potentially misaligning with the intrinsic data geometry. We introduce a novel CP procedure handling multivariate score functions through the lens of optimal transport. Specifically, we leverage Monge-Kantorovich vector ranks and quantiles to construct prediction regions with flexible, potentially non-convex shapes, better suited to the complex uncertainty patterns encountered in multivariate learning tasks. We prove that our approach ensures finite-sample, distribution-free coverage properties, similar to typical CP methods. We then adapt our method for multi-output regression and multiclass classification, and also propose simple adjustments to generate adaptive prediction regions with asymptotic conditional coverage guarantees. Finally, we evaluate our method on practical regression and classification problems, illustrating its advantages in terms of (conditional) coverage and efficiency.
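For context, the scalar-score baseline that this work generalizes is standard split conformal prediction: compute nonconformity scores on a held-out calibration set and take a finite-sample-corrected quantile as the prediction-set radius. A minimal sketch of that classical baseline (the paper's multivariate optimal-transport construction is a different, richer object):

```python
import numpy as np

# Standard split conformal prediction with a scalar nonconformity score
# (here, the absolute residual). This is the classical baseline the abstract
# refers to, not the paper's optimal-transport method.
def conformal_radius(cal_scores, alpha=0.1):
    n = len(cal_scores)
    # Finite-sample correction: the ceil((n+1)(1-alpha))/n empirical quantile.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(cal_scores, level, method="higher")

rng = np.random.default_rng(0)
cal = np.abs(rng.normal(size=1000))   # |y - y_hat| on held-out calibration data
test = np.abs(rng.normal(size=1000))  # scores on exchangeable test points
q_hat = conformal_radius(cal, alpha=0.1)
coverage = np.mean(test <= q_hat)     # empirical coverage, close to 1 - alpha
```

The marginal guarantee says prediction intervals of radius `q_hat` cover new exchangeable points with probability at least 1 - alpha; the abstract's contribution replaces the scalar quantile with Monge-Kantorovich vector quantiles.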
Paperid:922
Authors:Zhenting Wang · Chen Chen · Vikash Sehwag · Minzhou Pan · Lingjuan Lyu
Abstract:
The popularity of visual generative AI models like DALL-E 3, Stable Diffusion XL, Stable Video Diffusion, and Sora has been increasing. Through extensive evaluation, we discovered that state-of-the-art visual generative models can generate content bearing a striking resemblance to characters protected by intellectual property rights held by major entertainment companies (such as Sony, Marvel, and Nintendo), which raises potential legal concerns. This happens when the input prompt contains the character's name or even just descriptive details about their characteristics. To mitigate such IP infringement problems, we propose a defense method. In detail, we develop a revised generation paradigm that can identify potentially infringing generated content and mitigate such infringement by employing guidance techniques throughout the diffusion process, without retraining or fine-tuning the pretrained models. Experiments on well-known character IPs like Spider-Man, Iron Man, and Superman demonstrate the effectiveness of the proposed defense method.
Paperid:923
Authors:Martin Andrews · Sam Witteveen
Abstract:
Cryptic crossword clues are challenging language tasks for which new test sets are released daily by major newspapers on a global basis. Each cryptic clue contains both the definition of the answer to be placed in the crossword grid (in common with regular crosswords), and ‘wordplay’ that proves that the answer is correct (i.e. a human solver can be confident that an answer is correct without needing crossing words as confirmation). This work describes an LLM-based reasoning system built from open-licensed components that solves cryptic clues by (i) hypothesising answers; (ii) proposing wordplay explanations; and (iii) using a verifier system that operates on codified reasoning steps. Overall, this system establishes a new state-of-the-art performance on the challenging Cryptonite dataset of clues from The Times and The Telegraph newspapers in the UK. Because each proved solution is expressed in Python, interpretable wordplay reasoning for proven answers is available for inspection.
Paperid:924
Authors:Anqi Tang · Youming Chen · Shuchen Xue · Zhaoqiang Liu
Abstract:
Diffusion models (DMs) have demonstrated a remarkable ability to generate diverse and high-quality images by efficiently modeling complex data distributions. They have also been explored as powerful generative priors for signal recovery, resulting in a substantial improvement in the quality of reconstructed signals. However, existing research on signal recovery with diffusion models either focuses on specific reconstruction problems or is unable to handle nonlinear measurement models with discontinuous or unknown link functions. In this work, we focus on using DMs to achieve accurate recovery from semi-parametric single index models, which encompass a variety of popular nonlinear models that may have {\em discontinuous} and {\em unknown} link functions. We propose an efficient reconstruction method that only requires one round of unconditional sampling and (partial) inversion of DMs. Theoretical analysis on the effectiveness of the proposed method has been established under appropriate conditions. We perform numerical experiments on image datasets for different nonlinear measurement models. We observe that, compared to competing methods, our approach can yield more accurate reconstructions while utilizing significantly fewer neural function evaluations.
Paperid:925
Authors:Zhuoran Zhang · Yongxiang Li · Zijian Kan · Keyuan Cheng · Lijie Hu · Di Wang
Abstract:
The locate-then-edit paradigm has shown significant promise for knowledge editing (KE) in Large Language Models (LLMs). While previous methods perform well on single-hop fact recall tasks, they consistently struggle with multi-hop factual recall tasks involving newly edited knowledge. In this paper, leveraging tools in mechanistic interpretability, we first identify that in multi-hop tasks, LLMs tend to retrieve knowledge with implicit subject information from deeper MLP layers, unlike single-hop tasks, which rely on shallow layers. This distinction explains the poor performance of current methods on multi-hop queries, as they primarily focus on editing shallow layers with single-hop edit prompts, leaving deeper layers unchanged. To address this, we propose IFMET, a novel locate-then-edit KE approach designed to edit both shallow and deep MLP layers. Beyond single-hop editing prompts, IFMET further incorporates multi-hop editing prompts to locate and modify knowledge across different stages of reasoning. Experimental results demonstrate that IFMET significantly improves performance on multi-hop factual recall tasks, overcoming the limitations of previous locate-then-edit methods.
Paperid:926
Authors:Undral Byambadalai · Tomu Hirata · Tatsushi Oka · Shota Yasui
Abstract:
This paper focuses on the estimation of distributional treatment effects in randomized experiments that use covariate-adaptive randomization (CAR). These include designs such as Efron's biased-coin design and stratified block randomization, where participants are first grouped into strata based on baseline covariates and assigned treatments within each stratum to ensure balance across groups. In practice, datasets often contain additional covariates beyond the strata indicators. We propose a flexible distribution regression framework that leverages off-the-shelf machine learning methods to incorporate these additional covariates, enhancing the precision of distributional treatment effect estimates. We establish the asymptotic distribution of the proposed estimator and introduce a valid inference procedure. Furthermore, we derive the semiparametric efficiency bound for distributional treatment effects under CAR and demonstrate that our regression-adjusted estimator attains this bound. Simulation studies and analyses of real-world datasets highlight the practical advantages of our method.
Paperid:927
Authors:Shiqi Chen · Tongyao Zhu · Ruochen Zhou · Jinghan Zhang · Siyang Gao · Juan Carlos Niebles · Mor Geva · Junxian He · Jiajun Wu · Manling Li
Abstract:
Large Vision Language Models (VLMs) have long struggled with spatial reasoning tasks. Surprisingly, even simple spatial reasoning tasks, such as recognizing “under” or “behind” relationships between only two objects, pose significant challenges for current VLMs. We believe it is crucial to use the lens of mechanistic interpretability, opening up the model and diving into its internal states to examine the interactions between image and text tokens during spatial reasoning. Our analysis of attention behaviors reveals significant differences in how VLMs allocate attention to image versus text. By tracing the areas of images that receive the highest attention scores throughout intermediate layers, we observe a notable pattern: errors often coincide with attention being misdirected towards irrelevant objects within the image. Moreover, such attention patterns exhibit substantial differences between familiar (e.g., “on the left side of”) and unfamiliar (e.g., “in front of”) spatial relationships. Motivated by these findings, we propose ADAPTVIS, which uses inference-time confidence scores to sharpen the attention on highly relevant regions when the model exhibits high confidence, while smoothing and broadening the attention window to consider a wider context when confidence is lower. This training-free decoding method shows significant improvement (e.g., up to a 50-point absolute improvement) on spatial reasoning benchmarks such as WhatsUp and VSR with negligible additional cost.
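The sharpening/smoothing mechanism can be pictured as confidence-conditioned temperature scaling of the attention softmax. The sketch below illustrates that general idea only; the threshold and temperature values are made-up, not ADAPTVIS's actual schedule.

```python
import numpy as np

# Confidence-conditioned attention temperature: a low temperature sharpens the
# softmax (concentrating attention on top-scoring regions) when the model is
# confident; a high temperature flattens it (widening the attended context)
# when it is not. All constants here are illustrative assumptions.
def scaled_attention(scores, confidence, threshold=0.5, t_sharp=0.5, t_smooth=2.0):
    t = t_sharp if confidence >= threshold else t_smooth
    z = scores / t
    z = z - z.max()          # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()       # normalized attention weights

scores = np.array([2.0, 1.0, 0.5])                 # raw attention logits for 3 regions
sharp = scaled_attention(scores, confidence=0.9)   # confident -> peaked distribution
smooth = scaled_attention(scores, confidence=0.1)  # unsure -> spread-out distribution
```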
Paperid:928
Authors:Tung Luu Minh · Younghwan Lee · Donghoon Lee · Sunho Kim · MinJun Kim · Chang Yoo
Abstract:
Designing effective reward functions remains a fundamental challenge in reinforcement learning (RL), as it often requires extensive human effort and domain expertise. While RL from human feedback has been successful in aligning agents with human intent, acquiring high-quality feedback is costly and labor-intensive, limiting its scalability. Recent advancements in foundation models present a promising alternative: leveraging AI-generated feedback to reduce reliance on human supervision in reward learning. Building on this paradigm, we introduce ERL-VLM, an enhanced rating-based RL method that effectively learns reward functions from AI feedback. Unlike prior methods that rely on pairwise comparisons, ERL-VLM queries large vision-language models (VLMs) for absolute ratings of individual trajectories, enabling more expressive feedback and improved sample efficiency. Additionally, we propose key enhancements to rating-based RL, addressing instability issues caused by data imbalance and noisy labels. Through extensive experiments across both low-level and high-level control tasks, we demonstrate that ERL-VLM significantly outperforms existing VLM-based reward generation methods. Our results demonstrate the potential of AI feedback for scaling RL with minimal human intervention, paving the way for more autonomous and efficient reward learning.
Paperid:929
Authors:Atefeh Sohrabizadeh · Jialin Song · Mingjie Liu · Rajarshi Roy · Chankyu Lee · Jonathan Raiman · Bryan Catanzaro
Abstract:
Large Language Models (LLMs) have demonstrated significant potential in code generation by following natural language instructions. Unfortunately, crucial real-world software engineering tasks, such as debugging or repository-level feature implementation, involve processing extensive contexts beyond current LLM context sizes and performing complex logic reasoning that is brittle under standard autoregressive decoding. Enhancing LLMs' performance in these scenarios requires careful consideration of the contextual information provided to the model, optimizing how the model leverages this information, and identifying tools that enable more effective navigation of the development environment. To address these challenges, we introduce CORTEXA, an agentic system designed to enhance LLMs' ability to navigate and reason in complex software engineering contexts. Specifically, we develop a novel code embedding model that retrieves the most relevant files with greater precision, along with a localization agent that refines the granularity of the retrieval process. Additionally, we demonstrate that providing diverse contextual information and utilizing different prompt formats enables the model to identify and resolve issues more efficiently. We evaluate CORTEXA using the SWE-bench dataset \cite{swebench}, a benchmark derived from real-world GitHub issues. Compared to the widely used Agentless framework \cite{xia2024agentless}, CORTEXA achieves a higher issue resolution rate at a lower cost, highlighting its practical impact in addressing real-world software engineering challenges.
Paperid:930
Authors:Mohammad Hosseini · Maryam Shanechi
Abstract:
High-dimensional imaging of neural activity, such as widefield calcium and functional ultrasound imaging, provides a rich source of information for understanding the relationship between brain activity and behavior. Accurately modeling neural dynamics in these modalities is crucial for understanding this relationship but is hindered by the high dimensionality, complex spatiotemporal dependencies, and prevalent behaviorally irrelevant dynamics of these modalities. Existing dynamical models often employ preprocessing steps to obtain low-dimensional representations from neural image modalities. However, this process can discard behaviorally relevant information and miss spatiotemporal structure. We propose SBIND, a novel data-driven deep learning framework to model spatiotemporal dependencies in neural images and disentangle their behaviorally relevant dynamics from other neural dynamics. We validate SBIND on widefield imaging datasets, and show its extension to functional ultrasound imaging, a recent modality whose dynamical modeling has largely remained unexplored. We find that our model effectively identifies both local and long-range spatial dependencies across the brain while also dissociating behaviorally relevant neural dynamics. In doing so, SBIND outperforms existing models in neural-behavioral prediction. Overall, SBIND provides a versatile tool for investigating the neural mechanisms underlying behavior using imaging modalities.
Paperid:931
Authors:Kaito Ariu · Alexandre Proutiere · Se-Young Yun
Abstract:
In this paper, we investigate the problem of recovering hidden communities in the Labeled Stochastic Block Model (LSBM) with a finite number of clusters whose sizes grow linearly with the total number of nodes. We derive the necessary and sufficient conditions under which the expected number of misclassified nodes is less than $ s $, for any number $ s = o(n) $. To achieve this, we propose IAC (Instance-Adaptive Clustering), the first algorithm whose performance matches the instance-specific lower bounds both in expectation and with high probability. IAC is a novel two-phase algorithm that consists of a one-shot spectral clustering step followed by iterative likelihood-based cluster assignment improvements. This approach is based on the instance-specific lower bound and notably does not require any knowledge of the model parameters, including the number of clusters. By performing the spectral clustering only once, IAC maintains an overall computational complexity of $ \mathcal{O}(n\, \text{polylog}(n)) $, making it scalable and practical for large-scale problems.
Paperid:932
Authors:Jiacheng Du · Zhibo Wang · Jie Zhang · Xiaoyi Pang · Jiahui Hu · Kui Ren
Abstract:
Language Models (LMs) are prone to ''memorizing'' training data, including substantial sensitive user information. To mitigate privacy risks and safeguard the right to be forgotten, machine unlearning has emerged as a promising approach for enabling LMs to efficiently ''forget'' specific texts. However, despite the good intentions, is textual unlearning really as effective and reliable as expected? To address the concern, we first propose Unlearning Likelihood Ratio Attack+ (ULiRA+), a rigorous textual unlearning auditing method, and find that unlearned texts can still be detected with very high confidence after unlearning. Further, we conduct an in-depth investigation on the privacy risks of textual unlearning mechanisms in deployment and present the Textual Unlearning Leakage Attack (TULA), along with its variants in both black- and white-box scenarios. We show that textual unlearning mechanisms could instead reveal more about the unlearned texts, exposing them to significant membership inference and data reconstruction risks. Our findings highlight that existing textual unlearning actually gives a false sense of unlearning, underscoring the need for more robust and secure unlearning mechanisms.
Paperid:933
Authors:kejia chen · Jiawen Zhang · Jiacong Hu · Yu Wang · Mingli Song · Zunlei Feng · Jian Lou
Abstract:
Quantized large language models (LLMs) have gained increasing attention and significance for enabling deployment in resource-constrained environments. However, emerging studies on a few calibration-dataset-free quantization methods suggest that quantization may compromise the safety capabilities of LLMs, underscoring the urgent need for systematic safety evaluations and effective mitigation strategies. In this paper, we present comprehensive safety evaluations across various mainstream quantization techniques and diverse calibration datasets, utilizing widely accepted safety benchmarks. To address the identified safety vulnerabilities, we propose a quantization-aware safety patching framework, Q-resafe, to efficiently restore the safety capabilities of quantized LLMs while minimizing any adverse impact on utility. Extensive experiment results demonstrate that Q-resafe successfully re-aligns the safety of quantized LLMs with their pre-quantization counterparts, even under challenging evaluation scenarios. Project page: https://anonymous.4open.science/r/Qresafe-code.
Paperid:934
Authors:Guan Huang · Tao Shu
Abstract:
Personalized Federated Learning (PFL) has become a promising learning paradigm, enabling the training of high-quality personalized models through multiple communication rounds between clients and a central server. However, directly applying traditional PFL in real-world environments where communication is expensive, limited, or infeasible is challenging, as seen in Low Earth Orbit (LEO) satellite constellations, which face severe communication constraints due to their high mobility and limited contact windows. To address these issues, we introduce Federated Oriented Learning (FOL), a novel multi-stage one-shot PFL algorithm designed to enhance local model performance by leveraging neighboring models within stringent communication constraints. FOL comprises model pretraining, model collection, fine-tuning, pruning, post fine-tuning, ensemble refinement, and knowledge distillation stages. We establish two theoretical guarantees, on the empirical risk discrepancy between student and teacher models and on the convergence of the distillation process. Extensive experiments on the Wildfire, Hurricane, CIFAR-10, and CIFAR-100 datasets demonstrate that FOL consistently outperforms state-of-the-art OFL methods, e.g., achieving accuracy improvements of up to 39.24\% on the Wildfire dataset.
Paperid:935
Authors:Ning Shang · Li Lyna Zhang · Siyuan Wang · Gaokai Zhang · Gilsinia Lopez · Fan Yang · Weizhu Chen · Mao Yang
Abstract:
LongRoPE2 is a novel approach that extends the effective context window of pretrained large language models (LLMs) to the target length, while preserving the performance on the original shorter context window. This is achieved by three contributions: (1) a hypothesis that insufficient training in higher RoPE dimensions contributes to the persistent out-of-distribution (OOD) issues observed in existing methods; (2) an effective RoPE rescaling algorithm that adopts evolutionary search guided by "needle-driven" perplexity to address the insufficient training problem; (3) a mixed context window training approach that fine-tunes model weights to adopt rescaled RoPE for long-context sequences while preserving the short-context performance with the original RoPE. Extensive experiments on LLaMA3-8B and Phi3-mini-3.8B across various benchmarks validate the hypothesis and demonstrate the effectiveness of LongRoPE2. Remarkably, LongRoPE2 extends LLaMA3-8B to achieve a 128K effective context length while retaining over 98.5% of short-context performance, using only 10B tokens -- 80x fewer than Meta's approach, which fails to reach the target effective context length.
Paperid:936
Authors:Fan Li · Xuan Wang · Min Qi · Zhaoxiang Zhang · yuelei xu
Abstract:
Domain Generalized Semantic Segmentation (DGSS) trains a model on a labeled source domain to generalize to unseen target domains with consistent contextual distribution and varying visual appearance. Most existing methods rely on domain randomization or data generation but struggle to capture the underlying scene distribution, resulting in the loss of useful semantic information. Inspired by the diffusion model's capability to generate diverse variations within a given scene context, we consider harnessing its rich prior knowledge of scene distribution to tackle the challenging DGSS task. In this paper, we propose a novel agent \textbf{Query}-driven learning framework based on \textbf{Diff}usion model guidance for DGSS, named QueryDiff. Our recipe comprises three key ingredients: (1) generating agent queries from segmentation features to aggregate semantic information about instances within the scene; (2) learning the inherent semantic distribution of the scene through agent queries guided by diffusion features; (3) refining segmentation features using optimized agent queries for robust mask predictions. Extensive experiments across various settings demonstrate that our method significantly outperforms previous state-of-the-art methods. Notably, it enhances the model's ability to generalize effectively to extreme domains, such as cubist art styles.
Paperid:937
Authors:Keyu Duan · Yiran Zhao · Zhili Feng · Jinjie Ni · Tianyu Pang · Qian Liu · Tianle Cai · Longxu Dou · Kenji Kawaguchi · Anirudh Goyal · Zico Kolter · Michael Shieh
Abstract:
Large Language Models (LLMs) have been observed to process non-human-readable text sequences, such as jailbreak prompts, often viewed as a bug for aligned LLMs. In this work, we present a systematic investigation challenging this perception, demonstrating that unnatural languages - strings that appear incomprehensible to humans but maintain semantic meanings for LLMs - contain latent features usable by models. Notably, unnatural languages possess latent features that can be generalized across different models and tasks during inference. Furthermore, models fine-tuned on unnatural versions of instruction datasets perform on par with those trained on natural language, achieving win rates of 49.71 on Length-controlled AlpacaEval 2.0 on average across various base models. In addition, through comprehensive analysis, we demonstrate that LLMs process unnatural languages by filtering noise and inferring contextual meaning from filtered words.
Paperid:938
Authors:Bairu Hou · Qibin Chen · Jianyu Wang · Guoli Yin · Chong Wang · Nan Du · Ruoming Pang · Shiyu Chang · Tao Lei
Abstract:
With the rapid scaling of large language models (LLMs), structured pruning has become a widely used technique to learn efficient, smaller models from larger ones, delivering superior performance compared to training similarly sized models from scratch. In this paper, we move beyond the traditional static pruning approach of determining a fixed pruning mask for a model, and propose a dynamic approach to structured pruning. In our method, the pruning mask is input-dependent and adapts dynamically based on the information described in a user instruction. Our approach, termed "instruction-following pruning", introduces a sparse mask predictor that takes the user instruction as input and dynamically selects the most relevant model parameters for the given task. To identify and activate effective parameters, we jointly optimize the sparse mask predictor and the LLM, leveraging both instruction-following data and the pre-training corpus. Experimental results demonstrate the effectiveness of our approach on a wide range of evaluation benchmarks. For example, our 3B activated model improves over the 3B dense model by 5-8 points of absolute margin on domains such as math and coding, and rivals the performance of a 9B model.
Paperid:939
Authors:Yuan Deng · Amin Karbasi · Vahab Mirrokni · Renato Leme · Grigoris Velegkas · Song Zuo
Abstract:
We study the problem of procurement auctions, in which an auctioneer seeks to acquire services from a group of strategic sellers with private costs. The quality of the services is measured through some submodular function that is known to the auctioneer. Our goal is to design computationally efficient procurement auctions that (approximately) maximize the difference between the quality of the acquired services and the total cost of the sellers, in a way that is incentive compatible (IC) and individually rational (IR) for the sellers, and generates non-negative surplus (NAS) for the auctioneer. Our contribution is twofold: \textbf{i)} we provide an improved analysis of existing algorithms for non-positive submodular function maximization, and \textbf{ii)} we design computationally efficient frameworks that transform submodular function optimization algorithms into mechanisms that are IC and IR for the sellers, NAS for the auctioneer, and approximation-preserving. Our frameworks are general and work both in the offline setting, where the auctioneer can observe the bids and the services of all the sellers simultaneously, and in the online setting, where the sellers arrive in an adversarial order and the auctioneer has to make an irrevocable decision whether to purchase their service or not. We further investigate whether it is possible to convert state-of-the-art submodular optimization algorithms into descending auctions. We focus on the adversarial setting, meaning that the schedule of the descending prices is determined by an adversary. We show that a submodular optimization algorithm satisfying a bi-criteria $(1/2,1)$-approximation in welfare can be effectively converted to a descending auction in this setting. We further establish a connection between descending auctions and online submodular optimization.
Finally, we demonstrate the practical applications of our frameworks by instantiating them with different state-of-the-art submodular optimization algorithms and comparing their welfare performance through empirical experiments on publicly available datasets that consist of thousands of sellers.
Paperid:940
Authors:Marcin Sendera · Łukasz Struski · Kamil Książek · Kryspin Musiol · Jacek Tabor · Dawid Rymarczyk
Abstract:
While the capabilities of generative foundational models have advanced rapidly in recent years, methods to prevent harmful and unsafe behaviors remain underdeveloped. Among the pressing challenges in AI safety, machine unlearning (MU) has become increasingly critical to meet upcoming safety regulations. Most existing MU approaches focus on altering the most significant parameters of the model. However, these methods often require fine-tuning substantial portions of the model, resulting in high computational costs and training instabilities, which are typically mitigated by access to the original training dataset. In this work, we address these limitations by leveraging Singular Value Decomposition (SVD) to create a compact, low-dimensional projection that enables the selective forgetting of specific data points. We propose Singular Value Decomposition for Efficient Machine Unlearning (SEMU), a novel approach designed to optimize MU in two key aspects. First, SEMU minimizes the number of model parameters that need to be modified, effectively removing unwanted knowledge while making only minimal changes to the model's weights. Second, SEMU eliminates the dependency on the original training dataset, preserving the model's previously acquired knowledge without additional data requirements. Extensive experiments demonstrate that SEMU achieves competitive performance while significantly improving efficiency in terms of both data usage and the number of modified parameters.
Paperid:941
Authors:Olga Ovcharenko · Philip Toma · Imant Daunhawer · Julia Vogt · Sebastian Schelter · Florian Barkmann · Valentina Boeva
Abstract:
Self-supervised learning (SSL) has emerged as a powerful paradigm for extracting biologically meaningful representations from single-cell data, yet systematic guidelines for single-cell SSL applications remain lacking. Here, we present a comprehensive benchmark, scSSL-Bench, evaluating twelve SSL methods across eight datasets and three critical downstream tasks: batch correction, cell-type annotation, and missing modality prediction. We furthermore systematically assess various data augmentation strategies. Our analysis reveals task-specific trade-offs: the specialized single-cell frameworks, scVI and CLAIRE, excel at uni-modal batch correction, while general SSL methods, VICReg and SimCLR, demonstrate superior performance in the cell-typing task and in multi-modal data integration. Random masking emerges as the most effective augmentation technique across all tasks. Notably, our results indicate the need for a specialized single-cell multi-modal data integration framework. scSSL-Bench provides a standardized evaluation platform and concrete recommendations for applying SSL to single-cell analysis, advancing the convergence of deep learning and single-cell genomics.
Paperid:942
Authors:Hao Zeng · Kangdao Liu · Bingyi Jing · Hongxin Wei
Abstract:
Conformal prediction is a popular framework of uncertainty quantification that constructs prediction sets with coverage guarantees. To uphold the exchangeability assumption, many conformal prediction methods necessitate an additional holdout set for parameter tuning. Yet, the impact of violating this principle on coverage remains underexplored, creating ambiguity in practical applications. In this work, we empirically find that the tuning bias, i.e., the coverage gap introduced by leveraging the same dataset for tuning and calibration, is negligible for simple parameter tuning in many conformal prediction methods. In particular, we observe a scaling law of the tuning bias: the bias increases with parameter space complexity and decreases with calibration set size. Formally, we establish a theoretical framework to quantify the tuning bias and provide rigorous proof for the scaling law by deriving its upper bound. Finally, we discuss how to reduce the tuning bias, guided by the theories we developed.
Paperid:943
Authors:Zixing Song · Ziqiao Meng · Jose Miguel Hernandez-Lobato
Abstract:
Proteolysis-targeting chimeras (PROTACs) are a groundbreaking technology for targeted protein degradation, but designing effective linkers that connect two molecular fragments to form a drug-candidate PROTAC molecule remains a key challenge. While diffusion models show promise in molecular generation, current diffusion models for PROTAC linker design are typically trained on small molecule datasets, introducing distribution mismatches in the chemical space between small molecules and target PROTACs. Direct fine-tuning on limited PROTAC datasets often results in overfitting and poor generalization. In this work, we propose DAD-PROTAC, a domain-adapted diffusion model for PROTAC linker design, which addresses this distribution mismatch in chemical space through density ratio estimation to bridge the gap between small-molecule and PROTAC domains. By decomposing the target score estimator into a pre-trained score function and a lightweight score correction term, DAD-PROTAC achieves efficient fine-tuning without full retraining. Experimental results demonstrate its superior ability to generate high-quality PROTAC linkers.
Paperid:944
Authors:Hao Zhou · Xu Yang · Mingyu Fan · Lu Qi · Xiangtai Li · Ming-Hsuan Yang · Fei Luo
Abstract:
With the growing interest in embodied and spatial intelligence, accurately predicting trajectories in 3D environments has become increasingly critical. However, no datasets have been explicitly designed to study 3D trajectory prediction. To this end, we contribute a 3D motion trajectory (3DMoTraj) dataset collected from unmanned underwater vehicles (UUVs) operating in oceanic environments. Mathematically, trajectory prediction becomes significantly more complex when transitioning from 2D to 3D. To tackle this challenge, we analyze the prediction complexity of 3D trajectories and propose a new method consisting of two key components: decoupled trajectory prediction and correlated trajectory refinement. The former decouples inter-axis correlations, thereby reducing prediction complexity and generating coarse predictions. The latter refines the coarse predictions by modeling their inter-axis correlations. Extensive experiments show that our method significantly improves 3D trajectory prediction accuracy and outperforms state-of-the-art methods. Both the 3DMoTraj dataset and the code of our method will be made publicly available.
Paperid:945
Authors:Nikolaj Jensen · Jamie Vicary
Abstract:
In this work, we propose a new architecture-agnostic method for training idempotent neural networks. An idempotent operator satisfies $f(x) = f(f(x))$, meaning it can be applied iteratively with no effect beyond the first application. Some neural networks used in data transformation tasks, such as image generation and augmentation, can represent non-linear idempotent projections. Using methods from perturbation theory, we derive the recurrence relation ${\mathbf{K}' \leftarrow 3\mathbf{K}^2 - 2\mathbf{K}^3}$ for iteratively projecting a real-valued matrix $\mathbf{K}$ onto the manifold of idempotent matrices. Our analysis shows that for linear, single-layer MLP networks this projection 1) has idempotent fixed points, and 2) is attracting only around idempotent points. We extend the approach to non-linear networks by treating it as a substitute for the gradient of the canonical loss function, yielding an architecture-agnostic training scheme. We provide experimental results for MLP- and CNN-based architectures with significant improvement in idempotent error over the canonical gradient-based approach. Finally, we demonstrate practical applications of the method as we successfully train a generative network using only a simple reconstruction loss paired with our method.
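The recurrence quoted in this abstract can be tried directly on a small matrix; the sketch below (an illustration of the stated formula, not the authors' training code) perturbs an idempotent projector and iterates $\mathbf{K}' \leftarrow 3\mathbf{K}^2 - 2\mathbf{K}^3$ until the idempotency error vanishes:

```python
import numpy as np

# The map 3K^2 - 2K^3 keeps idempotent matrices fixed: if K @ K = K,
# then 3K^2 - 2K^3 = 3K - 2K = K. Near such points it is attracting.
def idempotent_step(K: np.ndarray) -> np.ndarray:
    K2 = K @ K
    return 3.0 * K2 - 2.0 * (K2 @ K)

# Start from a small perturbation of a projector onto the x-axis.
P = np.array([[1.0, 0.0],
              [0.0, 0.0]])
K = P + 0.05 * np.random.default_rng(0).standard_normal((2, 2))

for _ in range(20):
    K = idempotent_step(K)

idempotent_error = np.linalg.norm(K @ K - K)
print(idempotent_error)  # shrinks toward 0: K is numerically idempotent
```

On the eigenvalues of $\mathbf{K}$ the map acts as $\lambda \mapsto 3\lambda^2 - 2\lambda^3$, whose derivative vanishes at the fixed points 0 and 1, so small perturbations are pulled back rapidly.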
Paperid:946
Authors:Harry Amad · Nicolás Astorga · M van der Schaar
Abstract:
Digital twins are virtual representations of real-world systems that simulate dynamics, enabling principled decision-making across potential action policies. In complex settings, the state and action variables, and surrounding data and knowledge relevant to a system can constantly change, requiring digital twins to continuously update with these changes to remain relevant. Current approaches struggle in this regard, as they require well-defined modelling environments, and they cannot adapt to novel variables without re-designs, or incorporate new information without re-training. To address this, we frame digital twinning as an in-context learning problem using large language models, enabling seamless updates to the twin at inference time. We develop CALM-DT, a Context-Adaptive Language Model-based Digital Twin that permits accurate simulation across diverse state-action spaces with in-context learning alone by utilising fine-tuned encoders for sample retrieval. We empirically demonstrate CALM-DT's competitive performance with existing digital twin approaches, and its unique ability to adapt to changes in its modelling environment without parameter updates.
Paperid:947
Authors:Yanbei Liu · Chongxu Wang · Xiao Wang · Lei Geng · Yanwei Pang · Zhitao Xiao
Abstract:
Heterogeneous Graph Neural Networks (HGNNs) have demonstrated excellent capabilities in processing heterogeneous information networks. Self-supervised learning on heterogeneous graphs, especially the contrastive self-supervised strategy, shows great potential when no labels are available. However, this approach requires carefully designed graph augmentation strategies and the selection of positive and negative samples, and determining the exact level of similarity between sample pairs is non-trivial. To solve this problem, we propose HGOT, a novel self-supervised heterogeneous graph neural network with optimal transport, designed to facilitate self-supervised learning for heterogeneous graphs without graph augmentation strategies. Different from traditional contrastive self-supervised learning, HGOT employs an optimal transport self-supervised mechanism. Specifically, we design an aggregating view (central view) to integrate the semantic information contained in the views represented by different meta-paths (branch views). Then, we introduce an optimal transport plan to identify the transport relationship between the semantics contained in the branch views and the central view. Aligning this transport plan with the representations forces the encoder to learn node representations that better reflect the graph space and are of higher quality. Extensive experiments on four real-world datasets demonstrate that our proposed HGOT model achieves state-of-the-art performance on various downstream tasks. In particular, in the node classification task, HGOT achieves an average improvement of more than 6\% in accuracy compared with state-of-the-art methods.
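The transport plan between a branch view and the central view can be computed with standard entropic optimal transport (Sinkhorn iterations). The sketch below is a generic illustration of that building block, not the HGOT implementation; all names, sizes, and the regularization value are assumptions for the example:

```python
import numpy as np

# Entropic OT plan between two sets of node embeddings with uniform
# marginals, via Sinkhorn iterations. Illustrative only.
def sinkhorn_plan(cost, reg=0.5, n_iters=200):
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # match column marginals
        u = a / (K @ v)              # match row marginals
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
branch = rng.standard_normal((6, 8))   # branch-view embeddings, 6 nodes
central = rng.standard_normal((6, 8))  # central-view embeddings

# Squared-Euclidean cost between the two sets, normalized for stability.
cost = ((branch[:, None, :] - central[None, :, :]) ** 2).sum(-1)
cost /= cost.max()
plan = sinkhorn_plan(cost)
print(plan.sum())  # a valid coupling: total mass 1, uniform marginals
```

The resulting `plan[i, j]` quantifies how much of branch-view node `i`'s semantics is matched to central-view node `j`, which is the kind of correspondence an OT-based alignment objective would push the encoder to respect.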
Paperid:948
Authors:Wen-Bo Du · Hao-Yi Lei · Lue Tao · Tian-Zuo Wang · Zhi-Hua Zhou
Abstract:
In the field of machine learning (ML), an essential type of decision-related problem is known as AUF (Avoiding Undesired Future): if an ML model predicts an undesired outcome, how can decisions be made to prevent it? Recently, a novel framework called rehearsal learning has been proposed to address the AUF problem. Despite its utility in modeling uncertainty for decision-making, it remains unclear under what conditions and how optimal actions that maximize the AUF probability can be identified. In this paper, we propose CARE (CAnonical REctangle), a condition under which the maximum AUF probability can be achieved. Under the CARE condition, we present a projection-Newton algorithm to select actions and prove that the algorithm achieves superlinear convergence to the optimal one. Besides, we provide a generalization method for adapting the algorithm to AUF scenarios beyond the CARE condition. Finally, we demonstrate that a closed-form solution exists when the outcome is a singleton variable, substantially reducing the time complexity of decision-making. Experiments validate the effectiveness and efficiency of our method.
Paperid:949
Authors:Yunhak Oh · Junseok Lee · Yeongmin Kim · Sangwoo Seo · Namkyeong Lee · Chanyoung Park
Abstract:
Spatially Resolved Transcriptomics (SRT) is a cutting-edge technique that captures the spatial context of cells within tissues, enabling the study of complex biological networks. Recent graph-based methods leverage both gene expression and spatial information to identify relevant spatial domains. However, these approaches fall short in obtaining meaningful spot representations, especially for spots near spatial domain boundaries, as they heavily emphasize adjacent spots that have minimal feature differences from an anchor node. To address this, we propose Spotscape, a novel framework that introduces the Similarity Telescope module to capture global relationships between multiple spots. Additionally, we propose a similarity scaling strategy to regulate the distances between intra- and inter-slice spots, facilitating effective multi-slice integration. Extensive experiments demonstrate the superiority of Spotscape in various downstream tasks, including single-slice and multi-slice scenarios.
Paperid:950
Authors:Zichen Wang · Chuanhao Li · Huazheng Wang
Abstract:
We investigate the problem of identifying the optimal scoring rule within the principal-agent framework for the online information acquisition problem. We focus on the principal's perspective, seeking to determine the desired scoring rule through interactions with the agent. To address this challenge, we propose two algorithms, OIAFC and OIAFB, tailored to the fixed-confidence and fixed-budget settings, respectively. Our theoretical analysis demonstrates that OIAFC can extract the desired $(\epsilon, \delta)$-scoring rule with an efficient instance-dependent sample complexity or an instance-independent sample complexity. Our analysis also shows that OIAFB matches the instance-independent performance bound of OIAFC, while both algorithms share the same complexity across the fixed-confidence and fixed-budget settings.
Paperid:951
Authors:Shiyu Wang · Mariam Avagyan · Yihan Shen · Arnaud Lamy · Tingran Wang · Szabolcs Marka · Zsuzsanna Marka · John Wright
Abstract:
Learned denoisers play a fundamental role in various signal generation (e.g., diffusion models) and reconstruction (e.g., compressed sensing) architectures, whose success derives from their ability to leverage low-dimensional structure in data. Existing denoising methods, however, either rely on local approximations that require a linear scan of the entire dataset or treat denoising as a generic function approximation problem, often sacrificing efficiency and interpretability. We consider the problem of efficiently denoising new noisy data sampled from an unknown manifold $\mathcal M$, relying only on noisy samples. This work proposes a framework for test-time efficient manifold denoising, by framing the concept of "learning-to-denoise" as *"learning-to-optimize"*. We have two technical innovations: (i) *online learning* methods which learn to optimize over the manifold of clean signals using only noisy data, effectively "growing" an optimizer one sample at a time; (ii) *mixed-order* methods which guarantee that the learned optimizers achieve global optimality, ensuring both efficiency and near-optimal denoising performance. We corroborate these claims with theoretical analyses of both the complexity and denoising performance of mixed-order traversal. Our experiments on synthetic manifolds as well as scientific and imagery data demonstrate significantly improved complexity-performance tradeoffs compared to baseline methods.
Paperid:952
Authors:Haotian Zhai · Connor Lawless · Ellen Vitercik · Liu Leqi
Abstract:
A fundamental problem in combinatorial optimization is identifying equivalent formulations, which can lead to more efficient solution strategies and deeper insights into a problem's computational complexity. The need to automatically identify equivalence between problem formulations has grown as optimization copilots---systems that generate problem formulations from natural language descriptions---have proliferated. However, existing approaches to checking formulation equivalence lack grounding, relying on simple heuristics that are insufficient for rigorous validation. Inspired by Karp reductions, in this work we introduce quasi-Karp equivalence, a formal criterion for determining when two optimization formulations are equivalent based on the existence of a mapping between their decision variables. We propose EquivaMap, a framework that leverages large language models to automatically discover such mappings, enabling scalable and reliable equivalence verification. To evaluate our approach, we construct the first open-source dataset of equivalent optimization formulations, generated by applying transformations such as adding slack variables or valid inequalities to existing formulations. Empirically, EquivaMap significantly outperforms existing methods, achieving substantial improvements in correctly identifying formulation equivalence.
Paperid:953
Authors:Rogelio A. Mancisidor · Robert Jenssen · Shujian Yu · Michael Kampffmeyer
Abstract:
Multimodal learning with variational autoencoders (VAEs) requires estimating joint distributions to evaluate the evidence lower bound (ELBO). Current methods, the product and mixture of experts, aggregate single-modality distributions assuming independence for simplicity, which is an overoptimistic assumption. This research introduces a novel methodology for aggregating single-modality distributions by exploiting the principle of consensus of dependent experts (CoDE), which circumvents the aforementioned assumption. Utilizing the CoDE method, we propose a novel ELBO that approximates the joint likelihood of the multimodal data by learning the contribution of each subset of modalities. The resulting CoDE-VAE model demonstrates better performance in balancing the trade-off between generative coherence and generative quality, as well as generating more precise log-likelihood estimations. CoDE-VAE further minimizes the generative quality gap as the number of modalities increases. In certain cases, it reaches a generative quality similar to that of unimodal VAEs, a desirable property lacking in most current methods. Finally, the classification accuracy achieved by CoDE-VAE is comparable to that of state-of-the-art multimodal VAE models.
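The independence-assuming baseline this abstract contrasts against, the product of experts (PoE), has a simple closed form for Gaussian experts: precisions add. The sketch below illustrates that standard rule and why it is "overoptimistic" (it is not the paper's CoDE aggregation):

```python
import numpy as np

# Product of independent Gaussian experts N(mu_i, sigma_i^2):
# combined precision = sum of precisions; mean = precision-weighted mean.
def poe_gaussian(mus, sigmas):
    precisions = 1.0 / np.square(sigmas)
    var = 1.0 / precisions.sum(axis=0)
    mu = var * (precisions * mus).sum(axis=0)
    return mu, np.sqrt(var)

# Two unit-variance experts that disagree: PoE lands between them, and
# because independence is assumed the aggregate is *more* confident
# (smaller sigma) than either expert -- the overoptimism CoDE avoids.
mus = np.array([[0.0], [2.0]])
sigmas = np.array([[1.0], [1.0]])
mu, sigma = poe_gaussian(mus, sigmas)
print(mu, sigma)  # mean 1.0, std ~0.707
```

When the experts' errors are in fact correlated (as modality encoders' errors typically are), this shrinking variance overstates the evidence, which is the assumption the consensus-of-dependent-experts rule is designed to drop.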
Paperid:954
Authors:Jianyu Wang · Zhiqiang Hu · Lidong Bing
Abstract:
We propose a novel prompt design paradigm that challenges conventional wisdom in large language model (LLM) prompting. While conventional wisdom prioritizes well-crafted instructions and demonstrations for in-context learning (ICL), we show that pruning random demonstrations into seemingly incoherent "gibberish" can remarkably improve performance across diverse tasks. Notably, the "gibberish" always matches or surpasses state-of-the-art automatic prompt optimization techniques, achieving substantial gains regardless of LLM alignment. Nevertheless, discovering an effective pruning strategy is non-trivial: existing attribution methods and prompt compression algorithms fail to deliver robust results, let alone human intuition. To this end, we propose a self-discovering prompt optimization framework, PromptQuine, an evolutionary search framework that automatically searches for the pruning strategy by itself using only low-shot regimes. Much like the emergent complexity in biological systems—such as symbiosis, mutualism, and self-organization—arising in response to resource constraints, our framework evolves and refines unconventional yet highly effective prompts by leveraging only the tokens present within the context (i.e., in-context optimization). We demonstrate its effectiveness across classification, multiple-choice question answering, generation, and math reasoning tasks across LLMs, while optimizing significantly faster than existing automatic optimization methods on several tasks. We hope our findings can guide mechanistic studies of in-context learning and serve as a call to action, paving the way for more open-ended search algorithms for more effective LLM prompting.
Paperid:955
Authors:Luca M. Schulze Buschoff · Konstantinos Voudouris · Elif Akata · Matthias Bethge · Josh Tenenbaum · Eric Schulz
Abstract:
Pretrained vision language models still fall short of human visual cognition. In an effort to improve visual cognition and align models with human behavior, we introduce visual stimuli and human judgments on visual cognition tasks, allowing us to systematically evaluate performance across cognitive domains under a consistent environment. We fine-tune models on ground truth data for intuitive physics and causal reasoning and find that this improves model performance in the respective fine-tuning domain. Furthermore, it can improve model alignment with human behavior. However, we find that fine-tuning does not contribute to robust human-like generalization to data with other visual characteristics or to tasks in other cognitive domains.
Paperid:956
Authors:Jiahao Chen · Bin Qin · Jiangmeng Li · Hao Chen · Bing Su
Abstract:
Long-tailed learning has garnered increasing attention due to its practical significance. Among the various approaches, the fine-tuning paradigm has gained considerable interest with the advent of foundation models. However, most existing methods primarily focus on leveraging knowledge from these models, overlooking the inherent biases introduced by the imbalanced training data they rely on. In this paper, we examine how such imbalances from pre-training affect long-tailed downstream tasks. Specifically, we characterize the imbalance biases inherited by foundation models on downstream tasks as parameter imbalance and data imbalance. During fine-tuning, we observe that parameter imbalance plays a more critical role, while data imbalance can be mitigated using existing re-balancing strategies. Moreover, unlike data imbalance, parameter imbalance cannot be effectively addressed by current re-balancing techniques, such as adjusting the logits, during training. To tackle both imbalances simultaneously, we build our method on causal learning and view the incomplete semantic factor as a confounder, which introduces spurious correlations between input samples and labels. To resolve its negative effects, we propose a novel backdoor adjustment method that learns the true causal effect between input samples and labels, rather than merely fitting the correlations in the data. Notably, we achieve an average performance increase of about 1.67% on each dataset.
Paperid:957
Authors:Slobodan Jenko · Niels Mündler · Jingxuan He · Mark Vero · Martin Vechev
Abstract:
Modern code completion engines, powered by large language models (LLMs), assist millions of developers with their strong capabilities to generate functionally correct code. Due to this popularity, it is crucial to investigate the security implications of relying on LLM-based code completion. In this work, we demonstrate that state-of-the-art black-box LLM-based code completion engines can be stealthily biased by adversaries to significantly increase their rate of insecure code generation. We present the first attack, named INSEC, that achieves this goal. INSEC works by injecting an attack string as a short comment in the completion input. The attack string is crafted through a query-based optimization procedure starting from a set of carefully designed initialization schemes. We demonstrate INSEC's broad applicability and effectiveness by evaluating it on various state-of-the-art open-source models and black-box commercial services (e.g., OpenAI API and GitHub Copilot). On a diverse set of security-critical test cases, covering 16 CWEs across 5 programming languages, INSEC increases the rate of generated insecure code by more than 50\%, while maintaining the functional correctness of generated code. INSEC is highly practical -- it requires low resources and costs less than 10 US dollars to develop on commodity hardware. Moreover, we showcase the attack's real-world deployment, by developing an IDE plug-in that stealthily injects INSEC into the GitHub Copilot extension.
Paperid:958
Authors:Saurabh Jha · Rohan Arora · Yuji Watanabe · Takumi Yanagawa · Yinfang Chen · Jackson Clark · Bhavya Bhavya · Mudit Verma · Harshit Kumar · Hirokuni Kitahara · Noah Zheutlin · Saki Takano · Divya Pathak · Felix George · Xinbo Wu · Bekir Turkkan · Gerard Vanloo · Michael Nidd · Ting Dai · Oishik Chatterjee · Pranjal Gupta · Suranjana Samanta · Pooja Aggarwal · Rong Lee · Jae-wook Ahn · Debanjana Kar · Amit Paradkar · Yu Deng · Pratibha Moogi · Prateeti Mohapatra · Naoki Abe · Chandrasekhar Narayanaswami · Tianyin Xu · Lav Varshney · Ruchi Mahindru · Anca Sailer · Laura Shwartz · Daby Sow · Nicholas Fuller · Ruchir Puri
Abstract:
Realizing the vision of using AI agents to automate critical IT tasks depends on the ability to measure and understand the effectiveness of proposed solutions. We introduce ITBench, a framework that offers a systematic methodology for benchmarking AI agents to address real-world IT automation tasks. Our initial release targets three key areas: site reliability engineering (SRE), compliance and security operations (CISO), and financial operations (FinOps). The design enables AI researchers to understand the challenges and opportunities of AI agents for IT automation with push-button workflows and interpretable metrics. ITBench includes an initial set of 94 real-world scenarios, which can be easily extended by community contributions. Our results show that agents powered by state-of-the-art models resolve only 13.8% of SRE scenarios, 25.2% of CISO scenarios, and 0% of FinOps scenarios. We expect ITBench to be a key enabler of AI-driven IT automation that is correct, safe, and fast.
Paperid:959
Authors:Lee Cohen · Connie Hong · Jack Hsieh · Judy Hanwen Shen
Abstract:
In an era of increasingly capable foundation models, job seekers are turning to generative AI tools to enhance their application materials. However, unequal access to and knowledge about generative AI tools can harm both employers and candidates by reducing the accuracy of hiring decisions and introducing disparities between equally qualified candidates. To address these challenges, we introduce a new variant of the strategic classification framework tailored to manipulations performed using large language models, accommodating varying levels of manipulations and stochastic outcomes. We propose a ``two-ticket'' scheme, where the hiring algorithm applies an additional manipulation to each submitted resume and considers this manipulated version together with the original submitted resume. We establish theoretical guarantees for this scheme, showing improvements for both the fairness and accuracy of hiring decisions when the true positive rate is maximized subject to a no false positives constraint. Finally, we empirically validate our framework and the performance of our two-ticket scheme on real resumes using an open-source resume screening tool.
Paperid:960
Authors:Ilias Diakonikolas · Daniel Kane · Sushrut Karmalkar · Jasper C.H. Lee · Thanasis Pittas
Abstract:
We study the complexity of learning $k$-mixtures of Gaussians ($k$-GMMs) on $\mathbb R^d$. This task is known to have complexity $d^{\Omega(k)}$ in full generality. To circumvent this exponential lower bound on the number of components, research has focused on learning families of GMMs satisfying additional structural properties. A natural assumption posits that the component weights are not exponentially small and that the components have the same unknown covariance. Recent work gave a $d^{O(\log(1/w_{\min}))}$-time algorithm for this class of GMMs, where $w_{\min}$ is the minimum weight. Our first main result is a Statistical Query (SQ) lower bound showing that this quasi-polynomial upper bound is essentially best possible, even for the special case of uniform weights. Specifically, we show that it is SQ-hard to distinguish between such a mixture and the standard Gaussian. We further explore how the distribution of weights affects the complexity of this task. Our second main result is a quasi-polynomial upper bound for the aforementioned testing task when most of the weights are uniform while a small fraction of the weights are potentially arbitrary.
Paperid:961
Authors:Chinmaya Kausik · Kevin Tan · Ambuj Tewari
Abstract:
Leveraging offline data is an attractive way to accelerate online sequential decision-making. However, it is crucial to account for latent states in users or environments in the offline data, and latent bandits form a compelling model for doing so. In this light, we design end-to-end latent bandit algorithms capable of handling uncountably many latent states. We focus on a linear latent contextual bandit — a linear bandit where each user has its own high-dimensional reward parameter in $\mathbb{R}^{d_A}$, but reward parameters across users lie in a low-rank latent subspace of dimension $d_K \ll d_A$. First, we provide an offline algorithm to learn this subspace with provable guarantees. We then present two online algorithms that utilize the output of this offline algorithm to accelerate online learning. The first enjoys $\tilde O(\min(d_A\sqrt{T}, d_K\sqrt{T}(1+\sqrt{d_AT/d_KN})))$ regret guarantees, so that the effective dimension is lower when the size $N$ of the offline dataset is larger. We prove a matching lower bound on regret, showing that our algorithm is minimax optimal. The second is a practical algorithm that enjoys only a slightly weaker guarantee, but is computationally efficient. We also establish the efficacy of our methods using experiments on both synthetic data and real-life movie recommendation data from MovieLens. Finally, we theoretically establish the generality of the latent bandit model by proving a de Finetti theorem for stateless decision processes.
Paperid:962
Authors:Long Ma · Fangwei Zhong · Yizhou Wang
Abstract:
The ability to adapt to new environments with noisy dynamics and unseen objectives is crucial for AI agents. In-context reinforcement learning (ICRL) has emerged as a paradigm to build adaptive policies, employing a context trajectory of the test-time interactions to infer the true task and the corresponding optimal policy efficiently without gradient updates. However, ICRL policies heavily rely on context trajectories, making them vulnerable to distribution shifts from training to testing and degrading performance, particularly in offline settings where the training data is static. In this paper, we highlight that most existing offline ICRL methods are trained for approximate Bayesian inference based on the training distribution, rendering them vulnerable to distribution shifts at test time and resulting in poor generalization. To address this, we introduce Behavior-agnostic Task Inference (BATI) for ICRL, a model-based maximum-likelihood solution to infer the task representation robustly. In contrast to previous methods that rely on a learned encoder as the approximate posterior, BATI focuses purely on dynamics, thus insulating itself against the behavior of the context collection policy. Experiments on MuJoCo environments demonstrate that BATI effectively interprets out-of-distribution contexts and outperforms other methods, even in the presence of significant environmental noise.
Paperid:963
Authors:Artem Moskalev · Mangal Prakash · Junjie Xu · Tianyu Cui · Rui Liao · Tommaso Mansi
Abstract:
Processing global geometric context while preserving equivariance is crucial when modeling biological, chemical, and physical systems. Yet, this is challenging due to the computational demands of equivariance and global context at scale. Standard methods such as equivariant self-attention suffer from quadratic complexity, while local methods such as distance-based message passing sacrifice global information. Inspired by the recent success of state-space and long-convolutional models, we introduce Geometric Hyena, the first equivariant long-convolutional model for geometric systems. Geometric Hyena captures global geometric context at sub-quadratic complexity while maintaining equivariance to rotations and translations. Evaluated on all-atom property prediction of large RNA molecules and full protein molecular dynamics, Geometric Hyena outperforms existing equivariant models while requiring significantly less memory and compute than equivariant self-attention. Notably, our model processes the geometric context of $30k$ tokens $20 \times$ faster than the equivariant transformer and allows $72 \times$ longer context within the same budget.
Paperid:964
Authors:Kunwoong Kim · Jihu Lee · Sangchul Park · Yongdai Kim
Abstract:
Algorithmic fairness in clustering aims to balance the proportions of instances assigned to each cluster with respect to a given sensitive attribute. While recently developed fair clustering algorithms optimize clustering objectives under specific fairness constraints, their inherent complexity or approximation often results in suboptimal clustering utility or numerical instability in practice. To resolve these limitations, we propose a new fair clustering algorithm based on a novel decomposition of the fair $K$-means clustering objective function. The proposed algorithm, called Fair Clustering via Alignment (FCA), operates by alternately (i) finding a joint probability distribution to align the data from different protected groups, and (ii) optimizing cluster centers in the aligned space. A key advantage of FCA is that it theoretically guarantees approximately optimal clustering utility for any given fairness level without complex constraints, thereby enabling high-utility fair clustering in practice. Experiments show that FCA outperforms existing methods by (i) attaining a superior trade-off between fairness level and clustering utility, and (ii) achieving near-perfect fairness without numerical instability.
Paperid:965
Authors:Yanchao Tan · Hang Lv · Yunfei Zhan · Guofang Ma · Bo Xiong · Carl Yang
Abstract:
Language Models (LMs) have advanced diagnosis prediction by leveraging the semantic understanding of medical concepts in Electronic Health Records (EHRs). Despite these advancements, existing LM-based methods often fail to capture the structures of medical concepts (e.g., the hierarchy structure from domain knowledge). In this paper, we propose BoxLM, a novel framework that unifies the structures and semantics of medical concepts for diagnosis prediction. Specifically, we propose a structure-semantic fusion mechanism via box embeddings, which integrates both ontology-driven and EHR-driven hierarchical structures with LM-based semantic embeddings, enabling interpretable medical concept representations. Furthermore, in the box-aware diagnosis prediction module, an evolve-and-memorize patient box learning mechanism is proposed to model the temporal dynamics of patient visits, and a volume-based similarity measurement enables accurate diagnosis prediction. Extensive experiments demonstrate that BoxLM consistently outperforms state-of-the-art baselines, especially achieving strong performance in few-shot learning scenarios, showcasing its practical utility in real-world clinical settings.
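The volume-based similarity the abstract mentions can be illustrated with a generic box-embedding computation. The sketch below is a hypothetical example of such a measure (the paper's exact formula is not given in the abstract): the overlap volume of two axis-aligned boxes, normalized by the smaller box's volume.

```python
import numpy as np

def box_volume(lo, hi):
    # Volume of an axis-aligned box; empty if any side collapses.
    return float(np.prod(np.clip(hi - lo, 0.0, None)))

def box_similarity(a_lo, a_hi, b_lo, b_hi):
    # Intersection volume normalized by the smaller box's volume:
    # 1.0 when one box contains the other, 0.0 when they are disjoint.
    i_lo = np.maximum(a_lo, b_lo)
    i_hi = np.minimum(a_hi, b_hi)
    inter = box_volume(i_lo, i_hi)
    return inter / min(box_volume(a_lo, a_hi), box_volume(b_lo, b_hi))

# Two overlapping 2D concept boxes: [0,2]^2 and [1,3]^2
a_lo, a_hi = np.zeros(2), np.full(2, 2.0)
b_lo, b_hi = np.ones(2), np.full(2, 3.0)
sim = box_similarity(a_lo, a_hi, b_lo, b_hi)  # intersection [1,2]^2 -> 0.25
```

A similarity of 1.0 for nested boxes is what makes such measures natural for hierarchical concepts: a child diagnosis box inside its parent box scores a perfect match.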
Paperid:966
Authors:Stelios Triantafyllou · Aleksa Sukovic · Yasaman Zolfimoselo · Goran Radanovic
Abstract:
We address the challenge of explaining counterfactual outcomes in multiagent Markov decision processes. In particular, we aim to explain the total counterfactual effect of an agent's action on the outcome of a realized scenario through its influence on the environment dynamics and the agents' behavior. To achieve this, we introduce a novel causal explanation formula that decomposes the counterfactual effect by attributing to each agent and state variable a score reflecting their respective contributions to the effect. First, we show that the total counterfactual effect of an agent's action can be decomposed into two components: one measuring the effect that propagates through all subsequent agents' actions and another related to the effect that propagates through the state transitions. Building on recent advancements in causal contribution analysis, we further decompose these two effects as follows. For the former, we consider agent-specific effects -- a causal concept that quantifies the counterfactual effect of an agent's action that propagates through a subset of agents. Based on this notion, we use Shapley value to attribute the effect to individual agents. For the latter, we consider the concept of structure-preserving interventions and attribute the effect to state variables based on their "intrinsic" contributions. Through extensive experimentation, we demonstrate the interpretability of our approach in a Gridworld environment with LLM-assisted agents and a sepsis management simulator.
Paperid:967
Authors:Cheng-Yen Hsieh · Xinyou Wang · Daiheng Zhang · Dongyu Xue · Fei YE · Shujian Huang · Zaixiang Zheng · Quanquan Gu
Abstract:
Multimodal protein language models (PLMs) integrate sequence and token-based structural information, serving as a powerful foundation for protein modeling, generation, and design. However, the reliance on tokenizing 3D structures into discrete tokens causes substantial loss of fidelity about fine-grained structural details and correlations. In this paper, we systematically elucidate the design space of multimodal PLMs to overcome their limitations. We identify tokenization loss and inaccurate structure token predictions by the PLMs as major bottlenecks. To address these, our proposed design space covers improved generative modeling, structure-aware architectures and representation learning, and data exploration. Our advancements enable finer-grained supervision, demonstrating that token-based multimodal PLMs can achieve robust structural modeling. The effective design methods dramatically improve structure generation diversity and, notably, the folding abilities of our 650M model, reducing the RMSD from 5.52 to 2.36 on the PDB test set, even outperforming 3B baselines and on par with specialized folding models.
Paperid:968
Authors:Rei Higuchi · Taiji Suzuki
Abstract:
Aligning large language models (LLMs) with human preferences is crucial for safe deployment, yet existing methods assume specific preference models such as the Bradley-Terry model. This assumption leads to statistical inconsistency, where more data doesn't guarantee convergence to true human preferences. To address this critical gap, we introduce a novel alignment method, Direct Density Ratio Optimization (DDRO). DDRO directly estimates the density ratio between preferred and unpreferred output distributions, circumventing the need for explicit human preference modeling. We theoretically prove that DDRO is statistically consistent, ensuring convergence to the true preferred distribution as the data size grows, regardless of the underlying preference structure. Experiments demonstrate that DDRO achieves superior performance compared to existing methods, showcasing its effectiveness and potential for significant improvement. DDRO unlocks the potential for truly data-driven alignment, paving the way for more reliable and human-aligned LLMs.
Paperid:969
Authors:Satoshi Hayakawa · Yuhta Takida · Masaaki Imaizumi · Hiromi Wakaki · Yuki Mitsufuji
Abstract:
Diffusion models have demonstrated exceptional performance in various fields of generative modeling, but suffer from slow sampling speed due to their iterative nature. While this issue is being addressed in continuous domains, discrete diffusion models face unique challenges, particularly in capturing dependencies between elements (e.g., pixel relationships in images, sequential dependencies in language), mainly due to the computational cost of processing high-dimensional joint distributions. In this paper, (i) we propose "mixture" models for discrete diffusion that are capable of treating dimensional correlations while remaining scalable, and (ii) we provide a set of loss functions for distilling the iterations of existing models. Two primary theoretical insights underpin our approach: First, conventional models with element-wise independence can well approximate the data distribution, but essentially require many sampling steps. Second, our loss functions enable the mixture models to distill such many-step conventional models into just a few steps by learning the dimensional correlations. Our experimental results show the effectiveness of the proposed method in distilling pretrained discrete diffusion models across image and language domains.
Paperid:970
Authors:Yiran Qin · Zhelun Shi · Jiwen Yu · Xijun Wang · Enshen Zhou · Lijun Li · Zhenfei Yin · Xihui Liu · Lu Sheng · Jing Shao · LEI BAI · Ruimao Zhang
Abstract:
Recent advancements in predictive models have demonstrated exceptional capabilities in predicting the future state of objects and scenes. However, the lack of categorization based on inherent characteristics continues to hinder the progress of predictive model development. Additionally, existing benchmarks are unable to effectively evaluate higher-capability, highly embodied predictive models from an embodied perspective. In this work, we classify the functionalities of predictive models into a hierarchy and take the first step in evaluating World Simulators by proposing a dual evaluation framework called WorldSimBench. WorldSimBench includes Explicit Perceptual Evaluation and Implicit Manipulative Evaluation, encompassing human preference assessments from the visual perspective and action-level evaluations in embodied tasks, covering three representative embodied scenarios: Open-Ended Embodied Environment, Autonomous Driving, and Robot Manipulation. In the Explicit Perceptual Evaluation, we introduce the HF-Embodied Dataset, a video assessment dataset based on fine-grained human feedback, which we use to train a Human Preference Evaluator that aligns with human perception and explicitly assesses the visual fidelity of World Simulators. In the Implicit Manipulative Evaluation, we assess the video-action consistency of World Simulators by evaluating whether the generated situation-aware video can be accurately translated into the correct control signals in dynamic environments. Our comprehensive evaluation offers key insights that can drive further innovation in video generation models, positioning World Simulators as a pivotal advancement toward embodied artificial intelligence.
Paperid:971
Authors:Kiwhan Song · Boyuan Chen · Max Simchowitz · Yilun Du · Russ Tedrake · Vincent Sitzmann
Abstract:
Classifier-free guidance (CFG) is a key technique for improving conditional generation in diffusion models, enabling more accurate control while enhancing sample quality. It is natural to extend this technique to video diffusion, which generates video conditioned on a variable number of context frames, collectively referred to as history. However, we find two key challenges to guiding with variable-length history: architectures that only support fixed-size conditioning, and the empirical observation that CFG-style history dropout performs poorly. To address this, we propose the Diffusion Forcing Transformer (DFoT), a video diffusion architecture and theoretically grounded training objective that jointly enable conditioning on a flexible number of history frames. We then introduce History Guidance, a family of guidance methods uniquely enabled by DFoT. We show that its simplest form, vanilla history guidance, already significantly improves video generation quality and temporal consistency. A more advanced method, history guidance across time and frequency, further enhances motion dynamics, enables compositional generalization to out-of-distribution history, and can stably roll out extremely long videos. Website: anonymous URL
Paperid:972
Authors:Carles Balsells-Rodas · Xavier Sumba · Tanmayee Narendra · Ruibo Tu · Gabriele Schweikert · Hedvig Kjellström · Yingzhen Li
Abstract:
Causal discovery, i.e., inferring underlying causal relationships from observational data, is highly challenging for AI systems. In a time series modeling context, traditional causal discovery methods mainly consider constrained scenarios with fully observed variables and/or data from stationary time series. We develop a causal discovery approach to handle a wide class of nonstationary time series that are conditionally stationary, where the nonstationary behaviour is modeled as stationarity conditioned on a set of latent state variables. Named State-Dependent Causal Inference (SDCI), our approach is able to recover the underlying causal dependencies, with provable identifiability for the state-dependent causal structures. Empirical experiments on nonlinear particle interaction data and gene regulatory networks demonstrate SDCI's superior performance over baseline causal discovery methods. Improved results over non-causal RNNs on modeling NBA player movements demonstrate the potential of our method and motivate the use of causality-driven methods for forecasting.
Paperid:973
Authors:Xi Weng · Jianing An · Xudong Ma · Binhang Qi · Jie Luo · Xi Yang · Jin Song Dong · Lei Huang
Abstract:
Self-supervised learning (SSL) methods via joint embedding architectures have proven remarkably effective at capturing semantically rich representations with strong clustering properties, even in the absence of label supervision. Despite this, few of them have explored leveraging these untapped properties to improve themselves. In this paper, we provide evidence through various metrics that the encoder's output encoding exhibits superior and more stable clustering properties compared to other components. Building on this insight, we propose a novel positive-feedback SSL method, termed Representation Soft Assignment (ReSA), which leverages the model's clustering properties to promote learning in a self-guided manner. Extensive experiments on standard SSL benchmarks reveal that models pretrained with ReSA outperform other state-of-the-art SSL methods by a significant margin. Finally, we analyze how ReSA facilitates better clustering properties, demonstrating that it effectively enhances clustering performance at both fine-grained and coarse-grained levels, shaping representations that are inherently more structured and semantically meaningful.
Paperid:974
Authors:Peihua Mai · Youlong Ding · Ziyan Lyu · Minxin Du · Yan (James) Pang
Abstract:
Federated recommender system (FedRec) has emerged as a solution to protect user data through collaborative training techniques. A typical FedRec involves transmitting the full model and entire weight updates between edge devices and the server, causing significant burdens to edge devices with limited bandwidth and computational power. The sparsity of embedding updates provides an opportunity for payload optimization, while existing sparsity-aware federated protocols generally sacrifice privacy for efficiency. A key challenge in designing a secure sparsity-aware efficient protocol is to protect the rated item indices from the server. In this paper, we propose a lossless secure recommender system with sparse embedding updates (SecEmb). SecEmb reduces user payload while ensuring that the server learns no information about both rated item indices and individual updates except the aggregated model. The protocol consists of two correlated modules: (1) a privacy-preserving embedding retrieval module that allows users to download relevant embeddings from the server, and (2) an update aggregation module that securely aggregates updates at the server. Empirical analysis demonstrates that SecEmb reduces both download and upload communication costs by up to 90x and decreases user-side computation time by up to 70x compared with secure FedRec protocols. Additionally, it offers non-negligible utility advantages compared with lossy message compression methods.
Paperid:975
Authors:Jesse Farebrother · Matteo Pirotta · Andrea Tirinzoni · REMI MUNOS · Alessandro Lazaric · Ahmed Touati
Abstract:
Predictive models of the future are fundamental for an agent's ability to reason and plan. A common strategy learns a dynamics or "world" model and unrolls it step-by-step at inference, where small errors can rapidly compound. Geometric Horizon Models (GHMs) offer a compelling alternative by directly making predictions of future states, avoiding cumulative inference errors. While GHMs can be conveniently learned by a generative analog to temporal difference (TD) learning, existing methods are negatively affected by bootstrapping predictions at train time and struggle to generate high-quality predictions for longer horizons. This paper introduces Temporal Difference Flows (TD-Flow), which leverages the structure of a novel Bellman equation on probability paths alongside flow-matching techniques to learn accurate GHMs at over 5x the horizon length of prior GHMs. Theoretically, we establish a new convergence result for the broader class of flow-matching, score-matching, and diffusion-based generative TD methods and primarily attribute TD-Flow's efficacy to reduced gradient variance during training. Empirically, we validate TD-Flow across a diverse set of domains on both generative metrics and downstream tasks including policy evaluation. Moreover, integrating TD-Flow with recent behavior foundation models for planning over pre-trained policies demonstrates substantial performance gains, underscoring its promise for long-horizon decision-making.
Paperid:976
Authors:Joyentanuj Das · Suranjan De · He Sun
Abstract:
Graph clustering is an important algorithmic technique for analysing massive graphs, and has been widely applied in many research fields of data science. While the objective of most graph clustering algorithms is to find a vertex set of low conductance, there has been a sequence of recent studies that highlight the importance of the interconnection between clusters when analysing real-world datasets. Following this line of research, in this work we study bipartite-like clusters and present efficient and online algorithms that find such clusters in both undirected graphs and directed ones. We conduct experimental studies on both synthetic and real-world datasets, and show that our algorithms significantly speed up the running time of existing clustering algorithms while preserving their effectiveness.
Paperid:977
Authors:Omri Hemo · Alon Zolfi · Oryan Yehezkel · Omer Hofman · Roman Vainshtein · Hisashi Kojima · Yuval Elovici · Asaf Shabtai
Abstract:
Federated learning (FL) enables privacy-preserving distributed machine learning by sharing gradients instead of raw data. However, FL remains vulnerable to gradient inversion attacks, in which shared gradients can reveal sensitive training data. Prior research has mainly concentrated on unimodal tasks, particularly image classification, examining the reconstruction of single-modality data and analyzing privacy vulnerabilities in these relatively simple scenarios. As multi-modal models are increasingly used to address complex vision and language tasks, it becomes essential to assess the privacy risks inherent in these architectures. In this paper, we explore gradient inversion attacks targeting multi-modal Document Visual Question Answering (DQA) models and propose GI-DQA, a novel method that reconstructs private document content from gradients. Through extensive evaluation on state-of-the-art DQA models, our approach exposes critical privacy vulnerabilities and highlights the urgent need for robust defenses to secure multi-modal FL systems.
Paperid:978
Authors:Thien Hang Nguyen · Huy Nguyen
Abstract:
We introduce two complementary techniques for efficient optimization that reduce memory requirements while accelerating training of large-scale neural networks. The first technique, Subset-Norm step size, generalizes AdaGrad-Norm and AdaGrad(-Coordinate) through step-size sharing. Subset-Norm (SN) reduces AdaGrad's memory footprint from $O(d)$ to $O(\sqrt{d})$, where $d$ is the model size. For non-convex smooth objectives under coordinate-wise sub-gaussian noise, we show a noise-adapted high-probability convergence guarantee with improved dimensional dependence of SN over existing methods. Our second technique, Subspace-Momentum, reduces the momentum state's memory footprint by restricting momentum to a low-dimensional subspace while performing SGD in the orthogonal complement. We prove high-probability convergence rates for Subspace-Momentum under standard assumptions. Empirical evaluation on pre-training and fine-tuning LLMs demonstrates the effectiveness of our methods. For instance, combining Subset-Norm with Subspace-Momentum achieves Adam's validation perplexity for LLaMA 1B in approximately half the training tokens (6.8B vs 13.1B) while reducing Adam's optimizer-states memory footprint by more than 80\% with minimal additional hyperparameter tuning.
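The step-size-sharing idea behind Subset-Norm can be sketched roughly as follows (this is an illustrative reading of the abstract, not the authors' implementation): partition the $d$ coordinates into about $\sqrt{d}$ contiguous subsets and keep one AdaGrad-style accumulator per subset, so optimizer state is $O(\sqrt{d})$ instead of $O(d)$.

```python
import numpy as np

def subset_norm_update(param, grad, accum, lr=0.1, eps=1e-8):
    # One Subset-Norm step: coordinates are split into len(accum)
    # contiguous subsets, each sharing a single adaptive step size.
    # `accum` holds the running sum of squared gradient norms per
    # subset, so the optimizer state is O(num_subsets), not O(d).
    d = param.size
    k = accum.size
    bounds = np.linspace(0, d, k + 1).astype(int)
    for i in range(k):
        s = slice(bounds[i], bounds[i + 1])
        accum[i] += np.sum(grad[s] ** 2)           # shared second moment
        param[s] -= lr * grad[s] / (np.sqrt(accum[i]) + eps)
    return param, accum

# Toy objective: minimize 0.5 * ||x||^2, so grad = x.
d = 16
x = np.ones(d)
accum = np.zeros(int(np.sqrt(d)))   # 4 accumulators for 16 coordinates
for _ in range(200):
    x, accum = subset_norm_update(x, x.copy(), accum)
```

With `k = d` this reduces to coordinate-wise AdaGrad, and with `k = 1` to AdaGrad-Norm; the square-root grouping sits between the two, trading per-coordinate adaptivity for memory.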
Paperid:979
Authors:Maohao Shen · Guangtao Zeng · Zhenting Qi · Zhang-Wei Hong · Zhenfang Chen · Wei Lu · Gregory Wornell · Subhro Das · David Cox · Chuang Gan
Abstract:
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. This typically involves extensive sampling at inference time guided by an external LLM verifier, resulting in a two-player system. Despite external guidance, the effectiveness of this system demonstrates the potential of a single LLM to tackle complex tasks. Thus, we pose a new research problem: Can we internalize the searching capabilities to fundamentally enhance the reasoning abilities of a single LLM? This work explores an orthogonal direction focusing on post-training LLMs for autoregressive searching (i.e., an extended reasoning process with self-reflection and self-exploration of new strategies). To achieve this, we propose the Chain-of-Action-Thought (COAT) reasoning and a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning. Our approach results in Satori, a 7B LLM trained on open-source models and data. Extensive empirical evaluations demonstrate that Satori achieves state-of-the-art performance on mathematical reasoning benchmarks while exhibiting strong generalization to out-of-domain tasks. Code, data, and models will be fully open-sourced.
Paperid:980
Authors:Mike Heddes · Adel Javanmard · Kyriakos Axiotis · Thomas Fu · MohammadHossein Bateni · Vahab Mirrokni
Abstract:
Transformer networks have achieved remarkable success across diverse domains, leveraging a variety of architectural innovations, including residual connections. However, traditional residual connections, which simply sum the outputs of previous layers, can dilute crucial information. This work introduces DeepCrossAttention (DCA), an approach that enhances residual learning in transformers. DCA employs learnable, input-dependent weights to dynamically combine layer outputs, enabling the model to selectively focus on the most relevant information in any of the previous layers. Furthermore, DCA incorporates depth-wise cross-attention, allowing for richer interactions between layers at different depths. Our language modeling experiments show that DCA achieves improved perplexity for a given training time. Moreover, DCA obtains the same model quality up to 3x faster while adding a negligible number of parameters (e.g., 0.2\%). Theoretical analysis confirms that DCA provides an improved trade-off between accuracy and model size when the ratio of collective layer ranks to the ambient dimension falls below a critical threshold.
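The core idea of dynamically weighting previous layer outputs with an input-dependent gate can be sketched as follows; `W_gate` and the softmax gating form are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1D logit vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def dca_combine(layer_outputs, W_gate):
    # Score each previous layer with a gate conditioned on the latest
    # output, then mix layers by the resulting softmax weights instead
    # of the plain unweighted sum of a standard residual path.
    H = np.stack(layer_outputs)     # (L, d): one row per layer output
    scores = H[-1] @ W_gate         # (L,): one logit per layer
    return softmax(scores) @ H      # (d,): input-dependent mixture

# With a zero gate the weights are uniform, recovering a simple average.
layers = [np.full(4, 1.0), np.full(4, 2.0), np.full(4, 3.0)]
mixed = dca_combine(layers, np.zeros((4, 3)))   # -> [2., 2., 2., 2.]
```

The gate adds only a small `d x L` weight matrix per block, which is consistent with the abstract's claim of a negligible parameter overhead.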
Paperid:981
Authors:Cun-Yuan Xing · Meng-Zhang Qian · Wu-Yang Chen · Wei Gao · Zhi-Hua Zhou
Abstract:
Feature evolvable learning studies the scenario where old features will vanish and new features will emerge when learning with data streams, and various methods have been developed by utilizing some useful relationships from old features to new features, rather than retraining from scratch. This work focuses on two fundamental problems: How to characterize relationships between two different feature spaces, and how to exploit those relationships for feature evolvable learning. We introduce the Kernel Ortho-Mapping (KOM) discrepancy to characterize relationships between two different feature spaces via kernel functions, and correlate it with the optimal classifiers learned from different feature spaces. Based on such discrepancy, we develop the one-pass algorithm for feature evolvable learning, which requires going through all instances only once without storing the entire or partial training data. Our basic idea is to take online kernel learning with random Fourier features and incorporate some feature and label information for feature evolvable learning. We verify the effectiveness of our proposed method both theoretically and empirically.
Paperid:982
Authors:Dario Coscia · Max Welling · Nicola Demo · Gianluigi Rozza
Abstract:
Autoregressive and recurrent networks have achieved remarkable progress across various fields, from weather forecasting to molecular generation and Large Language Models. Despite their strong predictive capabilities, these models lack a rigorous framework for addressing uncertainty, which is key in scientific applications such as PDE solving, molecular generation, and machine learning force fields. To address this shortcoming we present BARNN: a variational Bayesian Autoregressive and Recurrent Neural Network. BARNN aims to provide a principled way to turn any autoregressive or recurrent model into its Bayesian version. BARNN is based on the variational dropout method, allowing it to be applied to large recurrent neural networks as well. We also introduce a temporal version of the "Variational Mixtures of Posteriors" prior (tVAMP-prior) to make Bayesian inference efficient and well-calibrated. Extensive experiments on PDE modelling and molecular generation demonstrate that BARNN not only achieves comparable or superior accuracy compared to existing methods, but also excels in uncertainty quantification and modelling long-range dependencies.
Paperid:983
Authors:Taekyung Lee · Jaemoo Choi · Jaewoong Choi · Myungjoo Kang
Abstract:
Unpaired point cloud completion is crucial for real-world applications, where ground-truth data for complete point clouds are often unavailable. By learning a completion map from unpaired incomplete and complete point cloud data, this task avoids the reliance on paired datasets. In this paper, we propose the \textit{Unbalanced Optimal Transport Map for Unpaired Point Cloud Completion (\textbf{UOT-UPC})} model, which formulates the unpaired completion task as an Unbalanced Optimal Transport (UOT) problem. Our method employs a Neural OT model that learns the UOT map using neural networks. Our model is the first attempt to leverage UOT for unpaired point cloud completion, achieving competitive or superior performance on both single-category and multi-category benchmarks. In particular, our approach is especially robust under the class imbalance problem, which is frequently encountered in real-world unpaired point cloud completion scenarios.
Paperid:984
Authors:Renhao Lu
Abstract:
Recent advancements in deep neural networks have significantly enhanced the performance of semantic segmentation. However, class imbalance and instance imbalance remain persistent challenges, particularly in biomedical image analysis, where smaller instances and thin boundaries are often overshadowed by larger structures. To address the multiscale nature of segmented objects, various models have incorporated mechanisms such as spatial attention and feature pyramid networks. Despite these advancements, most loss functions are still primarily pixel-wise, while regional and boundary-focused loss functions often incur high computational costs or are restricted to small-scale regions. To address this limitation, we propose complex wavelet mutual information (CWMI) loss, a novel loss function that leverages mutual information from subband images decomposed by a complex steerable pyramid. Unlike discrete wavelet transforms, the complex steerable pyramid captures features across multiple orientations, remains robust to small translations, and preserves structural similarity across scales. Moreover, mutual information is well-suited for capturing high-dimensional directional features and exhibits greater noise robustness compared to prior wavelet-based loss functions that rely on distance or angle metrics. Extensive experiments on diverse segmentation datasets demonstrate that CWMI loss achieves significant improvements in both pixel-wise accuracy and topological metrics compared to state-of-the-art methods, while introducing minimal computational overhead.
Paperid:985
Authors:Kazuto Fukuchi
Abstract:
We address the regression problem under the constraint of demographic parity, a commonly used fairness definition. Recent studies have revealed fair minimax optimal regression algorithms—the most accurate algorithms that adhere to the fairness constraint. However, these analyses are tightly coupled with specific data generation models. In this paper, we provide metatheorems that can be applied to various situations to validate the fair minimax optimality of the corresponding regression algorithms. Furthermore, we demonstrate that fair minimax optimal regression can be achieved through post-processing methods, allowing researchers and practitioners to focus on improving conventional regression techniques, which can then be efficiently adapted for fair regression.
Paperid:986
Authors:Katarzyna Kobalczyk · M van der Schaar
Abstract:
Reward modelling from preference data is a crucial step in aligning large language models (LLMs) with human values, requiring robust generalisation to novel prompt-response pairs. In this work, we propose to frame this problem in a causal paradigm, providing the rich toolbox of causality to identify the persistent challenges, such as causal misidentification, preference heterogeneity, and confounding due to user-specific factors. Inheriting from the literature of causal inference, we identify key assumptions necessary for reliable generalisation and contrast them with common data collection practices. We illustrate failure modes of naive reward models and demonstrate how causally-inspired approaches can improve model robustness. Finally, we outline desiderata for future research and practices, advocating targeted interventions to address inherent limitations of observational data.
Paperid:987
Authors:Jianqing Liang · Zhiqiang Li · Xinkai Wei · Yuan Liu · Zhiqiang Wang
Abstract:
Graph contrastive learning has attracted great interest as a dominant and promising self-supervised representation learning approach in recent years. While existing works follow the basic principle of pulling positive pairs closer while pushing negative pairs far away, they still suffer from several critical problems, such as the underlying semantic disturbance brought by augmentation strategies, the failure of GCNs to capture long-range dependence, and the rigidity and inefficiency of node sampling techniques. To address these issues, we propose Manifold Learning Inspired Lightweight Graph Contrastive Learning (ML$^2$-GCL), which inherits the merits of both manifold learning and GCNs. ML$^2$-GCL avoids the potential risks of semantic disturbance with only a single view. It achieves global nonlinear structure recovery from locally linear fits, which can make up for the defects of GCNs. A key advantage is its lightweight design, owing to a closed-form solution for positive pair weights and the removal of pairwise distance calculations. Theoretical analysis proves the existence of the optimal closed-form solution. Extensive empirical results on various benchmarks and evaluation protocols demonstrate the effectiveness and lightweight nature of ML$^2$-GCL.
Paperid:988
Authors:Chi Zhang · Ziying Jia · George Atia · Sihong He · Yue Wang
Abstract:
Transfer reinforcement learning aims to derive a near-optimal policy for a target environment with limited data by leveraging abundant data from related source domains. However, it faces two key challenges: the lack of performance guarantees for the transferred policy, which can lead to undesired actions, and the risk of negative transfer when multiple source domains are involved. We propose a novel framework based on the pessimism principle, which constructs and optimizes a conservative estimation of the target domain’s performance. Our framework effectively addresses the two challenges by providing an optimized lower bound on target performance, ensuring safe and reliable decisions, and by exhibiting monotonic improvement with respect to the quality of the source domains, thereby avoiding negative transfer. We construct two types of conservative estimations, rigorously characterize their effectiveness, and develop efficient distributed algorithms with convergence guarantees. Our framework provides a theoretically sound and practically robust solution for transfer learning in reinforcement learning.
Paperid:989
Authors:Delong Chen · Samuel Cahyawijaya · Jianfeng Liu · Baoyuan Wang · Pascale FUNG
Abstract:
Patch-based image tokenization ignores the morphology of the visual world, limiting effective and efficient learning of image understanding. Inspired by subword tokenization, we introduce subobject-level adaptive token segmentation and explore several approaches, including superpixel, SAM, and a proposed Efficient and PanOptiC (EPOC) image tokenizer. Our EPOC combines boundary detection--a simple task that can be handled well by a compact model--with watershed segmentation, which inherently guarantees no pixels are left unsegmented. Intrinsic evaluations across 5 datasets demonstrate that EPOC's segmentation aligns well with human annotations of both object- and part-level visual morphology, producing more monosemantic tokens and offering substantial efficiency advantages. For extrinsic evaluation, we designed a token embedding that handles arbitrary-shaped tokens, and trained VLMs with different tokenizers on 4 datasets of object recognition and detailed captioning. The results reveal that subobject tokenization enables faster convergence and better generalization while using fewer visual tokens. Code and models will be made publicly available.
Paperid:990
Authors:Guibin Zhang · Luyang Niu · Junfeng Fang · Kun Wang · LEI BAI · Xiang Wang
Abstract:
Large Language Model (LLM)-empowered multi-agent systems extend the cognitive boundaries of individual agents through disciplined collaboration and interaction, while constructing these systems often requires labor-intensive manual designs. Despite the availability of methods to automate the design of agentic workflows, they typically seek to identify a static, complex, one-size-fits-all system, which, however, fails to dynamically allocate inference resources based on the difficulty and domain of each query. To address this challenge, we shift away from the pursuit of a monolithic agentic system, instead optimizing the \textbf{agentic supernet}, a probabilistic and continuous distribution of agentic architectures. We introduce \textbf{MaAS}, an automated framework that samples query-dependent agentic systems from the supernet, delivering high-quality solutions and tailored resource allocation (\textit{e.g.}, LLM calls, tool calls, token cost). Comprehensive evaluation across six benchmarks demonstrates that MaAS \textbf{(I)} requires only $6\sim45\%$ of the inference costs of existing handcrafted or automated multi-agent systems, \textbf{(II)} surpasses them by $0.54\%\sim11.82\%$, and \textbf{(III)} enjoys superior cross-dataset and cross-LLM-backbone transferability.
Paperid:991
Authors:Mason Kamb · Surya Ganguli
Abstract:
We obtain an analytic, interpretable and predictive theory of creativity in convolutional diffusion models. Indeed, score-based diffusion models can generate highly creative images that lie far from their training data. But optimal score-matching theory suggests that these models should only be able to produce memorized training examples. To reconcile this theory-experiment gap, we identify two simple inductive biases, locality and equivariance, that: (1) induce a form of combinatorial creativity by preventing optimal score-matching; (2) result in a fully analytic, completely mechanistically interpretable, equivariant local score (ELS) machine that, (3) without any training can quantitatively predict the outputs of trained convolution-only diffusion models (like ResNets and UNets) with high accuracy (median $r^2$ of 0.90, 0.91, and 0.94 on CIFAR10, FashionMNIST, and MNIST). Our ELS machine reveals a locally consistent patch mosaic model of creativity, in which diffusion models create exponentially many novel images by mixing and matching different local training set patches in different image locations. Our theory also partially predicts the outputs of pre-trained self-attention enabled UNets (median $r^2$ of 0.75 on CIFAR10), revealing an intriguing role for attention in carving out semantic coherence from local patch mosaics.
Paperid:992
Authors:Mark Vero · Niels Mündler · Viktor Chibotaru · Veselin Raychev · Maximilian Baader · Nikola Jovanović · Jingxuan He · Martin Vechev
Abstract:
The automatic generation of programs has long been a fundamental challenge in computer science. Recent benchmarks have shown that large language models (LLMs) can effectively generate code at the function level, make edits given code context, and solve algorithmic coding tasks. To achieve full automation, however, LLMs should be able to generate production-quality, self-contained application modules. To evaluate the capabilities of LLMs in solving this challenge, we introduce BaxBench, a novel evaluation benchmark consisting of 392 challenging tasks for the generation of backend applications. We focus on backends for three critical reasons: (i) they are practically relevant, building the core components of most modern web and cloud software, (ii) they are difficult to get right, requiring multiple functions and files to achieve the desired functionality, and (iii) they are security-critical, as they are exposed to untrusted third parties, making secure solutions that prevent deployment-time attacks an imperative. BaxBench validates the functionality of the generated applications with comprehensive test cases, and assesses their security exposure by executing end-to-end exploits. Our experiments reveal key limitations of current LLMs in both functionality and security: (i) even the best model, OpenAI’s o1, achieves a mere 60% on code correctness; (ii) on average, we could successfully execute security exploits on more than half of the correct programs generated by LLMs; and (iii) in less popular backend frameworks, models further struggle to generate correct and secure applications. Progress on BaxBench signifies important steps toward autonomous and secure software development with LLMs.
Paperid:993
Authors:Ming Lin · Lin CHEN
Abstract:
Bloom filters (BF) are space-efficient probabilistic data structures for approximate membership testing. Boosted by the proliferation of machine learning, learned Bloom filters (LBF) were recently proposed by augmenting the canonical BFs with a learned oracle as a pre-filter, the size of which is crucial to the compactness of the overall system. In this paper, inspired by ensemble learning, we depart from the state-of-the-art single-oracle LBF structure by demonstrating that, by leveraging multiple learning oracles of smaller size and carefully optimizing the accompanied backup filters, we can significantly boost the performance of LBF under the same space budget. We then design and optimize ensemble learned Bloom filters for mutually independent and correlated learning oracles respectively. We also empirically demonstrate the performance improvement of our propositions under three practical data analysis tasks.
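For readers unfamiliar with the canonical structure being augmented here, a minimal Bloom filter can be written with the standard library alone. This sketch is illustrative only: the hashing scheme (salted SHA-256) and sizes are assumptions, and the paper's learned-oracle pre-filters and backup-filter optimization are not shown.

```python
import hashlib

class BloomFilter:
    """A canonical Bloom filter: k hash positions set per inserted item.

    Queries can yield false positives but never false negatives.
    """

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive k positions by salting SHA-256 with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))
```

A learned Bloom filter in the sense described above would place a small classifier in front of such a structure, reserving the backup filter for items the classifier rejects.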
Paperid:994
Authors:Hongbi ZHOU · Zhangkai NI
Abstract:
3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis. However, existing methods struggle to adaptively optimize the distribution of Gaussian primitives based on scene characteristics, making it challenging to balance reconstruction quality and efficiency. Inspired by human perception, we propose scene-adaptive perceptual densification for Gaussian Splatting (Perceptual-GS), a novel framework that integrates perceptual sensitivity into the 3DGS training process to address this challenge. We first introduce a perception-aware representation that models human visual sensitivity while constraining the number of Gaussian primitives. Building on this foundation, we develop a perceptual sensitivity-adaptive distribution mechanism to allocate finer Gaussian granularity to visually critical regions, enhancing reconstruction quality and robustness. Extensive evaluations on multiple datasets, including BungeeNeRF for large-scale scenes, demonstrate that Perceptual-GS achieves state-of-the-art performance in reconstruction quality, efficiency, and robustness. The code will be made publicly available.
Paperid:995
Authors:Siqi Huang · Yanchen Xu · Hongyuan Zhang · Xuelong Li
Abstract:
Although graph contrastive learning (GCL) has been widely investigated, it is still a challenge to generate effective and stable graph augmentations. Existing methods often apply heuristic augmentation like random edge dropping, which may disrupt important graph structures and result in unstable GCL performance. In this paper, we propose Positive-incentive Noise driven Graph Data Augmentation (PiNGDA), where positive-incentive noise (pi-noise) provides a principled, information-theoretic analysis of the beneficial effect of noise. To bridge the standard GCL and pi-noise frameworks, we design a Gaussian auxiliary variable to convert the loss function to information entropy. We prove that standard GCL with pre-defined augmentations is equivalent to estimating the beneficial noise via point estimation. Following our analysis, PiNGDA instead learns the beneficial noise on both topology and attributes through a trainable noise generator for graph augmentations. Since the generator learns how to produce beneficial perturbations on graph topology and node attributes, PiNGDA is more reliable than existing methods. Extensive experimental results validate the effectiveness and stability of PiNGDA.
Paperid:996
Authors:Xuekai Zhu · Daixuan Cheng · Hengli Li · Kaiyan Zhang · Ermo Hua · Xingtai Lv · Ning Ding · Zhouhan Lin · Zilong Zheng · Bowen Zhou
Abstract:
Model collapse in synthetic data indicates that iterative training on self-generated data leads to a gradual decline in performance. With the proliferation of AI models, synthetic data will fundamentally reshape the web data ecosystem. Future GPT-$\{n\}$ models will inevitably be trained on a blend of synthetic and human-produced data. In this paper, we focus on two questions: what is the impact of synthetic data on language model training, and how to synthesize data without model collapse? We first pre-train language models across different proportions of synthetic data, revealing a negative correlation between the proportion of synthetic data and model performance. We further conduct statistical analysis on synthetic data to uncover the distributional shift phenomenon and over-concentration of n-gram features. Inspired by the above findings, we propose token editing on human-produced data to obtain semi-synthetic data. As a proof of concept, we theoretically demonstrate that token-level editing can prevent model collapse, as the test error is constrained by a finite upper bound. We conduct extensive experiments on pre-training from scratch, continual pre-training, and supervised fine-tuning. The results validate our theoretical proof that token-level editing improves data quality and enhances model performance.
Paperid:997
Authors:Vikram Kher · Manolis Zampetakis
Abstract:
When can the distributional assumptions of theorems and learning algorithms be trusted? Inspired by this question, Rubinfeld and Vasilyan (2023) initiated the study of testable learning. In this schema, we always learn one of the following two things: either we have achieved the desired accuracy regardless of whether the distributional assumptions are satisfied, or the input distribution does not satisfy the original distributional assumptions. Motivated by the challenge of relying on strong distributional assumptions in many theorems in mechanism design, we develop a testable learning framework for mechanism design. Traditional models in mechanism design assume that value distributions satisfy some notion of regularity. Unfortunately, testing regularity is not possible in the original testable learning framework, as we show. To bypass this impossibility, we propose a regularized version of the testable learning framework. Under this framework, we always learn one of the following two things: either we achieve high revenue compared to the best possible revenue of any regular distribution close to the input distribution, or the input distribution does not satisfy regularity. We then use this framework to provide: 1) a tester-learner pair for revenue optimal mechanisms, 2) a tester for whether the fundamental Bulow-Klemperer Theorem (Bulow and Klemperer 1996) is applicable to a given dataset, and 3) a tester to confirm the existence of an anonymous reserve price that results in the anonymous price auction securing a constant fraction of the optimal revenue.
Paperid:998
Authors:Andrew Williams · Arjun Ashok · Étienne Marcotte · Valentina Zantedeschi · Jithendaraa Subramanian · Roland Riachi · James Requeima · Alexandre Lacoste · Irina Rish · Nicolas Chapados · Alexandre Drouin
Abstract:
Forecasting is a critical task in decision-making across numerous domains. While historical numerical data provide a start, they fail to convey the complete context for reliable and accurate predictions. Human forecasters frequently rely on additional information, such as background knowledge and constraints, which can efficiently be communicated through natural language. However, in spite of recent progress with LLM-based forecasters, their ability to effectively integrate this textual information remains an open question. To address this, we introduce "Context is Key" (CiK), a time-series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context, requiring models to integrate both modalities; crucially, every task in CiK requires understanding textual context to be solved successfully. We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters, and propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark. Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings. This benchmark aims to advance multimodal forecasting by promoting models that are both accurate and accessible to decision-makers with varied technical expertise. The benchmark can be visualized at https://anon-forecast.github.io/benchmarkreportdev/.
Paperid:999
Authors:Sean Young
Abstract:
In recent years, the compression of large language models (LLMs) has emerged as a key problem in facilitating LLM deployment on resource-limited devices, reducing compute costs, and mitigating the environmental footprint due to large-scale AI infrastructure. Here, we lay down the foundations of LLM quantization from a rate–distortion theory perspective and propose a quantization technique based on simple rate–distortion optimization. Our technique scales to models containing hundreds of billions of weight parameters and offers users the flexibility to compress models, post-training, to a model size or accuracy specified by the user.
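As a point of reference for the rate–distortion framing, a uniform scalar quantizer illustrates the basic trade-off: more bits per weight means lower distortion. This is a hedged, stdlib-only sketch and not the paper's technique; the function names and the mean-squared-error distortion measure are assumptions of this illustration.

```python
def quantize(weights, bits):
    """Uniform scalar quantization of a list of weights to 2**bits levels."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1) if levels > 1 else (hi - lo)
    if step == 0:  # all weights identical: nothing to quantize
        return list(weights)
    # Snap each weight to the nearest reconstruction level.
    return [lo + round((w - lo) / step) * step for w in weights]

def distortion(weights, quantized):
    """Mean squared error between original and quantized weights."""
    return sum((w - q) ** 2 for w, q in zip(weights, quantized)) / len(weights)
```

Increasing the bit budget shrinks the quantization step and hence the distortion; a rate–distortion optimizer of the kind described above would choose where to spend bits rather than spreading them uniformly.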
Paperid:1000
Authors:sai surya · Inderjit Dhillon
Abstract:
Transformers have had tremendous impact on several sequence-related tasks, largely due to their ability to retrieve from any part of the sequence via softmax-based dot-product attention. This mechanism plays a crucial role in the Transformer's performance. We analyze the gradients backpropagated through the softmax operation in the attention mechanism and observe that these gradients can often be small. This poor gradient signal backpropagation can lead to inefficient learning of parameters preceding the attention operations. To this end, we introduce a new attention mechanism called LASER, which we analytically show to admit a larger gradient signal. We show that LASER attention can be implemented by making small modifications to existing attention implementations. We conduct experiments on autoregressive large language models (LLMs) with up to 2.2 billion parameters, where we show up to 3.38% and an average of ~1% improvement over standard attention on downstream evaluations. Using LASER gives improvements in generalization performance across a variety of tasks (vision, text, and speech): Vision Transformer (ViT) on ImageNet, Conformer on LibriSpeech speech-to-text, and BERT with 2.2 billion parameters.
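The observation that gradients through softmax can be small is easy to reproduce: the diagonal of the softmax Jacobian, $p_i(1-p_i)$, shrinks as the attention distribution becomes peaked. The following stdlib sketch demonstrates that effect only; it is not the LASER mechanism itself, and the function names are assumptions.

```python
import math

def softmax(xs):
    # Numerically stable softmax.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def softmax_jacobian_diag(xs):
    # Diagonal of the softmax Jacobian: d softmax(x)_i / d x_i = p_i * (1 - p_i).
    p = softmax(xs)
    return [pi * (1.0 - pi) for pi in p]
```

For uniform logits `[0, 0, 0]` each diagonal entry equals 2/9, while for peaked logits `[10, 0, 0]` every entry falls below 1e-4, so very little gradient flows back through the attention scores once one position dominates.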
Paperid:1001
Authors:Niket Patel · Guido Montufar
Abstract:
We define the local complexity of a neural network with continuous piecewise linear activations as a measure of the density of linear regions over an input data distribution. We show theoretically that ReLU networks that learn low-dimensional feature representations have a lower local complexity. This allows us to connect recent empirical observations on feature learning at the level of the weight matrices with concrete properties of the learned functions. In particular, we show that the local complexity serves as an upper bound on the total variation of the function over the input data distribution and thus that feature learning can be related to adversarial robustness. Lastly, we consider how optimization drives ReLU networks towards solutions with lower local complexity. Overall, this work contributes a theoretical framework towards relating geometric properties of ReLU networks to different aspects of learning such as feature learning and representation cost.
Paperid:1002
Authors:Abheek Ghosh · Tzeh Neoh · Nicholas Teh · Giannis Tyrovolas
Abstract:
We study a model of subscription-based platforms where users pay a fixed fee for unlimited access to content, and creators receive a share of the revenue. Existing approaches to detecting fraud predominantly rely on machine learning methods, engaging in an ongoing arms race with bad actors. We explore revenue division mechanisms that inherently disincentivize manipulation. We formalize three types of manipulation-resistance axioms and examine which existing rules satisfy them. We show that a mechanism widely used by streaming platforms not only fails to prevent fraud but also makes detecting manipulation computationally intractable. We also introduce a novel rule, ScaledUserProp, that satisfies all three manipulation-resistance axioms. Finally, experiments with both real-world and synthetic streaming data support ScaledUserProp as a fairer alternative to existing rules.
Paperid:1003
Authors:Menglin Xia · Victor Ruehle · Saravanakumar Rajmohan · Reza Shokri
Abstract:
How effectively can LLM-based AI assistants utilize their memory (context) to perform various tasks? Traditional data benchmarks, which are often manually crafted, suffer from several limitations: they are static, susceptible to overfitting, difficult to interpret, and lack actionable insights--failing to pinpoint the specific capabilities a model lacks when it does not pass a test. In this paper, we present a framework for automatically generating a comprehensive set of tests to evaluate models' abilities to use their memory effectively. Our framework extends the range of capability tests beyond the commonly explored (passkey, key-value, needle in the haystack) search, a dominant focus in the literature. Specifically, we evaluate models on atomic tasks such as searching, recalling, editing, matching, comparing information in context memory, and performing basic operations when inputs are structured into distinct blocks, simulating real-world data. Additionally, we design composite tests to investigate the models' ability to maintain state while operating on memory. Our benchmark enables an interpretable, detailed assessment of memory capabilities of LLMs.
Paperid:1004
Authors:Zijian Liang · Kai Niu · Changshuo Wang · Jin Xu · Ping Zhang
Abstract:
Recent contributions of semantic information theory reveal the set-element relationship between semantic and syntactic information, represented as synonymous relationships. In this paper, we propose a synonymous variational inference (SVI) method based on this synonymity viewpoint to re-analyze the perceptual image compression problem. It takes perceptual similarity as a typical synonymous criterion to build an ideal synonymous set (Synset), and approximates the posterior of its latent synonymous representation with a parametric density by minimizing a partial semantic KL divergence. This analysis theoretically proves that the optimization direction of perceptual image compression follows a triple tradeoff that can cover the existing rate-distortion-perception schemes. Additionally, we introduce synonymous image compression (SIC), a new image compression scheme that corresponds to the analytical process of SVI, and implement a progressive SIC codec to fully leverage the model’s capabilities. Experimental results demonstrate comparable rate-distortion-perception performance using a single progressive SIC codec, thus verifying the effectiveness of our proposed analysis method.
Paperid:1005
Authors:Amirhesam Abedsoltan · Huaqing Zhang · Hongzhou Lin · Kaiyue Wen · Jingzhao Zhang · Misha Belkin
Abstract:
Large language models (LLMs) exhibit remarkable task generalization, solving tasks they were never explicitly trained on with only a few demonstrations. This raises a fundamental question: When can learning from a small set of tasks generalize to a large task family? In this paper, we investigate task generalization through the lens of autoregressive compositional structure, where each task is a composition of $T$ operations, and each operation is among a finite family of $D$ subtasks. This yields a total class of size $D^T$. We first show that generalization to all $D^T$ tasks is theoretically achievable by training on only $\tilde{O}(D)$ tasks. Empirically, we demonstrate that Transformers achieve such exponential task generalization on sparse parity functions via In-context Learning (ICL) and chain-of-thought (CoT) reasoning. We further demonstrate this exponential generalization in arithmetic and language translation, extending beyond parity functions.
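The combinatorics behind the $D^T$ task family can be made concrete with a toy example. In this hypothetical sketch the $D$ subtasks are single-bit toggles on a small state, so every length-$T$ index sequence names a distinct task; the operation family, sizes, and names are illustrative assumptions, not the paper's construction.

```python
import itertools

D, T = 3, 4  # D primitive subtasks, compositions of length T

def make_toggle(i):
    """A hypothetical primitive subtask: flip bit i of the state."""
    def op(state):
        s = list(state)
        s[i] ^= 1
        return tuple(s)
    return op

ops = [make_toggle(i) for i in range(D)]

# Every length-T sequence of subtask indices is a task description,
# so the family has exactly D**T members.
tasks = list(itertools.product(range(D), repeat=T))

def run_task(task, state=(0, 0, 0)):
    """Execute a task: apply its T operations in order."""
    for idx in task:
        state = ops[idx](state)
    return state
```

With $D=3$ and $T=4$ this already yields 81 tasks from only 3 primitives, which is the exponential blow-up that makes training on roughly $\tilde{O}(D)$ tasks, rather than all $D^T$, the interesting regime.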
Paperid:1006
Authors:Yuji Wang · Zehua Chen · Chen Xiaoyu · Yixiang Wei · Jun Zhu · Jianfei Chen
Abstract:
Diffusion models have achieved remarkable progress on image-to-video (I2V) generation, while their noise-to-data generation process is inherently mismatched with this task, which may lead to suboptimal synthesis quality. In this work, we present FrameBridge. By modeling the frame-to-frames generation process with a data-to-data generative process based on a bridge model, we are able to fully exploit the information contained in the given image and improve the consistency between the generation process and the I2V task. Moreover, we propose two novel techniques for the two popular settings of training I2V models. Firstly, we propose SNR-Aligned Fine-tuning (SAF), making the first attempt to fine-tune a diffusion model into a bridge model and, therefore, allowing us to utilize pre-trained diffusion-based text-to-video (T2V) models. Secondly, we propose a neural prior, further improving the synthesis quality of FrameBridge when training from scratch. Experiments conducted on WebVid-2M and UCF-101 demonstrate the superior quality of FrameBridge in comparison with its diffusion counterpart (zero-shot FVD 95 vs. 192 on MSR-VTT and non-zero-shot FVD 122 vs. 171 on UCF-101), and the advantages of our proposed SAF and neural prior for bridge-based I2V models. Demo samples can be visited at: https://framebridge-icml.github.io/
Paperid:1007
Authors:Xiangkun Hu · Lemin Kong · Tong He · David Wipf
Abstract:
The generated responses of large language models (LLMs) are often fine-tuned to human preferences through a process called reinforcement learning from human feedback (RLHF). As RLHF relies on a challenging training sequence, whereby a separate reward model is independently learned and then later applied to LLM policy updates, ongoing research effort has targeted more straightforward alternatives. In this regard, direct preference optimization (DPO) and its many offshoots circumvent the need for a separate reward training step. Instead, through the judicious use of a reparameterization trick that induces an implicit reward, DPO and related methods consolidate learning to the minimization of a single loss function. And yet despite demonstrable success in some real-world settings, we prove that DPO-based objectives are nonetheless subject to sub-optimal regularization and counter-intuitive interpolation behaviors, underappreciated artifacts of the reparameterizations upon which they are based. To this end, we introduce an explicit preference optimization framework termed EXPO that requires no analogous reparameterization to achieve an implicit reward. Quite differently, we merely posit intuitively-appealing regularization factors from scratch that transparently avoid the potential pitfalls of key DPO variants, provably satisfying regularization desiderata that prior methods do not. Empirical results serve to corroborate our analyses and showcase the efficacy of EXPO.
Paperid:1008
Authors:Juha Harviainen · Frank Sommer · Manuel Sorge · Stefan Szeider
Abstract:
We present a comprehensive classical and parameterized complexity analysis of decision tree pruning operations, extending recent research on the complexity of learning small decision trees. Thereby, we offer new insights into the computational challenges of decision tree simplification, a crucial aspect of developing interpretable and efficient machine learning models. We focus on the fundamental pruning operations of subtree replacement and raising, which are used in heuristic approaches. Surprisingly, while optimal pruning can be performed in polynomial time for subtree replacement, the problem is NP-complete for subtree raising. We therefore identify parameters and combinations thereof that lead to fixed-parameter tractability or hardness, establishing a precise borderline between these complexity classes. For example, while subtree raising is hard for small domain size $D$ or number $d$ of features, it can be solved in $D^{2d} \cdot |I|^{O(1)}$ time, where $|I|$ is the input size. We complement our theoretical findings with preliminary experimental results, demonstrating the practical implications of our analysis.
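The subtree-replacement operation discussed above can be illustrated with a short sketch. The `Node` class, `prune_replace` routine, and toy data below are illustrative assumptions, not the paper's implementation: a subtree is replaced bottom-up by a majority-label leaf whenever doing so does not increase training error.

```python
class Node:
    """Binary decision-tree node; a node is a leaf iff `feature` is None."""
    def __init__(self, feature=None, left=None, right=None, label=None):
        self.feature, self.left, self.right, self.label = feature, left, right, label

def errors(node, data):
    """Count misclassifications of `data` (list of (features-dict, label)) under `node`."""
    if node.feature is None:
        return sum(y != node.label for x, y in data)
    left = [(x, y) for x, y in data if x[node.feature] == 0]
    right = [(x, y) for x, y in data if x[node.feature] == 1]
    return errors(node.left, left) + errors(node.right, right)

def majority(data):
    labels = [y for _, y in data]
    return max(set(labels), key=labels.count) if labels else 0

def prune_replace(node, data):
    """Bottom-up subtree replacement: swap a subtree for a majority-label leaf
    whenever that does not increase training error on `data`."""
    if node.feature is None:
        return node
    left = [(x, y) for x, y in data if x[node.feature] == 0]
    right = [(x, y) for x, y in data if x[node.feature] == 1]
    node.left = prune_replace(node.left, left)
    node.right = prune_replace(node.right, right)
    leaf = Node(label=majority(data))
    return leaf if errors(leaf, data) <= errors(node, data) else node

# Toy example: the split on 'b' is redundant, so that subtree collapses to a leaf.
tree = Node('a',
            Node('b', Node(label=0), Node(label=0)),
            Node(label=1))
data = [({'a': 0, 'b': 0}, 0), ({'a': 0, 'b': 1}, 0),
        ({'a': 1, 'b': 0}, 1), ({'a': 1, 'b': 1}, 1)]
pruned = prune_replace(tree, data)
```

Subtree raising, by contrast, lifts a subtree to replace its parent, and the abstract's point is that optimizing over such raisings is NP-complete while optimal replacement is polynomial.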
Paperid:1009
Authors:Adrienne Tuynman · Rémy Degenne
Abstract:
In a fixed-confidence pure exploration problem in stochastic multi-armed bandits, an algorithm iteratively samples arms and should stop as early as possible and return the correct answer to a query about the arms' distributions. We are interested in batched methods, which change their sampling behaviour only a few times, between batches of observations. We give an instance-dependent lower bound on the number of batches used by any sample-efficient algorithm for any pure exploration task. We then give a general batched algorithm and prove upper bounds on its expected sample complexity and batch complexity. We illustrate both lower and upper bounds on best-arm identification and thresholding bandits.
Paperid:1010
Authors:Nikita Zozoulenko · Thomas Cass · Lukas Gonon
Abstract:
We introduce Random Feature Representation Boosting (RFRBoost), a novel method for constructing deep residual random feature neural networks (RFNNs) using boosting theory. RFRBoost uses random features at each layer to learn the functional gradient of the network representation, enhancing performance while preserving the convex optimization benefits of RFNNs. In the case of MSE loss, we obtain closed-form solutions to greedy layer-wise boosting with random features. For general loss functions, we show that fitting random feature residual blocks reduces to solving a quadratically constrained least squares problem. We demonstrate, through numerical experiments on 91 tabular datasets for regression and classification, that RFRBoost significantly outperforms traditional RFNNs and end-to-end trained MLP ResNets, while offering substantial computational advantages and theoretical guarantees stemming from boosting theory.
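For intuition, here is a minimal sketch of the closed-form case under MSE loss: each random-feature block is ridge-fit to the current residual, which is the negative functional gradient of squared error. The feature map, hyperparameters, and greedy loop below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_features(X, W, b):
    """Random ReLU features: phi(X) = max(X @ W + b, 0)."""
    return np.maximum(X @ W + b, 0.0)

def fit_residual_block(X, residual, n_feats=256, reg=1e-3):
    """Closed-form ridge fit of one random-feature block to the residual
    (the negative functional gradient of squared error)."""
    d = X.shape[1]
    W = rng.normal(size=(d, n_feats)) / np.sqrt(d)   # frozen random weights
    b = rng.normal(size=n_feats)
    Phi = random_features(X, W, b)
    A = Phi.T @ Phi + reg * np.eye(n_feats)          # ridge normal equations
    beta = np.linalg.solve(A, Phi.T @ residual)
    return W, b, beta

# Greedy layer-wise boosting on a toy regression problem.
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
pred = np.zeros_like(y)
for _ in range(3):                                   # three boosted blocks
    W, b, beta = fit_residual_block(X, y - pred)
    pred = pred + random_features(X, W, b) @ beta
mse = np.mean((y - pred) ** 2)
```

Because only the output weights `beta` are learned and each fit is a linear solve, the convexity benefits of RFNNs mentioned above are preserved at every layer.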
Paperid:1011
Authors:Kexin Huang · Ying Jin · Ryan Li · Michael Li · Emmanuel J Candes · Jure Leskovec
Abstract:
Hypotheses are central to information acquisition, decision-making, and discovery. However, many real-world hypotheses are abstract, high-level statements that are difficult to validate directly. This challenge is further intensified by the rise of hypothesis generation from Large Language Models (LLMs), which are prone to hallucination and produce hypotheses in volumes that make manual validation impractical. Here we propose Popper, an agentic framework for rigorous automated validation of free-form hypotheses. Guided by Karl Popper's principle of falsification, Popper validates a hypothesis using LLM agents that design and execute falsification experiments targeting its measurable implications. A novel sequential testing framework ensures strict Type-I error control while actively gathering evidence from diverse observations, whether drawn from existing data or newly conducted procedures. We demonstrate Popper on six domains including biology, economics, and sociology. Popper delivers robust error control, high power, and scalability. Furthermore, compared to human scientists, Popper achieved comparable performance in validating complex biological hypotheses while reducing time by 10-fold, providing a scalable, rigorous solution for hypothesis validation.
Paperid:1012
Authors:Dongze Wu · Yao Xie
Abstract:
Sampling from high-dimensional, multi-modal distributions remains a fundamental challenge across domains such as statistical Bayesian inference and physics-based machine learning. In this paper, we propose Annealing Flow (AF), a method built on Continuous Normalizing Flows (CNFs) for sampling from high-dimensional and multi-modal distributions. AF is trained with a dynamic Optimal Transport (OT) objective incorporating Wasserstein regularization, and guided by annealing procedures, facilitating effective exploration of modes in high-dimensional spaces. Compared to recent NF methods, AF significantly improves training efficiency and stability, with minimal reliance on MC assistance. We demonstrate the superior performance of AF compared to state-of-the-art methods through extensive experiments on various challenging distributions and real-world datasets, particularly in high-dimensional and multi-modal settings. We also highlight AF’s potential for sampling the least favorable distributions.
Paperid:1013
Authors:Wei Shen · Ruida Zhou · Jing Yang · Cong Shen
Abstract:
While transformers have demonstrated impressive capacities for in-context learning (ICL) in practice, theoretical understanding of the underlying mechanism that enables transformers to perform ICL is still in its infancy. This work aims to theoretically study the training dynamics of transformers for in-context classification tasks. We demonstrate that, for in-context classification of Gaussian mixtures under certain assumptions, a single-layer transformer trained via gradient descent converges to a globally optimal model at a linear rate. We further quantify the impact of the training and testing prompt lengths on the ICL inference error of the trained transformer. We show that when the lengths of training and testing prompts are sufficiently large, the prediction of the trained transformer approaches the ground-truth distribution of the labels.
Paperid:1014
Authors:TAK HUR · Daniel Kyungdeock Park
Abstract:
Understanding and improving generalization capabilities is crucial for both classical and quantum machine learning (QML). Recent studies have revealed shortcomings in current generalization theories, particularly those relying on uniform bounds, across both classical and quantum settings. In this work, we present a margin-based generalization bound for QML models, providing a more reliable framework for evaluating generalization. Our experimental studies on the quantum phase recognition dataset demonstrate that margin-based metrics are strong predictors of generalization performance, outperforming traditional metrics like parameter count. By connecting this margin-based metric to quantum information theory, we demonstrate how to enhance the generalization performance of QML through a classical-quantum hybrid approach when applied to classical data.
Paperid:1015
Authors:Tao Feng · Wei Li · Didi Zhu · Hangjie Yuan · Wendi Zheng · Dan Zhang · Jie Tang
Abstract:
Backpropagation provides a generalized configuration for overcoming catastrophic forgetting; for example, SGD and Adam are commonly used for weight updates in continual learning and continual pre-training. In practice, however, permission to access gradient information is not always granted (the gradient ban), such as with black-box APIs, hardware limitations, and non-differentiable systems. To bridge this gap, we introduce ZeroFlow, the first benchmark to evaluate gradient-free optimization algorithms for overcoming forgetting. This benchmark examines a suite of forward-pass methods across multiple optimization algorithms, forgetting scenarios, and datasets. We find that forward passes alone are enough to overcome forgetting. Our findings reveal new optimization principles that highlight the potential of forward passes in mitigating forgetting, managing task conflicts, and reducing memory demands, alongside novel enhancements that further mitigate forgetting with just one forward pass. This work provides essential insights and tools for advancing forward-pass methods to overcome forgetting. Code will be available upon publication.
Paperid:1016
Authors:Xuwei Xu · Yang Li · Yudong Chen · Jiajun LIU · Sen Wang
Abstract:
We reveal that feedforward network (FFN) layers, rather than attention layers, are the primary contributors to Vision Transformer (ViT) inference latency, with their impact intensifying as model size increases. This finding highlights a critical opportunity for optimizing the efficiency of large-scale ViTs by focusing on FFN layers. In this work, we propose a novel channel idle mechanism that facilitates post-training structural reparameterization for efficient FFN layers during testing. Specifically, a set of feature channels remains idle and bypasses the nonlinear activation function in each FFN layer, thereby forming a linear pathway that enables structural reparameterization during inference. This mechanism results in a family of Re-Parameterizable Vision Transformers (RePaViTs), which achieve remarkable latency reductions with acceptable sacrifices (sometimes gains) in accuracy across various ViTs. The benefits of our method scale consistently with model sizes, demonstrating greater speed improvements and progressively narrowing accuracy gaps, or even higher accuracies, on larger models. In particular, RePaViT-Large and RePaViT-Huge enjoy 66.8% and 68.7% speed-ups with +1.7% and +1.1% higher top-1 accuracies under the same training strategy, respectively. To the best of our knowledge, RePaViT is the first to employ structural reparameterization on FFN layers to expedite ViTs, and we believe that it represents an auspicious direction for efficient ViTs.
Paperid:1017
Authors:Kaixuan Zhang · Hu Wang · Minxian Li · Mingwu Ren · Mao Ye · Xiatian Zhu
Abstract:
High Dynamic Range Novel View Synthesis (HDR-NVS) aims to establish a 3D scene HDR model from Low Dynamic Range (LDR) imagery. Typically, multiple-exposure LDR images are employed to capture a wider range of brightness levels in a scene, as a single LDR image cannot represent both the brightest and darkest regions simultaneously. While effective, this multiple-exposure HDR-NVS approach has significant limitations, including susceptibility to motion artifacts (e.g., ghosting and blurring) and high capture and storage costs. To overcome these challenges, we introduce, for the first time, the single-exposure HDR-NVS problem, where only single-exposure LDR images are available during training. We further introduce a novel approach, Mono-HDR-3D, featuring two dedicated modules formulated by the LDR image formation principles: one for converting LDR colors to HDR counterparts, and the other for transforming HDR images to LDR format so that unsupervised learning is enabled in a closed loop. Designed as a meta-algorithm, our approach can be seamlessly integrated with existing NVS models. Extensive experiments show that Mono-HDR-3D significantly outperforms previous methods. Source code will be released.
Paperid:1018
Authors:Nitin Bisht · Xiuwen Gong · Guandong Xu
Abstract:
Although Recommender Systems (RS) have been well-developed for various fields of application, they often suffer from a crisis of platform credibility with respect to RS confidence and fairness, which may drive users away, threatening the platform's long-term success. In recent years, some works have tried to solve these issues; however, they lack strong statistical guarantees. Therefore, there is an urgent need to solve both issues within a unifying framework with robust statistical guarantees. In this paper, we propose a novel and reliable framework called Equitable and Statistically Unbiased Recommendation (ENSUR) to dynamically generate prediction sets for users across various groups, which are guaranteed 1) to include ground-truth items with user-predefined high confidence/probability (e.g., 90\%); 2) to ensure user fairness across different groups; and 3) to have minimal average prediction set sizes for efficiency. We further design an efficient algorithm named Guaranteed User Fairness Algorithm (GUFA) to optimize the proposed method and derive upper bounds of risk and fairness metrics to speed up the optimization process. Moreover, we provide rigorous theoretical analysis concerning risk and fairness control and minimum set size. Extensive experiments validate the effectiveness of the proposed framework, which aligns with our theoretical analysis.
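The per-group coverage guarantee described above is in the spirit of split-conformal calibration. The sketch below is an illustration of that general idea only, not the paper's GUFA algorithm: calibrating a score threshold separately per user group yields prediction sets whose coverage meets a user-predefined level in every group.

```python
import numpy as np

def conformal_threshold(scores_cal, alpha):
    """Split-conformal quantile: with n calibration nonconformity scores,
    the ceil((n+1)(1-alpha))-th smallest score guarantees >= 1-alpha coverage."""
    n = len(scores_cal)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.sort(scores_cal)[min(k, n) - 1]

def prediction_set(item_scores, threshold):
    """All candidate items whose nonconformity score is within the threshold."""
    return {i for i, s in enumerate(item_scores) if s <= threshold}

rng = np.random.default_rng(1)
alpha = 0.1
# Synthetic calibration nonconformity scores for two user groups;
# group g2 is "harder" (larger scores), so it needs a larger threshold.
cal = {"g1": rng.exponential(1.0, 500), "g2": rng.exponential(2.0, 500)}
thr = {g: conformal_threshold(s, alpha) for g, s in cal.items()}
# Empirical coverage on fresh draws from each group's score distribution.
cov = {g: float(np.mean(rng.exponential(1.0 if g == "g1" else 2.0, 5000) <= thr[g]))
       for g in thr}
```

Calibrating per group equalizes coverage (a fairness notion) at the cost of larger sets for harder groups, which is the size/coverage trade-off the abstract's third guarantee addresses.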
Paperid:1019
Authors:Minqin Zhu · Zexu Sun · Ruoxuan Xiong · Anpeng Wu · Baohong Li · Caizhi Tang · JUN ZHOU · Fei Wu · Kun Kuang
Abstract:
Uplift modeling is crucial for identifying individuals likely to respond to a treatment in applications like marketing and customer retention, but evaluating these models is challenging due to the inaccessibility of counterfactual outcomes in real-world settings. In this paper, we identify a fundamental limitation in existing evaluation metrics, such as the uplift and Qini curves, which fail to rank individuals with negative outcomes accurately. This can lead to biased evaluations, where biased models receive higher curve values than unbiased ones, resulting in suboptimal model selection. To address this, we propose the Principled Uplift Curve (PUC), a novel evaluation metric that assigns equal curve values to individuals with both positive and negative outcomes, offering a more balanced and unbiased assessment. We then derive the Principled Uplift Loss (PUL) function from the PUC and integrate it into a new uplift model, the Principled Treatment and Outcome Network (PTONet), to reduce bias during uplift model training. Experiments on both simulated and real-world datasets demonstrate that the PUC provides less biased evaluations, while PTONet outperforms existing methods.
Paperid:1020
Authors:YIQING LI · Yewei Xia · Xiaofei Wang · Zhengming Chen · Liuhua Peng · Mingming Gong · Kun Zhang
Abstract:
Discovering dependence patterns between variables from observational data is a fundamental issue in data analysis. However, existing testing methods often fail to detect subtle yet critical patterns that occur within small regions of the data distribution, patterns we term rare dependence. These rare dependencies obscure the true underlying dependence structure between variables, particularly in causal discovery tasks. To address this issue, we propose a novel testing method that combines kernel-based (conditional) independence testing with adaptive sample importance reweighting. By learning and assigning higher importance weights to data points exhibiting significant dependence, our method amplifies the patterns and can detect them successfully. Theoretically, we analyze the asymptotic distributions of the statistics in this method and show the uniform bound of the learning scheme. Furthermore, we integrate our tests into the PC algorithm, a constraint-based approach for causal discovery, equipping it to uncover causal relationships even in the presence of rare dependence. Empirical evaluation on synthetic and real-world datasets comprehensively demonstrates the efficacy of our method.
Paperid:1021
Authors:xihong yang · Siwei Wang · Fangdi Wang · Jiaqi Jin · Suyuan Liu · Yue Liu · En Zhu · Xinwang Liu · Yueming Jin
Abstract:
Leveraging powerful representation learning capabilities, deep multi-view clustering methods have demonstrated reliable performance in recent years by effectively integrating multi-source information from diverse views. Most existing methods rely on the assumption of clean views. However, noise is pervasive in real-world scenarios, leading to a significant degradation in performance. To tackle this problem, we propose a novel multi-view clustering framework for the automatic identification and rectification of noisy data, termed AIRMVC. Specifically, we reformulate noise identification as an anomaly identification problem using GMM. We then design a hybrid rectification strategy to mitigate the adverse effects of noisy data based on the identification results. Furthermore, we introduce a noise-robust contrastive mechanism to generate reliable representations. Additionally, we provide a theoretical proof demonstrating that these representations can discard noisy information, thereby improving the performance of downstream tasks. Extensive experiments on six benchmark datasets demonstrate that AIRMVC outperforms state-of-the-art algorithms in terms of robustness in noisy scenarios.
Paperid:1022
Authors:Jinwu Hu · Zitian Zhang · Guohao Chen · Xutao Wen · Chao Shuai · Wei Luo · Bin Xiao · Yuanqing Li · Mingkui Tan
Abstract:
While Large Language Models (LLMs) have exhibited remarkable emergent capabilities through extensive pretraining, they still face critical limitations in generalizing to specialized domains and handling diverse linguistic variations, known as distribution shifts. In this paper, we propose a Test-Time Learning (TTL) paradigm for LLMs, namely TLM, which dynamically adapts LLMs to target domains using only unlabeled test data during testing. Specifically, we first provide empirical evidence and theoretical insights to reveal that more accurate predictions from LLMs can be achieved by minimizing the input perplexity of the unlabeled test data. Based on this insight, we formulate the Test-Time Learning process of LLMs as input perplexity minimization, enabling self-supervised enhancement of LLM performance. Furthermore, we observe that high-perplexity samples tend to be more informative for model optimization. Accordingly, we introduce a Sample Efficient Learning Strategy that actively selects and emphasizes these high-perplexity samples for test-time updates. Lastly, to mitigate catastrophic forgetting and ensure adaptation stability, we adopt Low-Rank Adaptation (LoRA) instead of full-parameter optimization, which allows lightweight model updates while preserving more original knowledge from the model. We introduce the AdaptEval benchmark for TTL and demonstrate through experiments that TLM improves performance by at least 20% compared to original LLMs on domain knowledge adaptation.
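The high-perplexity sample selection above can be sketched in toy form: compute per-sample perplexity from token log-probabilities and keep the most surprising samples for test-time updates. The log-probabilities below are made up for illustration; in practice they would come from the LLM being adapted, and the selected samples would drive LoRA updates.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean token log-likelihood); higher means more surprising."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def select_high_perplexity(samples, k):
    """Keep the k samples the model finds most surprising; under the TTL view
    these are the most informative for test-time updates."""
    ranked = sorted(samples, key=lambda s: perplexity(s["logprobs"]), reverse=True)
    return ranked[:k]

# Hypothetical per-token log-probabilities for three unlabeled test samples.
samples = [
    {"id": "easy",   "logprobs": [-0.1, -0.2, -0.1]},
    {"id": "medium", "logprobs": [-1.0, -1.2, -0.8]},
    {"id": "hard",   "logprobs": [-3.0, -2.5, -2.8]},
]
chosen = select_high_perplexity(samples, k=2)
```

Minimizing input perplexity on such selected samples is the self-supervised objective the abstract formulates; no labels are needed at test time.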
Paperid:1023
Authors:Anji Liu · Xuejie Liu · Dayuan Zhao · Mathias Niepert · Yitao Liang · Guy Van den Broeck
Abstract:
Non-autoregressive (NAR) generative models are valuable because they can handle diverse conditional generation tasks in a more principled way than their autoregressive (AR) counterparts, which are constrained by sequential dependency requirements. Recent advancements in NAR models, such as diffusion language models, have demonstrated superior performance in unconditional generation compared to AR models (e.g., GPTs) of similar sizes. However, such improvements do not always lead to improved conditional generation performance. We show that a key reason for this gap is the difficulty in generalizing to conditional probability queries unseen during training. As a result, strong unconditional generation performance does not guarantee high-quality conditional generation. This paper proposes Tractable Transformers (Tracformer), a Transformer-based generative model that is more robust to different conditional generation tasks. Unlike existing models that rely solely on global contextual features derived from full inputs, Tracformers incorporate a sparse Transformer encoder to capture both local and global contextual information. This information is routed through a decoder for conditional generation. Empirical results demonstrate that Tracformers achieve state-of-the-art conditional generation performance on text modeling compared to recent diffusion and AR model baselines.
Paperid:1024
Authors:Yu He · Ellen Vitercik
Abstract:
Neural Algorithmic Reasoning (NAR) trains neural networks to simulate classical algorithms, enabling structured and interpretable reasoning over complex data. While prior research has predominantly focused on learning exact algorithms for polynomial-time-solvable problems, extending NAR to harder problems remains an open challenge. In this work, we introduce a general NAR framework grounded in the primal-dual paradigm, a classical method for designing efficient approximation algorithms. By leveraging a bipartite representation between primal and dual variables, we establish an alignment between primal-dual algorithms and Graph Neural Networks. Furthermore, we incorporate optimal solutions from small instances to greatly enhance the model’s reasoning capabilities. Our empirical results demonstrate that our model not only simulates but also outperforms approximation algorithms for multiple tasks, exhibiting robust generalization to larger and out-of-distribution graphs. Moreover, we highlight the framework’s practical utility by integrating it with commercial solvers and applying it to real-world datasets. Our code can be found at \url{https://anonymous.4open.science/r/pdnar}.
Paperid:1025
Authors:Ankit Pratap Singh · Ahmed Abbasi · Namrata Vaswani
Abstract:
This work has two contributions. First, we introduce provably secure (Byzantine-resilient), sample- and communication-efficient alternating gradient descent (GD) and minimization based algorithms for solving the federated low-rank matrix completion (LRMC) problem. This involves learning a low-rank (LR) matrix from a small subset of its entries. Second, we extend our ideas to show how a special case of our algorithms also solves another partly-decoupled vertically federated LR matrix learning problem, that is, LR column-wise sensing (LRCS), also referred to as multi-task linear representation learning in the literature. Finally, we also show how our results can be extended to the LR phase retrieval problem. In all problems, we consider column-wise or vertical federation, i.e., each node observes a small subset of entries of a disjoint column sub-matrix of the entire LR matrix. For the LRMC problem, horizontal federation is equivalent since the problem is symmetric across rows and columns; for the other two it is not. In all problems, the data at different nodes is heterogeneous (not identically distributed), making it harder to obtain provable guarantees.
Paperid:1026
Authors:Annabelle Carrell · Albert Gong · Abhishek Shetty · Raaz Dwivedi · Lester Mackey
Abstract:
The goal in thinning is to summarize a dataset using a small set of representative points. Remarkably, sub-Gaussian thinning algorithms like Kernel Halving and Compress can match the quality of uniform subsampling while substantially reducing the number of summary points. However, existing guarantees cover only a restricted range of distributions and kernel-based quality measures and suffer from pessimistic dimension dependence. To address these deficiencies, we introduce a new low-rank analysis of sub-Gaussian thinning that applies to any distribution and any kernel, guaranteeing high-quality compression whenever the kernel or data matrix is approximately low-rank. To demonstrate the broad applicability of the techniques, we design practical sub-Gaussian thinning approaches that improve upon the best known guarantees for approximating attention in transformers, accelerating stochastic gradient training through reordering, and distinguishing distributions in near-linear time.
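Kernel Halving and Compress themselves are more involved; as a hedged stand-in, the sketch below uses greedy kernel herding to illustrate what thinning optimizes: a small summary whose kernel mean embedding stays close to that of the full dataset (small MMD). All names and parameters here are illustrative assumptions, not the paper's algorithms.

```python
import numpy as np

def gaussian_kernel(X, Y, bw=1.0):
    """Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 bw^2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw ** 2))

def herd(X, m, bw=1.0):
    """Greedy kernel herding: pick m points whose empirical kernel mean
    tracks the full-data kernel mean (a simple stand-in for thinning)."""
    K = gaussian_kernel(X, X, bw)
    mean_embed = K.mean(axis=1)            # kernel mean embedding at each point
    chosen, running = [], np.zeros(len(X))
    for t in range(m):
        score = mean_embed - running / (t + 1)
        score[chosen] = -np.inf            # sample without replacement
        i = int(np.argmax(score))
        chosen.append(i)
        running += K[:, i]
    return chosen

def mmd2(X, idx, bw=1.0):
    """Squared MMD between the summary X[idx] and the full set X."""
    K = gaussian_kernel(X, X, bw)
    S = np.array(idx)
    return K.mean() - 2 * K[:, S].mean() + K[np.ix_(S, S)].mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
herd_idx = herd(X, 20)
unif_idx = list(rng.choice(400, 20, replace=False))   # uniform-subsampling baseline
```

A well-chosen summary of this kind typically achieves a much smaller MMD than uniform subsampling at the same size, which is the gap the abstract's guarantees quantify.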
Paperid:1027
Authors:Yang Yang · Haonan Xu
Abstract:
Out-of-distribution (OOD) detection aims to ensure AI system reliability by rejecting inputs outside the training distribution. Recent work shows that memorizing atypical samples during later stages of training can hurt OOD detection, while strategies for forgetting them show promising improvements. However, directly forgetting atypical samples sacrifices ID generalization and limits the model's OOD detection capability. To address this issue, we propose the Progressive Self-Knowledge Distillation (PSKD) framework, which strengthens OOD detection capability by leveraging self-provided uncertainty-embedded targets. Specifically, PSKD adaptively selects a self-teacher model from the training history using pseudo-outliers, facilitating the learning of uncertainty knowledge via multi-level distillation applied to features and responses. As a result, PSKD achieves better ID generalization and uncertainty estimation for OOD detection. Moreover, PSKD is orthogonal to most existing methods and can be integrated as a plug-in to collaborate with them. Extensive experiments verify the effectiveness of PSKD. The code is available at https://anonymous.4open.science/r/PSKD-6AF6.
Paperid:1028
Authors:Changdae Oh · zhen fang · Shawn Im · Xuefeng Du · Sharon Li
Abstract:
Multimodal large language models (MLLMs) have shown promising capabilities but struggle under distribution shifts, where evaluation data differ from instruction tuning distributions. Although previous works have provided empirical evaluations, we argue that establishing a formal framework that can characterize and quantify the risk of MLLMs is necessary to ensure their safe and reliable application in the real world. By taking an information-theoretic perspective, we propose the first theoretical framework that enables the quantification of the maximum risk of MLLMs under distribution shifts. Central to our framework is the introduction of Effective Mutual Information (EMI), a principled metric that quantifies the relevance between input queries and model responses. We derive an upper bound for the EMI difference between in-distribution (ID) and out-of-distribution (OOD) data, connecting it to visual and textual distributional discrepancies. Extensive experiments on real benchmark datasets, spanning 61 shift scenarios, empirically validate our theoretical insights.
Paperid:1029
Authors:Peter Chang · Keyon Vafa · Ashesh Rambachan · Sendhil Mullainathan
Abstract:
Foundation models are premised on the idea that sequence prediction can uncover deeper domain understanding, much like how Kepler's predictions of planetary motion led to the discovery of fundamental physical laws. However, evaluating whether these models truly capture underlying principles remains a challenge. We present a framework for testing foundation models by analyzing how they adapt to synthetic tasks that share mechanisms with their training domain. We introduce metrics that measure a model's inductive bias toward known models of reality. Across multiple domains, we find that models can excel at their training tasks yet fail to develop inductive biases toward the true mechanisms when adapted to new tasks. For instance, models trained on orbital trajectories fail to consistently apply Newtonian mechanics when adapted to new physics tasks. Our analysis reveals that rather than learning parsimonious representations that accord with reality, these models often develop task-specific heuristics that don't generalize.
Paperid:1030
Authors:Shijian Zheng · Fangxiao Jin · Shuhai Zhang · Quan Xue · Mingkui Tan
Abstract:
Electromagnetic structure (EMS) design plays a critical role in developing advanced antennas and materials, but remains challenging due to high-dimensional design spaces and expensive evaluations. While existing methods commonly employ high-quality predictors or generators to alleviate evaluations, they are often data-intensive and struggle with real-world scale and budget constraints. To address this, we propose a novel method called Progressive Quadtree-based Search (PQS). Rather than exhaustively exploring the high-dimensional space, PQS converts the conventional image-like layout into a quadtree-based hierarchical representation, enabling a progressive search from global patterns to local details. Furthermore, to lessen reliance on highly accurate predictors, we introduce a consistency-driven sample selection mechanism. This mechanism quantifies the reliability of predictions, balancing exploitation and exploration when selecting candidate designs. We evaluate PQS on two real-world engineering tasks, i.e., Dual-layer Frequency Selective Surface and High-gain Antenna design. Experimental results show that our method can achieve satisfactory designs under limited computational budgets, outperforming baseline methods. In particular, compared to generative approaches, it cuts evaluation costs by 75∼85%, effectively saving 20∼48 days of the product design cycle.
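The quadtree conversion of an image-like layout can be sketched as follows. This is a generic quadtree over a binary grid, assumed here purely for illustration (PQS's actual representation and search are more elaborate): uniform blocks collapse into single leaves, so coarse global patterns can be refined progressively into local details.

```python
def build_quadtree(grid, r0, c0, size):
    """Quadtree of a binary, power-of-two square layout: a uniform block
    collapses to a single leaf; otherwise recurse into four quadrants."""
    vals = {grid[r][c] for r in range(r0, r0 + size) for c in range(c0, c0 + size)}
    if len(vals) == 1:
        return vals.pop()                          # leaf: 0 or 1
    h = size // 2
    return [build_quadtree(grid, r0, c0, h),           # top-left
            build_quadtree(grid, r0, c0 + h, h),       # top-right
            build_quadtree(grid, r0 + h, c0, h),       # bottom-left
            build_quadtree(grid, r0 + h, c0 + h, h)]   # bottom-right

def count_leaves(node):
    """Number of leaves, i.e., the size of the hierarchical representation."""
    return 1 if not isinstance(node, list) else sum(count_leaves(c) for c in node)

# Toy 4x4 layout: three uniform quadrants collapse, one is refined.
layout = [[0, 0, 1, 1],
          [0, 0, 1, 1],
          [0, 1, 0, 0],
          [1, 0, 0, 0]]
tree = build_quadtree(layout, 0, 0, 4)
```

Here 16 cells compress into 7 leaves; searching over such trees lets a method commit to coarse structure first and spend its evaluation budget only where detail matters.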
Paperid:1031
Authors:Matthew Faw · Constantine Caramanis · Jessica Hoffmann
Abstract:
Unconscious bias has been shown to influence how we assess our peers, with consequences for hiring, promotions, and admissions. In this work, we focus on affinity bias, the component of unconscious bias which leads us to prefer people who are similar to us, despite no deliberate intention of favoritism. In a world where the people hired today become part of the hiring committee of tomorrow, we are particularly interested in understanding (and mitigating) how affinity bias affects this feedback loop. This problem has two distinctive features: 1) we only observe the _biased value_ of a candidate, but we want to optimize with respect to their _real value_; 2) the bias towards a candidate with a specific set of traits depends on the _fraction_ of people in the hiring committee with the same set of traits. We introduce a new bandits variant that exhibits those two features, which we call affinity bandits. Unsurprisingly, classical algorithms such as UCB often fail to identify the best arm in this setting. We prove a new instance-dependent regret lower bound, which is larger than that in the standard bandit setting by a multiplicative function of $K$. Since we treat rewards that are _time-varying_ and _dependent on the policy’s past actions_, deriving this lower bound required developing proof techniques beyond the standard bandit techniques. Finally, we design an elimination-style algorithm which nearly matches this regret, despite never observing the real rewards.
Paperid:1032
Authors:Nona Rajabi · Antonio Ribeiro · Miguel Vasco · Farzaneh Taleb · Mårten Björkman · Danica Kragic
Abstract:
Decoding visual images from brain activity has significant potential for advancing brain-computer interaction and enhancing the understanding of human perception. Recent approaches align the representation spaces of images and brain activity to enable visual decoding. In this paper, we introduce the use of human-aligned image encoders to map brain signals to images. We hypothesize that these models more effectively capture perceptual attributes associated with the rapid visual stimuli presentations commonly used in visual brain data recording experiments. Our empirical results support this hypothesis, demonstrating that this simple modification improves image retrieval accuracy by up to 21\% compared to state-of-the-art methods. Comprehensive experiments confirm consistent performance improvements across diverse EEG architectures, image encoders, alignment methods, participants, and brain imaging modalities.
Paperid:1033
Authors:Yuxin Zhou · Zheng Li · Jun Zhang · Jue Wang · Yiping Wang · Zhongle Xie · Ke Chen · Lidan Shou
Abstract:
With the widespread adoption of Mixture-of-Experts (MoE) models, there is a growing demand for efficient inference on memory-constrained devices. While offloading expert parameters to CPU memory and loading activated experts on demand has emerged as a potential solution, the large size of activated experts overburdens the limited PCIe bandwidth, hindering effectiveness in latency-sensitive scenarios. To mitigate this, we propose FloE, an on-the-fly MoE inference system for memory-constrained GPUs. FloE is built on the insight that there exists substantial untapped redundancy within sparsely activated experts. It employs various compression techniques on the experts' internal parameter matrices to reduce the data movement load, combined with low-cost sparse prediction, achieving perceptible inference acceleration in wall-clock time on resource-constrained devices. Empirically, FloE achieves a 9.3$\times$ compression of parameters per expert in Mixtral-8$\times$7B; enables deployment on a GPU with only 11GB VRAM, reducing the memory footprint by up to 8.5$\times$; and delivers a 48.7$\times$ inference speedup compared to DeepSpeed-MII on a single GeForce RTX 3090—all with only a 4.4\% $\sim$ 7.6\% average performance degradation.
Paperid:1034
Authors:Haitao Wu · Weiwei Li · Xiuyi Jia
Abstract:
Label distribution learning (LDL) is a novel learning paradigm that emulates label polysemy by assigning label distributions over the label space. However, recent LDL work seems to exhibit a notable contradiction: 1) existing LDL methods employ auxiliary tasks to enhance performance, which narrows their focus to specific applications, thereby lacking generalizability; 2) conversely, LDL methods without auxiliary tasks rely on losses tailored solely to the primary task, lacking additional beneficial data to guide the learning process. In this paper, we propose $\mathcal{S}$-LDL, a novel and minimalist solution that generates subtask label distributions, i.e., a form of extra supervised information, to reconcile the above contradiction. $\mathcal{S}$-LDL encompasses two key aspects: 1) an algorithm capable of generating subtasks without any prior/expert knowledge; and 2) a plug-and-play framework seamlessly compatible with existing LDL methods, and even adaptable to derivative tasks of LDL. Our analysis and experiments demonstrate that $\mathcal{S}$-LDL is effective and efficient. To the best of our knowledge, this paper represents the first endeavor to address LDL via subtasks. The code will soon be available on GitHub to facilitate reproducible research.
Paperid:1035
Authors:Chang Liu · Yixin Wang · Moontae Lee
Abstract:
Efficient access to high-quality information is vital for online platforms. To promote more useful information, users not only create new content, but also evaluate existing content, often through helpfulness voting. While aggregated votes help service providers rank their user content, these votes are often biased by disparate accessibility across positions and the cascaded influence of prior votes. For fairer assessment of information quality, we propose the Counterfactual Voting Adjustment (CVA), a causal framework that accounts for the context in which individual votes are cast. Through preliminary and semi-synthetic experiments, we show that CVA effectively models the position and herding biases, accurately recovering the predefined content quality. In a real-data experiment, we demonstrate that reranking content based on the quality learned by CVA exhibits stronger alignment with both user sentiment and quality evaluation assessed by GPT-4o, outperforming system rankings based on aggregated votes and model-based rerankings without causal inference. Beyond individual quality inference, our embeddings offer comparative insights into the behavioral dynamics of expert user groups across 120 major StackExchange communities.
Paperid:1036
Authors:Armin Kekić · Sergio Garrido Mejia · Bernhard Schölkopf
Abstract:
Estimating causal effects of joint interventions on multiple variables is crucial in many domains, but obtaining data from such simultaneous interventions can be challenging. Our study explores how to learn joint interventional effects using only observational data and single-variable interventions. We present an identifiability result for this problem, showing that for a class of nonlinear additive outcome mechanisms, joint effects can be inferred without access to joint interventional data. We propose a practical estimator that decomposes the causal effect into confounded and unconfounded contributions for each intervention variable. Experiments on synthetic data demonstrate that our method achieves performance comparable to models trained directly on joint interventional data, outperforming a purely observational estimator.
Paperid:1037
Authors:Shurui Gui · Xiner Li · Shuiwang Ji
Abstract:
We consider learning underlying laws of dynamical systems governed by ordinary differential equations (ODE). A key challenge is how to discover intrinsic dynamics across multiple environments while circumventing environment-specific mechanisms. Unlike prior work, we tackle more complex environments where changes extend beyond function coefficients to entirely different function forms. For example, we demonstrate the discovery of an ideal pendulum's natural motion $\alpha^2 \sin{\theta_t}$ by observing pendulum dynamics in different environments, such as the damped environment $\alpha^2 \sin(\theta_t) - \rho \omega_t$ and powered environment $\alpha^2 \sin(\theta_t) + \rho \frac{\omega_t}{\left|\omega_t\right|}$. Here, we formulate this problem as an *invariant function learning* task and propose a new method, known as **D**isentanglement of **I**nvariant **F**unctions (DIF), that is grounded in causal analysis. We propose a causal graph and design an encoder-decoder hypernetwork that explicitly disentangles invariant functions from environment-specific dynamics. The discovery of invariant functions is guaranteed by our information-based principle that enforces the independence between extracted invariant functions and environments. Quantitative comparisons with meta-learning and invariant learning baselines on three ODE systems demonstrate the effectiveness and efficiency of our method. Furthermore, symbolic regression explanation results highlight the ability of our framework to uncover intrinsic laws.
Paperid:1038
Authors:Pierre-Louis Cauvin · Davide Legacci · Panayotis Mertikopoulos
Abstract:
In this paper, we investigate how randomness and uncertainty influence learning in games. Specifically, we examine a perturbed variant of the dynamics of "follow-the-regularized-leader" (FTRL), where the players' payoff observations and strategy updates are continually affected by random shocks. Our findings reveal that, in a fairly precise sense, "uncertainty favors extremes": in any game, regardless of the noise level, every player's trajectory of play reaches an arbitrarily small neighborhood of a pure strategy in finite time (which we estimate). Moreover, even if the player does not settle at this strategy, they return arbitrarily close to some (possibly different) pure strategy infinitely often. This prompts the question of which sets of pure strategies emerge as robust predictions of learning in the presence of noise and uncertainty. In this regard, we show that (a) the only possible limits of the FTRL dynamics under uncertainty are pure Nash equilibria; and (b) a span of pure strategies is stable and attracting if and only if it is closed under better replies. Finally, we specialize our analysis to games where the dynamics are recurrent in the deterministic setting, such as zero-sum games with an interior equilibrium. In this case, we show that the stochastic dynamics drift toward the boundary on average, thus disrupting the quasi-periodic behavior observed in the noiseless, deterministic regime.
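The abstract describes the perturbed FTRL dynamics only at a high level; the paper works in continuous time. As a purely illustrative discrete-time sketch (the regularizer choice, step size, and noise level below are hypothetical), a noisy exponential-weights FTRL step might look like this:

```python
import numpy as np

def noisy_ftrl_step(score, payoff, eta, sigma, rng):
    """One FTRL step with entropic regularization (exponential weights),
    where the payoff observation is perturbed by Gaussian noise."""
    noisy_payoff = payoff + sigma * rng.standard_normal(payoff.shape)
    score = score + eta * noisy_payoff       # accumulate (noisy) scores
    z = np.exp(score - score.max())          # numerically stable softmax
    return score, z / z.sum()

rng = np.random.default_rng(0)
score = np.zeros(3)
payoffs = rng.uniform(size=(500, 3))         # toy payoff stream
for t in range(500):
    score, strategy = noisy_ftrl_step(score, payoffs[t], eta=0.1,
                                      sigma=0.5, rng=rng)
```

After each step, `strategy` is a valid mixed strategy on the simplex; the paper's result concerns how such trajectories behave near pure strategies under the noise.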
Paperid:1039
Authors:Jicong Fan
Abstract:
Measuring the distance or similarity between graphs is the foundation of many graph analysis tasks such as graph clustering but remains a challenge on large datasets. In this work, we treat the adjacency matrices of two graphs as two kernel matrices given by some unknown indefinite kernel function performed on two discrete distributions and define the distance between the two distributions as a measure, called MMFD, of the dissimilarity between two graphs. We show that MMFD is a pseudo metric. Although the initial definition of MMFD seems complex, we show that it has a closed-form solution with extremely simple computation. To further improve the efficiency of large-scale clustering, we propose MMFD-KM, which has linear space and time complexity with respect to the number of graphs. We also provide a generalization of MMFD, called MFD, which is more effective in exploiting the information of factors of adjacency matrices. The experiments on simulated graphs intuitively show that our methods are effective in comparing graphs. The experiments on real-world datasets demonstrate that, compared to the competitors, our methods have much better clustering performance in terms of three evaluation metrics and time cost.
Paperid:1040
Authors:Walid Durani · Tobias Nitzl · Claudia Plant · Christian Böhm
Abstract:
Detecting anomalies with limited supervision is challenging due to the scarcity of labeled anomalies, which often fail to capture the diversity of abnormal behaviors. We propose Weakly Supervised Anomaly Detection via Dual-Tailed Kernel (WSAD-DT), a novel framework that learns robust latent representations to distinctly separate anomalies from normal samples under weak supervision. WSAD-DT introduces two centroids, one for normal samples and one for anomalies, and leverages a dual-tailed kernel scheme: a light-tailed kernel to compactly model in-class points and a heavy-tailed kernel to maintain a wider margin against out-of-class instances. To preserve intra-class diversity, WSAD-DT incorporates kernel-based regularization, encouraging richer representations within each class. Furthermore, we devise an ensemble strategy that partitions unlabeled data into diverse subsets, while sharing the limited labeled anomalies among these partitions to maximize their impact. Empirically, WSAD-DT achieves state-of-the-art performance on several challenging anomaly detection benchmarks, outperforming leading ensemble-based methods such as XGBOD.
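The abstract does not spell out the kernel choices; one common reading of a dual-tailed scheme pairs a light-tailed Gaussian kernel with a heavy-tailed Cauchy-style kernel. A toy sketch under that assumption (the centroids, latent point, and scoring rule are all illustrative, not the paper's loss):

```python
import numpy as np

def light_tail(d2, gamma=1.0):
    # Gaussian kernel: decays fast, models in-class points compactly
    return np.exp(-gamma * d2)

def heavy_tail(d2, gamma=1.0):
    # Cauchy-style kernel: decays slowly, keeps a wide out-of-class margin
    return 1.0 / (1.0 + gamma * d2)

z = np.array([0.5, -0.2])                    # a latent representation
c_norm, c_anom = np.zeros(2), np.ones(2)     # normal / anomaly centroids
d_norm = np.sum((z - c_norm) ** 2)
d_anom = np.sum((z - c_anom) ** 2)
# pull toward the own centroid (light tail), push from the other (heavy tail)
score = light_tail(d_norm) - heavy_tail(d_anom)
```

The key property is the tail contrast: at large squared distances the Gaussian kernel vanishes much faster than the Cauchy kernel, so out-of-class points keep exerting a repulsive margin.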
Paperid:1041
Authors:Chengyuan Li · Liangxiao Jiang · Wenjun Zhang · Liangjun Yu · Huan Zhang
Abstract:
Due to its simplicity, effectiveness and robustness, naive Bayes (NB) has continued to be one of the top 10 data mining algorithms. To improve its performance, a large number of improved algorithms have been proposed in the last few decades. However, beyond Gaussian naive Bayes (GNB), there is little work on numerical attributes. At the same time, none of these algorithms takes into account the correlations among instances. To fill this gap, we propose a novel algorithm called instance correlation graph-based naive Bayes (ICGNB). Specifically, it first uses original attributes to construct an instance correlation graph (ICG) to represent the correlations among instances. Then, it employs a variational graph auto-encoder (VGAE) to generate new attributes from the constructed ICG and uses them to augment the original attributes. Finally, it weights each augmented attribute to alleviate attribute redundancy and builds GNB on the weighted attributes. The experimental results on tens of datasets show that ICGNB significantly outperforms its competitors.
Paperid:1042
Authors:Conor Snedeker · Xinyu Zhou · Raef Bassily
Abstract:
We study model personalization under user-level differential privacy (DP) in the shared representation framework. In this problem, there are $n$ users whose data is statistically heterogeneous and their optimal parameters $(w_1^*,\dots, w_n^*)\in \mathbb{R}^{d\times n}$ share an unknown embedding $U^* \in\mathbb{R}^{d\times k}$, where $k\ll d$; i.e., $w_i^* = U^* v_i^*, i \in [n]$, where $v_i^*\in \mathbb{R}^k$ is a low-dimensional representation of $w_i^*$. Our goal is to privately recover the embedding $U^*$ and the local vectors $V^* = [v_1^*, \ldots, v_n^*]^\top$ with small excess risk in the federated setting. We propose a private, efficient federated learning algorithm to learn the shared embedding based on the FedRep algorithm in (Collins et al., 2021). Unlike (Collins et al., 2021), our algorithm satisfies differential privacy and our results hold for the case of noisy labels. In contrast to prior work on private model personalization (Jain et al., 2021), our utility guarantees hold under a larger class of users' distributions (sub-Gaussian instead of Gaussian distributions). Additionally, in the regime where $\max_{i\in [n]} \lVert v_i^* \rVert_2$ and the condition number of $ \frac{1}{\sqrt{n}}V^*$ are $\widetilde{O}(1)$, we improve the privacy error term in (Jain et al., 2021) by a factor of $\widetilde{O}(dk)$. Next, we consider the binary classification setting. We present an information-theoretic construction to privately learn the shared embedding and derive a margin-based accuracy guarantee that is independent of $d$. Our method utilizes the Johnson-Lindenstrauss transform to reduce the effective dimensions of the shared embedding and the users' data. This result shows that dimension-independent risk bounds are possible in this setting under a margin loss.
Paperid:1043
Authors:Jongin Lim · Sucheol Lee · Daeho Um · Sung-Un Park · Jinwoo Shin
Abstract:
Data imbalance is a prominent, yet longstanding challenge in most real-world machine learning scenarios. However, prior studies on data imbalance have primarily focused on classification setups, leaving imbalanced regression underexplored despite its importance in many applications. To bridge this gap, we introduce Proxy-based Representation learning for IMbalanced rEgression (PRIME). The key idea of PRIME is to utilize proxies as global reference points for a balanced and well-ordered feature distribution and to align sample features with these proxies. Unlike previous representation learning methods that rely solely on sample relationships within a batch, PRIME leverages rich sample-proxy relationships, offering holistic guidance for learning the desired representations. Moreover, aligning each feature with its corresponding proxy resembles classification, allowing the seamless application of any loss-based class imbalance techniques to promote balanced feature learning. Extensive experiments validate the effectiveness and broad applicability of PRIME, demonstrating state-of-the-art performance on four real-world regression benchmarks across diverse target domains.
Paperid:1044
Authors:Andong Wang · Yuning Qiu · Zhong Jin · Guoxu Zhou · Qibin Zhao
Abstract:
Tensor regression is a vital tool for analyzing complex multidimensional data in fields like neuroimaging and spatiotemporal analysis. However, its effectiveness is often constrained by limited sample sizes, where the scarcity of observations hinders accurate parameter estimation. To overcome this limitation, we adopt a transfer learning strategy that leverages knowledge from related source tasks to enhance performance in data-scarce target tasks. However, this approach brings additional challenges, such as model shifts, covariate shifts, and decentralized data management, which complicate effective knowledge transfer. We propose the Low-Rank Tensor Transitions (LoRT) framework to systematically address these challenges. LoRT employs a novel fusion regularizer to enforce low-tubal-rank solutions, enabling effective integration of source and target knowledge while adapting to distributional shifts. Its two-step refinement process mitigates model shifts and ensures robust adaptation to target tasks. To support decentralized scenarios, we extend LoRT to D-LoRT, a distributed variant that preserves statistical efficiency while minimizing communication overhead. Theoretical analysis establishes the conditions for effective knowledge transfer in LoRT, and experiments on tensor regression tasks, including compressed sensing and completion, validate its robustness and versatility. These results position LoRT as a robust solution to the challenges of tensor regression with limited data and complex distributions.
Paperid:1045
Authors:Mikel Malagón · Josu Ceberio · Jose A Lozano
Abstract:
Advances in large models, reinforcement learning, and open-endedness have accelerated progress toward autonomous agents that can learn and interact in the real world. To achieve this, flexible tools are needed to create rich, yet computationally efficient, environments. While scalable 2D environments fail to address key real-world challenges like 3D navigation and spatial reasoning, more complex 3D environments are computationally expensive and lack features like customizability and multi-agent support. This paper introduces Craftium, a highly customizable and easy-to-use platform for building rich 3D single- and multi-agent environments. We showcase environments of different complexity and nature: from single- and multi-agent tasks to vast worlds with many creatures and biomes and customizable procedural task generators. Benchmarking shows that Craftium significantly reduces the computational cost of alternatives of similar richness, running more than 2K steps per second faster than Minecraft-based frameworks.
Paperid:1046
Authors:Zhihe Yang · Yunjian Xu · Yang Zhang
Abstract:
Safe offline reinforcement learning (RL), which aims to learn the safety-guaranteed policy without risky online interaction with environments, has attracted growing recent attention for safety-critical scenarios. However, existing approaches encounter out-of-distribution problems during the testing phase, which can result in potentially unsafe outcomes. This issue arises due to the infinite possible combinations of reward-related and cost-related states. In this work, we propose State Decoupling with Q-supervised Contrastive representation (SDQC), a novel framework that decouples the global observations into reward- and cost-related representations for decision-making, thereby improving the generalization capability for unfamiliar global observations. Compared with the classical representation learning methods, which typically require model-based estimation (e.g., bisimulation), we theoretically prove that our Q-supervised method generates a coarser representation while preserving the optimal policy, resulting in improved generalization performance. Experiments on DSRL benchmark problems provide compelling evidence that SDQC surpasses other baseline algorithms, especially for its exceptional ability to achieve almost zero violations in more than half of the tasks, while the state-of-the-art algorithm can only achieve the same level of success in a quarter of the tasks. Further, we demonstrate that SDQC possesses superior generalization ability when confronted with unseen environments.
Paperid:1047
Authors:I-Chun Chen · Hsu-Shen Liu · Wei-Fang Sun · Chen-Hao Chao · Yen-Chang Hsu · Chun-Yi Lee
Abstract:
Sparse Mixture-of-Experts (SMoE) models represent a significant advancement in large language model (LLM) development through their efficient parameter utilization. These models achieve substantial performance improvements at reduced inference costs. However, the deployment of SMoE models faces constraints from extensive memory requirements of expert components in resource-limited environments. To address these limitations, this paper introduces Hierarchical Clustering for Sparsely activated Mixture of Experts (HC-SMoE), a task-agnostic expert merging framework for parameter reduction without retraining. HC-SMoE introduces a novel hierarchical clustering approach based on expert outputs to ensure merging robustness independent of routing decisions. The proposed output-based clustering method enables effective capture of functional relationships between experts for large-scale architectures. We provide theoretical analysis and comprehensive evaluations across multiple zero-shot language tasks to demonstrate HC-SMoE’s effectiveness in state-of-the-art models including Qwen and Mixtral. The experimental results validate HC-SMoE’s superior performance and practical applicability for real-world deployments.
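As a rough illustration of output-based expert clustering (not the paper's actual algorithm; the linkage rule, calibration inputs, and merging-by-averaging below are all assumptions), one could agglomerate experts by the similarity of their outputs rather than their weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d, n_tokens, k = 8, 16, 64, 4
experts = rng.standard_normal((n_experts, d, d))   # toy expert weight matrices
calib = rng.standard_normal((n_tokens, d))         # calibration inputs

# Signature of each expert = its *outputs* on calibration data,
# which is independent of any routing decision.
sig = np.stack([calib @ E.T for E in experts]).reshape(n_experts, -1)

# Naive agglomeration down to k clusters: repeatedly merge the closest pair
clusters = [[i] for i in range(n_experts)]
while len(clusters) > k:
    centers = np.stack([sig[c].mean(axis=0) for c in clusters])
    dist = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    i, j = sorted(np.unravel_index(dist.argmin(), dist.shape))
    clusters[i].extend(clusters.pop(j))

# Each merged expert is the average of its cluster's weights
merged = np.stack([experts[c].mean(axis=0) for c in clusters])
```

The point of clustering on `sig` rather than on the raw weights is that two experts computing similar functions get merged even if their parameters differ.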
Paperid:1048
Authors:Xinyue Zheng · Haowei Lin · Kaichen He · Zihao Wang · Zilong Zheng · Qiang Fu · Haobo Fu · Yitao Liang
Abstract:
Developing AI agents capable of interacting with open-world environments to solve diverse tasks is a compelling challenge. However, evaluating such open-ended agents remains difficult, with current benchmarks facing scalability limitations. To address this, we introduce \textit{Minecraft Universe} (MCU), a comprehensive evaluation framework set within the open-world video game Minecraft. MCU incorporates three key components: (1) an expanding collection of 3,452 composable atomic tasks that encompasses 11 major categories and 41 subcategories of challenges; (2) a task composition mechanism capable of generating infinite diverse tasks with varying difficulty; and (3) a general evaluation framework that achieves 91.5\% alignment with human ratings for open-ended task assessment. Empirical results reveal that even state-of-the-art foundation agents struggle with the increasing diversity and complexity of tasks. These findings highlight the necessity of MCU as a robust benchmark to drive progress in AI agent development within open-ended environments.
Paperid:1049
Authors:Mengmeng Chen · Xiaohu Wu · QIQI LIU · Tiantian He · Yew Soon ONG · Yaochu Jin · Qicheng Lao · Han Yu
Abstract:
Multi-objective optimization (MOO) arises extensively in machine learning and aims to find a set of Pareto-optimal solutions, called the Pareto front; e.g., it is fundamental for multiple avenues of research in federated learning (FL). Pareto-Front Learning (PFL) is a powerful method implemented using Hypernetworks (PHNs) to approximate the Pareto front. This method enables the acquisition of a mapping function from a given preference vector to the solutions on the Pareto front. However, most existing PFL approaches still face two challenges: (a) sampling rays in high-dimensional spaces; (b) failing to cover the entire Pareto front, which has a convex shape. Here, we introduce a novel PFL framework, called PHN-HVVS, which decomposes the design space into Voronoi grids and deploys a genetic algorithm (GA) for Voronoi grid partitioning within high-dimensional space. We put forward a new loss function, which effectively contributes to more extensive coverage of the resultant Pareto front and maximizes the HV Indicator. Experimental results on multiple MOO machine learning tasks demonstrate that PHN-HVVS outperforms the baselines significantly in generating the Pareto front. Also, we illustrate that PHN-HVVS advances the methodologies of several recent problems in the FL field.
Paperid:1050
Authors:Marwa Abdulhai · Isadora White · Charlie Snell · Charles Sun · Joey Hong · Yuexiang Zhai · Kelvin Xu · Sergey Levine
Abstract:
Large language models (LLMs) provide excellent text-generation capabilities, but standard prompting and generation methods generally do not lead to intentional or goal-directed agents and might necessitate considerable prompt tuning. Even the best current LLMs rarely ask clarifying questions, engage in explicit information gathering, or take actions that lead to better decisions after multiple turns. Reinforcement learning has the potential to leverage the powerful modeling capabilities of LLMs, as well as their internal representation of textual interactions, to create capable goal-directed language agents. This can enable intentional and temporally extended interactions, such as with humans, the emergence of complex skills such as persuasion, and long-horizon strategic behavior, such as in the context of games. Enabling this requires the community to develop reliable reinforcement learning algorithms for training LLMs. Developing such algorithms requires tasks that can gauge progress on algorithm design, provide accessible and reproducible evaluations for multi-turn interactions, and cover a range of task properties and challenges in improving reinforcement learning algorithms. Our paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for LLMs, together with an open-source research framework for getting started on multi-turn RL with offline value-based and online policy-based RL methods. Our benchmark consists of 3 Interactive Dialogue tasks and 5 RL Capability tests for a total of 8 tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.
Paperid:1051
Authors:Glory Rongyu CHEN · Li'an Zhuo · Linlin Yang · Qi WANG · Bang Zhang · Angela Yao · Liefeng Bo
Abstract:
Vision Transformers (ViT) are remarkable at 3D pose estimation, yet they still encounter certain challenges. One issue is that the popular ViT architecture for pose estimation is limited to images and lacks temporal information. Another challenge is that the prediction often fails to maintain pixel alignment with the original images. To address these issues, we propose a systematic framework for 3D pose estimation, which we call ExtPose. ExtPose extends image ViT by taking in additional 2D pose evidence and captures temporal information in a full attention-based manner. We use 2D human skeleton images to integrate structured 2D pose information. By sharing parameters and attending across the modality, we enhance the consistency between 3D poses and 2D images without introducing additional parameters. We achieve state-of-the-art (SOTA) performance on multiple human and hand pose estimation benchmarks, with substantial improvements in PA-MPJPE over other ViT-based methods: 34.0mm (-23%) on 3DPW and 4.9mm (-18%) on FreiHAND.
Paperid:1052
Authors:Marco Bagatella · Jonas Hübotter · Georg Martius · Andreas Krause
Abstract:
Pretrained generalist policies are rapidly gaining relevance in robot learning due to their promise of fast adaptation to novel, in-domain tasks. This adaptation often relies on collecting new demonstrations for a specific task of interest and applying imitation learning algorithms, such as behavioral cloning. However, as soon as several tasks need to be learned, we must decide: which tasks should be demonstrated, and how often? We study this multi-task problem and explore an interactive framework in which the agent adaptively selects the tasks to be demonstrated. We propose AMF (Active Multi-task Fine-tuning), an algorithm to maximize multi-task policy performance under a limited demonstration budget by collecting demonstrations yielding the largest information gain on the expert policy. We derive performance guarantees for AMF under regularity assumptions and demonstrate its empirical effectiveness in efficiently fine-tuning neural policies in complex and high-dimensional environments.
Paperid:1053
Authors:Guoxin Chen · Minpeng Liao · Peiying Yu · Dingmin Wang · Zile Qiao · Chao Yang · Xin Zhao · Kai Fan
Abstract:
Retrieval-augmented generation (RAG) systems face a fundamental challenge in aligning independently developed retrievers and large language models (LLMs). Existing approaches typically involve modifying either component or introducing simple intermediate modules, resulting in practical limitations and sub-optimal performance. Inspired by human search behavior, which typically involves a back-and-forth process of proposing search queries and reviewing documents, we propose C-3PO, a proxy-centric framework that facilitates communication between retrievers and LLMs through a lightweight multi-agent system. Our framework implements three specialized agents that collaboratively optimize the entire RAG pipeline without altering the retriever and LLMs. These agents work together to assess the need for retrieval, generate effective queries, and select information suitable for the LLMs. To enable effective multi-agent coordination, we develop a tree-structured rollout approach for reward credit assignment in reinforcement learning. Extensive experiments in both in-domain and out-of-distribution scenarios demonstrate that C-3PO significantly enhances RAG performance while maintaining plug-and-play flexibility and superior generalization capabilities.
Paperid:1054
Authors:Michal Lukasik · Lin Chen · Harikrishna Narasimhan · Aditya Menon · Wittawat Jitkrittum · Felix Xinnan Yu · Sashank J. Reddi · Thomas Fu · MohammadHossein Bateni · Sanjiv Kumar
Abstract:
Bipartite ranking is a fundamental supervised learning problem, with applications ranging from information retrieval to medical diagnosis. Here, the goal is to learn a ranking over instances with maximal AUC against a single binary target label. However, one may often observe multiple binary target labels, e.g., from distinct human annotators. How can one synthesize such labels into a single coherent ranking? In this work, we formally analyze two approaches to this problem --- loss aggregation and label aggregation --- by characterizing their Bayes-optimal solutions. Based on this, we show that while both methods can yield Pareto-optimal solutions, loss aggregation can exhibit label dictatorship: one can inadvertently (and undesirably) favor one label over others. This suggests that label aggregation can be preferable to loss aggregation, which we empirically verify.
Paperid:1055
Authors:Jiujun He · Huazhen Lin
Abstract:
Most existing structured pruning methods for Large Language Models (LLMs) require substantial computational and data resources for retraining to reestablish the corrupted correlations, making them prohibitively expensive. To address this, we propose an efficient pruning framework for LLMs called Orthogonal Neuron Decomposition and Linear Calibration (Olica), which eliminates the need for retraining. A key observation is that the multi-head attention (MHA) layer depends on two types of matrix products (i.e., ${\rm W}_q{\rm W}^{\top}_k$ and ${\rm W}_v{\rm W}^{\top}_o$). By treating these matrix products as unified entities and applying principal component analysis (PCA), we extract the most important information to compress LLMs without sacrificing accuracy or disrupting their original structure. Consequently, retraining becomes unnecessary. Moreover, a fast decomposition method is devised, reducing the complexity of PCA by a factor of the square of the number of attention heads. Additionally, to mitigate the error accumulation problem caused by pruning the feed-forward network (FFN) layer, we introduce a linear calibration method to reconstruct the residual errors of a pruned layer using two low-rank matrices. By leveraging singular value decomposition (SVD) on the solution of the least-squares problem, these matrices are obtained without requiring retraining. Extensive experiments show that the proposed Olica is efficient in terms of data usage, GPU memory, and running time, while delivering superior performance across multiple benchmarks. For example, it requires only hundreds of samples, several minutes of runtime, and just a single 16GB GPU to prune a LLaMA-7B model with outstanding performance compared to existing pruning methods.
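For the MHA step, the abstract says PCA is applied to the unified products such as ${\rm W}_q{\rm W}^{\top}_k$. A minimal sketch of that idea via truncated SVD, the standard route to a best low-rank approximation (the shapes, rank, and square-root factor split below are illustrative, not Olica's exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8                                  # hidden size, retained rank
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))

P = Wq @ Wk.T                                 # unified entity W_q W_k^T
U, s, Vt = np.linalg.svd(P)                   # principal directions of the product
Wq_c = U[:, :r] * np.sqrt(s[:r])              # compressed factors whose product
Wk_c = Vt[:r].T * np.sqrt(s[:r])              # is the best rank-r approximation of P

err = np.linalg.norm(P - Wq_c @ Wk_c.T) / np.linalg.norm(P)
```

Because attention scores depend on $W_q W_k^\top$ only through this product, the compressed factors can replace the originals without changing the layer's interface.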
Paperid:1056
Authors:Deborah Hendrych · Sebastian Pokutta · Mathieu Besançon · David Martinez-Rubio
Abstract:
We present a new stepsize strategy based on the secant method for Frank-Wolfe algorithms. This strategy, which requires mild assumptions about the function under consideration, can be applied to any Frank-Wolfe algorithm. It is as effective as full line search and, in particular, allows for adapting to the local smoothness of the function, such as in (Pedregosa et al., 2020), but comes with a significantly reduced computational cost, leading to higher effective rates of convergence. We provide theoretical guarantees and demonstrate the effectiveness of the strategy through numerical experiments.
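Concretely, the secant strategy can be read as root-finding on the directional derivative $\phi'(\gamma) = \langle \nabla f(x + \gamma d), d \rangle$ along the Frank-Wolfe segment $d = s - x$. A toy sketch on a quadratic over the simplex (the function names, iteration budget, and clamping to $[0,1]$ are illustrative, not the paper's exact procedure):

```python
import numpy as np

def secant_step_size(grad_f, x, d, g0=0.0, g1=1.0, iters=10, tol=1e-10):
    """Approximate argmin over gamma in [0,1] of f(x + gamma*d) by running
    the secant method on phi(gamma) = <grad f(x + gamma*d), d>."""
    phi = lambda g: grad_f(x + g * d) @ d
    p0, p1 = phi(g0), phi(g1)
    for _ in range(iters):
        if abs(p1 - p0) < tol:
            break
        g0, g1, p0 = g1, g1 - p1 * (g1 - g0) / (p1 - p0), p1
        g1 = min(max(g1, 0.0), 1.0)           # stay on the segment
        p1 = phi(g1)
    return g1

# f(x) = ||x - b||^2 over the simplex; one Frank-Wolfe step from a vertex
b = np.array([0.2, 0.7, 0.1])
grad_f = lambda x: 2 * (x - b)
x = np.array([1.0, 0.0, 0.0])                 # current iterate
s = np.zeros(3); s[np.argmin(grad_f(x))] = 1  # linear minimization oracle
gamma = secant_step_size(grad_f, x, s - x)
x_new = x + gamma * (s - x)
```

For a quadratic, $\phi'$ is linear in $\gamma$, so the secant iteration lands on the exact line-search minimizer after a single update; for general smooth $f$ it converges superlinearly near the minimizer, which is why it can match full line search at a fraction of the cost.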
Paperid:1057
Authors:Kaixuan Xu · Jiajun Chai · Sicheng Li · Yuqian Fu · Yuanheng Zhu · Dongbin Zhao
Abstract:
Diplomacy is a complex multiplayer game that requires both cooperation and competition, posing significant challenges for AI systems. Traditional methods rely on equilibrium search to generate extensive game data for training, which demands substantial computational resources. Large Language Models (LLMs) offer a promising alternative, leveraging pre-trained knowledge to achieve strong performance with relatively small-scale fine-tuning. However, applying LLMs to Diplomacy remains challenging due to the exponential growth of possible action combinations and the intricate strategic interactions among players. To address this challenge, we propose DipLLM, a fine-tuned LLM-based agent that learns equilibrium policies for Diplomacy. DipLLM employs an autoregressive factorization framework to simplify the complex task of multi-unit action assignment into a sequence of unit-level decisions. By defining an equilibrium policy within this framework as the learning objective, we fine-tune the model using only 1.5% of the data required by the state-of-the-art Cicero model, surpassing its performance. Our results demonstrate the potential of fine-tuned LLMs for tackling complex strategic decision-making in multiplayer games.
Paperid:1058
Authors:Daniel Shao · Richard Chen · Andrew Song · Joel Runevic · Ming Y. Lu · Tong Ding · Faisal Mahmood
Abstract:
Multiple Instance Learning (MIL) is a cornerstone approach in computational pathology for distilling embeddings from gigapixel tissue images into patient-level representations to predict clinical outcomes. However, MIL is frequently challenged by the constraints of working with small, weakly-supervised clinical datasets. Unlike fields such as natural language processing and computer vision, which effectively use transfer learning to improve model quality in data-scarce environments, the transferability of MIL models remains largely unexplored. We conduct the first comprehensive investigation into the transfer learning capabilities of pretrained MIL models, evaluating 11 MIL models across 19 pretraining tasks spanning tissue subtyping, cancer grading, and molecular subtype prediction. We observe a substantial performance boost from fine-tuning pretrained models over training from randomly initialized weights, even with domain differences between pretraining and target tasks. Pretraining on pan-cancer datasets enables consistent generalization across organs and task types compared to single-disease pretraining. Remarkably, this pan-cancer pretraining leads to better transfer than that of a state-of-the-art slide-level foundation model, while using only 6.5% of the training data. These findings indicate that MIL architectures exhibit robust adaptability, offering insights into the benefits of leveraging pretrained models to enhance performance in computational pathology.
Paperid:1059
Authors:Chen Wang · Siyu Hu · Guangming Tan · Weile Jia
Abstract:
Pre-trained interatomic potentials have become a new paradigm for atomistic materials simulations, enabling accurate and efficient predictions across diverse chemical systems. Despite their promise, fine-tuning is often required for complex tasks to achieve high accuracy. Traditional parameter-efficient fine-tuning approaches are effective in NLP and CV. However, when applied to equivariant pre-trained interatomic potentials, these methods inevitably break equivariance, a critical property for preserving physical symmetries. In this paper, we introduce ELoRA (Equivariant Low-Rank Adaptation), a novel fine-tuning method designed specifically for equivariant Graph Neural Networks (GNNs), the backbones of multiple pre-trained interatomic potentials. ELoRA adopts a path-dependent decomposition for weight updates, which offers two key advantages: (1) it preserves equivariance throughout the fine-tuning process, ensuring physically consistent predictions, and (2) it leverages low-rank adaptations to significantly improve data efficiency. We prove that ELoRA maintains equivariance and demonstrate its effectiveness through comprehensive experiments. On the rMD17 organic dataset, ELoRA achieves a 25.5% improvement in energy prediction accuracy and a 23.7% improvement in force prediction accuracy compared to full-parameter fine-tuning. Similarly, across 10 inorganic datasets, ELoRA achieves average improvements of 12.3% and 14.4% in energy and force predictions, respectively.
Paperid:1060
Authors:Huahui Yi · Wei Xu · Ziyuan Qin · Xi Chen · Xiaohu Wu · Kang Li · Qicheng Lao
Abstract:
Existing prompt-based approaches have demonstrated impressive performance in continual learning, leveraging pre-trained large-scale models for classification tasks. However, the tight coupling between foreground-background information and the coupled attention between prompts and image-text tokens present significant challenges in incremental medical object detection tasks, due to the conceptual gap between medical and natural domains. To overcome these challenges, we introduce the iDPA framework, which comprises two main components: 1) Instance-level Prompt Generation (IPG), which decouples fine-grained instance-level knowledge from images and generates prompts that focus on dense predictions, and 2) Decoupled Prompt Attention (DPA), which decouples the original prompt attention, enabling a more direct and efficient transfer of prompt information while reducing memory usage and mitigating catastrophic forgetting. We collect 13 clinical, cross-modal, multi-organ, and multi-category datasets, referred to as ODinM-13, and experiments demonstrate that iDPA outperforms existing SOTA methods, with FAP improvements of 5.44%, 4.83%, 15.39%, and 4.59% in the full-data, 1-shot, 10-shot, and 50-shot settings, respectively.
Paperid:1061
Authors:Udaya Ghai · Karan Singh
Abstract:
Boosting provides a practical and provably effective framework for constructing accurate learning algorithms from inaccurate rules of thumb. It extends the promise of sample-efficient learning to settings where direct Empirical Risk Minimization (ERM) may not be implementable efficiently. In the realizable setting, boosting is known to offer this computational reprieve without compromising on sample efficiency. However, in the agnostic case, existing boosting algorithms fall short of achieving the optimal sample complexity. This paper highlights an unexpected and previously unexplored avenue of improvement: unlabeled samples. We design a computationally efficient agnostic boosting algorithm that matches the sample complexity of ERM, given polynomially many additional unlabeled samples. In fact, we show that the total number of samples needed (labeled and unlabeled combined) is never more than that for the best known agnostic boosting algorithm -- so this result is never worse -- while only a vanishing fraction of these need to be labeled for the algorithm to succeed. This is particularly fortuitous for learning-theoretic applications of agnostic boosting, which often take place in settings where unlabeled samples are available for free. We detail other applications of this result in reinforcement learning and learning theory.
Paperid:1062
Authors:Wenyin Zhou · Christopher I Sprague · Vsevolod Viliuga · Matteo Tadiello · Arne Elofsson · Hossein Azizpour
Abstract:
Molecular structure generation is a fundamental problem that involves determining the 3D positions of molecules' constituents. It has crucial biological applications, such as molecular docking, protein folding, and molecular design. Recent advances in generative modeling, such as diffusion models and flow matching, have made great progress on these tasks by modeling molecular conformations as a distribution. In this work, we focus on flow matching and adopt an energy-based perspective to improve training and inference of structure generation models. Our view results in a mapping function, represented by a deep network, that is directly learned to \textit{iteratively} map random configurations, i.e. samples from the source distribution, to target structures, i.e. points in the data manifold. This yields a conceptually simple and empirically effective flow matching setup that is theoretically justified and has interesting connections to fundamental properties such as idempotency and stability, as well as to empirically useful techniques such as structure refinement in AlphaFold. Experiments on protein binding, both single- and multi-ligand docking, as well as protein backbone generation consistently demonstrate the method's effectiveness, where it outperforms recent baselines of task-associated flow matching and diffusion models, using a similar computational budget.
Paperid:1063
Authors:Tim Steinert · David Ginsbourger · Ove Christiansen · August Lykke-Møller · Henry Moss
Abstract:
We study the incorporation of equivariances into vector-valued GPs and more general classes of random field models. While kernels guaranteeing such equivariances have been investigated previously, their evaluation is often computationally prohibitive due to required integrations over the involved groups. In this work, we propose the first integration-free construction for equivariant kernels, leveraging the group-theoretic notion of fundamental domains. We establish data-efficient and computationally lightweight GP models for velocity fields and molecular electric dipole moments and demonstrate that our proposed kernel may be leveraged to extract equivariant components from data.
Paperid:1064
Authors:DiJia Su · Hanlin Zhu · Yingchen Xu · Jiantao Jiao · Yuandong Tian · Qinqing Zheng
Abstract:
Large Language Models (LLMs) excel at reasoning and planning when trained on chain-of-thought (CoT) data, where the step-by-step thought process is explicitly outlined by text tokens. However, this results in lengthy inputs where many words support textual coherence rather than core reasoning information, and processing these inputs consumes substantial computational resources. In this work, we propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens generated by a VQ-VAE, significantly reducing the length of reasoning traces. We explore the use of latent trace abstractions in two scenarios: 1) training the model from scratch for the Keys-Finding Maze problem, and 2) fine-tuning LLMs on this hybrid data with an extended vocabulary including unseen latent tokens, for both logical and mathematical reasoning problems. To facilitate effective learning, we introduce a simple training procedure that randomly mixes latent and text tokens, which enables fast adaptation to new latent tokens. Our approach consistently outperforms the baseline methods on various benchmarks, such as Math (+4.2%, Llama-3.2-1B), GSM8K (+4.1%, Llama-3.2-3B), and Fresh-Gaokao-Math-2023 (+13.3%, Llama-3.1-8B), with an average reduction of 17% in reasoning-trace length.
Paperid:1065
Authors:Unique Subedi · Ambuj Tewari
Abstract:
We study active data collection strategies for operator learning when the target operator is linear and the input functions are drawn from a mean-zero stochastic process with continuous covariance kernels. With an active data collection strategy, we establish an error convergence rate in terms of the decay rate of the eigenvalues of the covariance kernel. We can achieve arbitrarily fast error convergence rates with sufficiently rapid eigenvalue decay of the covariance kernels. This contrasts with the passive (i.i.d.) data collection strategies, where the convergence rate is never faster than linear decay ($\sim n^{-1}$). In fact, for our setting, we show a \emph{non-vanishing} lower bound for any passive data collection strategy, regardless of the eigenvalue decay rate of the covariance kernel. Overall, our results show the benefit of active data collection strategies in operator learning over their passive counterparts.
Paperid:1066
Authors:Aviv Bick · Eric Xing · Albert Gu
Abstract:
State-space models (SSMs) have demonstrated impressive efficiency and performance across a range of tasks. Despite these strengths, they face notable limitations when it comes to tasks that require structured reasoning, such as retrieval, copying, and in-context learning. In this work, we investigate the root cause of this limitation and discover that both RNN-based and Attention-based models rely on the same mechanism: Gather & Aggregate, where the network learns to use two heads from different layers to collaboratively implement a retrieval algorithm. Gather Heads partition the input into logical parts, while Aggregation Heads integrate the information from these parts. Our experiments show that, unlike knowledge, which is distributed across all layers and heads, the Gather & Aggregate mechanism is rare and highly concentrated. This highlights the critical role of these specialized heads in retrieval-based tasks. For example, disabling just one Gather Head or Aggregation Head reduces MMLU performance from 66% to 25%, equivalent to random guessing. Through various tasks, we identify a key limitation of SSM-based models: their restricted ability to implement effective Aggregation Heads. This understanding not only informs the design of next-generation models, but also sheds light on existing ones. For instance, we find that the success of Hybrid models lies in the optimization process, which shifts the responsibility for this mechanism to the Attention layers.
Paperid:1067
Authors:Shaoan Xie · Lingjing Kong · Yujia Zheng · Zeyu Tang · Eric Xing · Guangyi Chen · Kun Zhang
Abstract:
Concept learning seeks to extract semantic and interpretable representations of atomic concepts from high-dimensional data such as images and text, which can be instrumental to a variety of downstream tasks (e.g., image generation/editing). Despite its importance, the theoretical foundations for learning atomic concepts and their interactions, especially from multimodal distributions, remain underexplored. In this work, we establish fundamental conditions for learning atomic multimodal concepts and their underlying interactions. We formulate concept learning as a latent variable identification problem, representing atomic concepts and their interactions as latent variables within a graphical model. Our theoretical contribution is providing component-wise identifiability of atomic concepts under flexible, nonparametric conditions that accommodate both continuous and discrete modalities. This advances beyond prior approaches that offer only block-wise identifiability or rely on parametric assumptions.
Paperid:1068
Authors:Huaicheng Zhou · Zifeng Zhuang · Donglin Wang
Abstract:
The integration of Deep Neural Networks (DNNs) in Reinforcement Learning (RL) systems has led to remarkable progress in solving complex tasks but also introduced challenges like primacy bias and dead neurons. Primacy bias skews learning towards early experiences, while dead neurons diminish the network's capacity to acquire new knowledge. Traditional reset mechanisms aimed at addressing these issues often involve maintaining large replay buffers to train new networks or selectively resetting subsets of neurons. However, these approaches either incur substantial computational costs or fail to effectively reset the entire network, resulting in underutilization of network plasticity and reduced learning efficiency. In this work, we introduce the novel concept of neuron regeneration, which combines reset mechanisms with knowledge recovery techniques. We also propose a new framework called Sustainable Backup Propagation (SBP) that effectively maintains plasticity in neural networks through this neuron regeneration process. The SBP framework achieves whole-network neuron regeneration through two key procedures: cycle reset and inner distillation. Cycle reset involves a scheduled renewal of neurons, while inner distillation functions as a knowledge recovery mechanism at the neuron level. To validate our framework, we integrate SBP with Proximal Policy Optimization (PPO) and propose a novel distillation function for inner distillation. This integration results in Plastic PPO (P3O), a new algorithm that enables efficient cyclic regeneration of all neurons in the actor network. Extensive experiments demonstrate that the approach effectively maintains policy plasticity and improves sample efficiency in reinforcement learning tasks.
Paperid:1069
Authors:Thibaud Gloaguen · Nikola Jovanović · Robin Staab · Martin Vechev
Abstract:
LLM watermarks stand out as a promising way to attribute ownership of LLM-generated text. One threat to watermark credibility comes from spoofing attacks, where an unauthorized third party forges the watermark, enabling it to falsely attribute arbitrary texts to a particular LLM. Despite recent work demonstrating that state-of-the-art schemes are, in fact, vulnerable to spoofing, no prior work has focused on post-hoc methods to discover spoofing attempts. In this work, we propose the first reliable statistical method to distinguish spoofed from genuinely watermarked text, suggesting that current spoofing attacks are less effective than previously thought. In particular, we show that regardless of their underlying approach, all current learning-based spoofing methods consistently leave observable artifacts in spoofed texts, indicative of watermark forgery. We build upon these findings to propose rigorous statistical tests that reliably reveal the presence of such artifacts and thus demonstrate that a watermark has been spoofed. Our experimental evaluation shows high test power across all learning-based spoofing methods, providing insights into their fundamental limitations and suggesting a way to mitigate this threat.
Paperid:1070
Authors:Mahir Labib Dihan · Tanvir Hassan · Md Tanvir Parvez · Hasebul Hasan · Almash Alam · Muhammad Aamir Cheema · Mohammed Eunus Ali · Md Rizwan Parvez
Abstract:
Recent advancements in foundation models have improved autonomous tool usage and reasoning, but their capabilities in map-based reasoning remain underexplored. To address this, we introduce MapEval, a benchmark designed to assess foundation models across three distinct tasks (textual, API-based, and visual reasoning) through 700 multiple-choice questions spanning 180 cities and 54 countries, covering spatial relationships, navigation, travel planning, and real-world map interactions. Unlike prior benchmarks that focus on simple location queries, MapEval requires models to handle long-context reasoning, API interactions, and visual map analysis, making it the most comprehensive evaluation framework for geospatial AI. In an evaluation of 28 foundation models, including Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro, none surpasses 67% accuracy, with open-source models performing significantly worse and all models lagging over 20% behind human performance. These results expose critical gaps in spatial inference, as models struggle with distances, directions, route planning, and place-specific reasoning, highlighting the need for better geospatial AI to bridge the gap between foundation models and real-world navigation.
Paperid:1071
Authors:Juyun Wee · Minjae Park · Jaeho Lee
Abstract:
Depth pruning aims to reduce the inference cost of a large language model without any hardware-specific complications, by simply removing several less important transformer blocks. However, our empirical findings suggest that the importance of a transformer block may be highly task-dependent: a block that is crucial for one task can be removed without degrading the accuracy on another task. Based on this observation, we develop a dynamic depth pruning algorithm, coined PuDDing (Prompt-routed Dynamic Depth Pruning), which determines which blocks to omit from the model based on the input prompt. PuDDing operates by training a lightweight router to predict the best omission set among a set of options, where this option set has also been constructed in a data-driven manner. Empirical results on commonsense reasoning benchmarks demonstrate that PuDDing effectively accelerates the inference of language models and achieves better on-task performance than static depth pruning baselines.
Paperid:1072
Authors:Junhong Zhang · Zhihui Lai
Abstract:
Kernel methods are powerful tools for nonlinear learning with well-established theory. Scalability has been their long-standing challenge. Despite existing successes, there are two limitations in large-scale kernel methods: (i) the memory overhead is too high for users to afford; (ii) existing efforts mainly focus on kernel ridge regression (KRR), while other models lack study. In this paper, we propose Joker, a joint optimization framework for diverse kernel models, including KRR, logistic regression, and support vector machines. We design a dual block coordinate descent method with trust region (DBCD-TR) and adopt kernel approximation with randomized features, leading to low memory costs and high efficiency in large-scale learning. Experiments show that Joker saves up to 90% memory while achieving comparable or even better training time and performance than state-of-the-art methods.
Paperid:1073
Authors:Kristina Nikolić · Luze Sun · Jie Zhang · Florian Tramer
Abstract:
Jailbreak attacks bypass the guardrails of large language models to produce harmful outputs. In this paper, we ask whether the model outputs produced by existing jailbreaks are actually useful. For example, when jailbreaking a model to give instructions for building a bomb, does the jailbreak yield good instructions? Since the utility of most unsafe answers (e.g., bomb instructions) is hard to evaluate rigorously, we build new jailbreak evaluation sets with known ground-truth answers, by aligning models to refuse questions related to benign and easy-to-evaluate topics (e.g., biology or math). Our evaluation of eight representative jailbreaks across five utility benchmarks shows vast disparities in the utility of model responses. For example, while all jailbreaks we tested bypass guardrails in models aligned to refuse to answer math questions, this comes at the expense of a drop of up to 97% in accuracy. Overall, our work proposes jailbreak utility as a new important metric in AI safety, and introduces benchmarks to evaluate existing and future jailbreaks.
Paperid:1074
Authors:Loris Gaven · Thomas Carta · Clément Romac · Cédric Colas · sylvain lamprier · Olivier Sigaud · Pierre-Yves Oudeyer
Abstract:
Open-ended learning agents must efficiently prioritize goals in vast possibility spaces, focusing on those that maximize learning progress (LP). When such autotelic exploration is achieved by LLM agents trained with online RL in high-dimensional and evolving goal spaces, a key challenge for LP prediction is modeling one's own competence, a form of metacognitive monitoring. Traditional approaches either require extensive sampling or rely on brittle expert-defined goal groupings. We introduce MAGELLAN, a metacognitive framework that lets LLM agents learn to predict their competence and learning progress online. By capturing semantic relationships between goals, MAGELLAN enables sample-efficient LP estimation and dynamic adaptation to evolving goal spaces through generalization. In an interactive learning environment, we show that MAGELLAN improves LP prediction efficiency and goal prioritization, being the only method that allows the agent to fully master a large and evolving goal space. These results demonstrate how augmenting LLM agents with a metacognitive ability for LP prediction can effectively scale curriculum learning to open-ended goal spaces.
Paperid:1075
Authors:Kiarash Banihashem · Xiang Chen · MohammadTaghi Hajiaghayi · Sungchul Kim · Kanak Mahadik · Ryan A Rossi · Tong Yu
Abstract:
Metric embeddings are fundamental in machine learning, enabling similarity search, dimensionality reduction, and representation learning. They underpin modern architectures like transformers and large language models, facilitating scalable training and improved generalization. Theoretically, the classic problem in embedding design is mapping arbitrary metrics into $\ell_p$ spaces while approximately preserving pairwise distances. We study this problem in a fully dynamic setting, where the underlying metric is a graph metric subject to edge insertions and deletions. Our goal is to maintain an efficient embedding after each update. We present the first fully dynamic algorithm for this problem, achieving $O(\log(n))^{2q} \cdot O(\log(nW))^{q-1}$ expected distortion with $O(m^{1/q + o(1)})$ update time and $O(q \log(n) \log(nW))$ query time, where $q \ge 2$ is an integer parameter.
Paperid:1076
Authors:Zilong Wang · Zhiyao Zhang · Shuai Li
Abstract:
Multiplayer multi-armed bandit (MP-MAB) problems have been widely studied owing to their diverse applications across numerous domains. We consider an MP-MAB problem where $N$ players compete for $K$ arms over $T$ rounds. The reward distributions are heterogeneous, where each player has a different expected reward for the same arm. When multiple players select the same arm, they collide and obtain zero rewards. In this paper, our target is to find the max-min fairness matching that maximizes the reward of the player who receives the lowest reward. This paper improves the existing max-min regret upper bound of $O(\exp(1/\Delta) + K^3 \log T \log\log T)$. More specifically, our decentralized fair elimination algorithm (DFE) deals with heterogeneity and collisions carefully and attains a regret upper bound of $O((N^2+K)\log T / \Delta)$, where $\Delta$ is the minimum reward gap between the max-min value and sub-optimal arms. In addition, this paper also provides an $\Omega(\max\{N^2, K\} \log T / \Delta)$ regret lower bound for this problem, which indicates that our algorithm is optimal with respect to the key parameters $T, N, K$, and $\Delta$. Additional numerical experiments also show the efficiency and improvement of our algorithms.
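To make the max-min objective concrete, here is a toy brute-force computation of the max-min fairness matching. The reward table and function name are illustrative, and this enumeration over permutations (which avoids collisions by construction) is not the DFE algorithm itself, which must work from bandit feedback:

```python
from itertools import permutations

def max_min_matching(rewards):
    """Return (best_min_reward, assignment) maximizing the minimum expected
    reward over players; rewards[i][k] is player i's mean reward on arm k
    (N players, K >= N arms).  Brute force, for illustration only."""
    n, k = len(rewards), len(rewards[0])
    best_val, best_assign = float("-inf"), None
    for assign in permutations(range(k), n):     # collision-free assignments
        val = min(rewards[i][assign[i]] for i in range(n))
        if val > best_val:
            best_val, best_assign = val, assign
    return best_val, best_assign

# Two players, three arms: heterogeneous mean rewards.
rewards = [[0.9, 0.2, 0.5],
           [0.8, 0.7, 0.1]]
val, assign = max_min_matching(rewards)
# best matching gives player 0 arm 0 (0.9) and player 1 arm 1 (0.7),
# so the worst-off player earns 0.7
```

Note how the max-min solution can differ from maximizing the sum of rewards; here it protects player 1 rather than greedily assigning both players their individually best arms.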
Paperid:1077
Authors:Xiandong Zou · Wanyu LIN · Yuchen Li · Pan Zhou
Abstract:
Aligning Large Language Model (LLM) responses with human preferences is vital for building safe and controllable AI systems. While preference optimization methods based on Plackett-Luce (PL) and Bradley-Terry (BT) models have shown promise, they face challenges such as poor handling of harmful content, inefficient use of dispreferred responses, and, specifically for PL, high computational costs. To address these issues, we propose Hard Preference Sampling (HPS), a novel framework for robust and efficient human preference alignment. HPS introduces a training loss that prioritizes the most preferred response while rejecting all dispreferred and harmful ones. It emphasizes "hard" dispreferred responses, those closely resembling preferred ones, to enhance the model's rejection capabilities. By leveraging a single-sample Monte Carlo sampling strategy, HPS reduces computational overhead while maintaining alignment quality. Theoretically, HPS improves sample efficiency over existing PL methods and maximizes the reward margin between preferred and dispreferred responses, ensuring clearer distinctions. Experiments on the HH-RLHF and PKU-Safety datasets validate HPS's effectiveness, achieving comparable BLEU and reward scores while greatly improving reward margins and thus reducing harmful content generation.
Paperid:1078
Authors:Zhihao Li · Xue JIANG · liyuan liu · xuelin zhang · Hong Chen · Feng Zheng
Abstract:
Large language models (LLMs) have demonstrated remarkable potential in handling natural language processing (NLP) tasks and beyond. LLMs can usually be categorized as transformer decoder-only models (DOMs), utilizing Next-Token-Prediction (NTP) as their pre-training methodology. Despite their tremendous empirical successes, the theoretical understanding of how NTP pre-training affects the model's generalization behavior is lacking. To fill this gap, we establish a fine-grained generalization analysis for NTP pre-training based on Rademacher complexity, where the dependence between tokens is also addressed. Technically, a novel decomposition of Rademacher complexity is developed to study DOMs from the perspectives of the representation learner and the token predictor, respectively. Furthermore, upper bounds on the covering number are established for multi-layer and multi-head transformer-decoder models under the Frobenius norm, which theoretically pioneers the incorporation of the mask matrix within the self-attention mechanism. Our results reveal that the generalization ability of NTP pre-training is affected quantitatively by the number of token sequences $N$, the maximum length of sequence $m$, and the count of parameters in the transformer model $\Theta$. Additionally, experiments on public datasets verify our theoretical findings.
Paperid:1079
Authors:Ananth Balashankar · Ziteng Sun · Jonathan Berant · Jacob Eisenstein · Michael Collins · Adrian Hutter · Jong Lee · Chirag Nagpal · Flavien Prost · Aradhana Sinha · Ananda Suresh · Ahmad Beirami
Abstract:
Language model alignment is a critical step in training modern generative language models. Alignment aims to improve the win rate of a sample from the aligned model against the base model. Today, we are increasingly using inference-time algorithms (e.g., Best-of-$N$, controlled decoding, tree search) to decode from language models rather than standard sampling. We show that this train/test mismatch makes the standard RLHF framework sub-optimal in view of such inference-time methods. To this end, we propose a framework for inference-aware alignment (InfAlign), which aims to optimize the *inference-time win rate* of the aligned policy against the base model. We prove that for any inference-time decoding procedure, the optimal aligned policy is the solution to the standard RLHF problem with a *transformation* of the reward. This motivates us to provide the calibrate-and-transform RL (InfAlign-CTRL) algorithm to solve this problem, which involves a reward calibration step and a KL-regularized reward maximization step with a transformation of the calibrated reward. For best-of-$N$ sampling and best-of-$N$ jailbreaking, we propose specific transformations offering up to 3-8% improvement on inference-time win rates. Finally, we also show that our proposed reward calibration method is a strong baseline for optimizing standard win rate.
Paperid:1080
Authors:Feifei Kou · Jiahao Wang · Lei Shi · Yuhan Yao · Yawen Li · Suguo Zhu · Zhongbao Zhang · Junping Du
Abstract:
Long-term time series forecasting has been widely studied, yet two aspects remain insufficiently explored: the interaction learning between different frequency components and the exploitation of periodic characteristics inherent in timestamps. To address the above issues, we propose CFPT, a novel method that empowers time series forecasting through Cross-Frequency Interaction (CFI) and Periodic-Aware Timestamp Modeling (PTM). To learn cross-frequency interactions, we propose the CFI branch, which processes signals in the frequency domain and captures their interactions through a feature fusion mechanism. Furthermore, to enhance prediction performance by leveraging timestamp periodicity, we design the PTM branch, which transforms timestamp sequences into 2D periodic tensors and utilizes 2D convolution to capture both intra-period dependencies and inter-period correlations of time series based on timestamp patterns. Extensive experiments on multiple real-world benchmarks demonstrate that CFPT achieves state-of-the-art performance in long-term forecasting tasks. Our code will soon be available.
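The "2D periodic tensor" idea can be sketched generically: fold a 1-D sequence of length L with period p into an (L // p) x p grid, so that rows expose intra-period structure and columns expose inter-period structure for a 2-D convolution. This is a hypothetical minimal sketch of the folding step only, not CFPT's actual pipeline; the function name is mine:

```python
def fold_periodic(seq, period):
    """Reshape a 1-D sequence into one row per period, truncating any
    incomplete final period.  Rows capture intra-period dependencies;
    columns (same offset across periods) capture inter-period correlations."""
    rows = len(seq) // period
    return [seq[r * period:(r + 1) * period] for r in range(rows)]

hours = list(range(48))            # two days of hourly timestamps
grid = fold_periodic(hours, 24)    # 2 x 24 grid: one row per day
```

A 2-D kernel sliding over `grid` then simultaneously sees neighboring hours (within a row) and the same hour on neighboring days (within a column).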
Paperid:1081
Authors:Wenxin Tai · Ting Zhong · Goce Trajcevski · Fan Zhou
Abstract:
This work presents the first systematic investigation into the trustworthiness of explanations generated by self-interpretable graph neural networks (GNNs), revealing why models trained with different random seeds yield inconsistent explanations. We identify redundancy -- resulting from weak conciseness constraints -- as the root cause of both explanation inconsistency and its associated inaccuracy, ultimately hindering user trust and limiting GNN deployment in high-stakes applications. Our analysis demonstrates that redundancy is inherently difficult to eliminate; however, a simple ensemble strategy can mitigate its detrimental effects. We validate our findings through extensive experiments across diverse datasets, model architectures, and self-interpretable GNN frameworks, providing a benchmark to guide future research on addressing redundancy and advancing GNN deployment in critical domains.
Paperid:1082
Authors:Haichen Hu · Rui Ai · Stephen Bates · David Simchi-Levi
Abstract:
Contextual sequential decision-making problems play a crucial role in machine learning, encompassing a wide range of downstream applications such as bandits, sequential hypothesis testing, and online risk control. These applications often require different statistical measures, including expectation, variance, and quantiles. In this paper, we provide a universal admissible algorithm framework for dealing with all kinds of contextual online decision-making problems that directly learns the whole underlying unknown distribution instead of focusing on individual statistics. This is much more difficult because the dimension of the regression is uncountably infinite, and any existing linear contextual bandits algorithm will result in infinite regret. To overcome this issue, we propose an efficient infinite-dimensional functional regression oracle for contextual cumulative distribution functions (CDFs), where each data point is modeled as a combination of context-dependent CDF basis functions. Our analysis reveals that the decay rate of the eigenvalue sequence of the design integral operator governs the regression error rate and, consequently, the utility regret rate. Specifically, when the eigenvalue sequence exhibits a polynomial decay of order $\frac{1}{\gamma}\ge 1$, the utility regret is bounded by $\tilde{\mathcal{O}}\big(T^{\frac{3\gamma+2}{2(\gamma+2)}}\big)$. By setting $\gamma=0$, this recovers the existing optimal regret rate for contextual bandits with finite-dimensional regression and is optimal under a stronger exponential decay assumption. Additionally, we provide a numerical method to compute the eigenvalue sequence of the integral operator, enabling the practical implementation of our framework.
Paperid:1083
Authors:Jeremy Bernstein · Laker Newhouse
Abstract:
An old idea in optimization theory says that since the gradient is a dual vector it may not be subtracted from the weights without first being mapped to the primal space where the weights reside. We take this idea seriously in this paper and construct such a duality map for general neural networks. Our map, which we call modular dualization, forms a unifying theoretical basis for training algorithms that are a) fast and b) scalable. Modular dualization involves first assigning operator norms to layers based on the semantics of each layer, and then using these layerwise norms to recursively induce a duality map on the weight space of the full neural architecture. We derive GPU-friendly algorithms for dualizing Embed, Linear and Conv2D layers—the latter two methods are based on a Newton-Schulz iteration. We conclude with small experiments demonstrating the speed, scalability and novel numerical properties of duality-based optimizers. Our methods were used in the Muon optimizer, which recently set speed records for training NanoGPT and was scaled up to a 1.5 billion parameter transformer.
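As a rough, self-contained illustration of the kind of Newton-Schulz iteration mentioned above (this is the classic cubic variant for approximating a matrix's nearest orthogonal factor, not necessarily the exact polynomial used in the paper or in Muon):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=12):
    """Approximate the polar factor U V^T of G (the nearest
    orthogonal matrix, where G = U S V^T is the SVD) with the
    cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X.

    Dividing by the Frobenius norm first guarantees the spectral
    norm is at most 1, which the iteration needs to converge.
    """
    X = G / np.linalg.norm(G)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```

Each step uses only matrix multiplications, which is what makes this style of dualization GPU-friendly.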
Paperid:1084
Authors:Xichen Guo · Feng Xie · Yan Zeng · Hao Zhang · zhi geng
Abstract:
We consider the problem of selecting instrumental variables from observational data, a fundamental challenge in causal inference. Existing methods mostly focus on additive linear, constant effects models, limiting their applicability in complex real-world scenarios. In this paper, we tackle a more general and challenging setting: the additive non-linear, constant effects model. We first propose a novel testable condition, termed the Cross Auxiliary-based independent Test (CAT) condition, for selecting the valid IV set. We show that this condition is both necessary and sufficient for identifying valid instrumental variable sets within such a model under milder assumptions. Building on this condition, we develop a practical algorithm for selecting the set of valid instrumental variables. Extensive experiments on synthetic data and two real-world datasets demonstrate the effectiveness and robustness of our proposed approach, highlighting its potential for broader applications in causal analysis.
Paperid:1085
Authors:Noa Rubin · Kirsten Fischer · Javed Lindner · Inbar Seroussi · Zohar Ringel · Michael Krämer · Moritz Helias
Abstract:
Theoretically describing feature learning in neural networks is crucial for understanding their expressive power and inductive biases, motivating various approaches. Some approaches describe network behavior after training through a simple change in kernel scale from initialization, resulting in a generalization power comparable to a Gaussian process. Conversely, in other approaches training results in the adaptation of the kernel to the data, involving complex directional changes to the kernel. While these approaches capture different facets of network behavior, their relationship and respective strengths across scaling regimes remain an open question. This work presents a theoretical framework of multi-scale adaptive feature learning bridging these approaches. Using methods from statistical mechanics, we derive analytical expressions for network output statistics which are valid across scaling regimes and in the continuum between them. A systematic expansion of the network's probability distribution reveals that mean-field scaling requires only a saddle-point approximation, while standard scaling necessitates additional correction terms. Remarkably, we find across regimes that kernel adaptation can be reduced to an effective kernel rescaling when predicting the mean network output of a linear network. However, even in this case, the multi-scale adaptive approach captures directional feature learning effects, providing richer insights than what could be recovered from a rescaling of the kernel alone.
Paperid:1086
Authors:Ping Yu · Weizhe Yuan · Olga Golovneva · Tianhao Wu · Sainbayar Sukhbaatar · JASON WESTON · Jing Xu
Abstract:
Training data quality is one of the most important drivers of final model quality. In this work, we introduce a method for evaluating data integrity based on the assumption that low-quality input prompts result in high-variance, low-quality responses. This is achieved by measuring the rejected response quality and the reward gap between the chosen and rejected preference pair. Our method, Rejecting Instruction Preferences (RIP), can be used to filter prompts from existing training sets, or to make high quality synthetic datasets, yielding large performance gains across various benchmarks compared to unfiltered data. Using Llama 3.1-8B-Instruct, RIP improves AlpacaEval2 LC Win Rate by 9.4%, Arena-Hard by 8.7%, and WildBench by 9.9%. Using Llama 3.3-70B-Instruct, RIP improves Arena-Hard from 67.5 to 82.9, from 18th place to 6th overall on the leaderboard.
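A minimal sketch of a filtering rule in the spirit described above. The thresholds, field names, and the exact keep/drop criterion are assumptions for illustration (the abstract suggests that low-quality prompts yield low-quality rejected responses and high-variance reward gaps), not the paper's actual procedure:

```python
def rip_filter(pairs, min_rejected_reward=0.3, max_reward_gap=0.5):
    """Keep prompts whose preference pair looks trustworthy:
    the rejected response should not score too poorly, and the
    chosen-rejected reward gap should not be extreme, since both
    symptoms suggest a noisy or underspecified prompt.
    (Thresholds and dict keys here are hypothetical.)"""
    kept = []
    for p in pairs:
        gap = p["chosen_reward"] - p["rejected_reward"]
        if p["rejected_reward"] >= min_rejected_reward and gap <= max_reward_gap:
            kept.append(p)
    return kept
```

In practice the rewards would come from a reward model scoring both responses of each preference pair.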
Paperid:1087
Authors:Xiaoyu Ma · Hao Chen · Yongjian Deng
Abstract:
Different modalities exhibit considerable gaps in optimization trajectories, including speeds and paths, which lead to *modality laziness* and *modality clash* when jointly training multimodal models, resulting in insufficient and imbalanced multimodal learning. Existing methods focus on enforcing the weak modality by adding modality-specific optimization objectives, aligning their optimization speeds, or decomposing multimodal learning to enhance unimodal learning. These methods fail to achieve both unimodal sufficiency and multimodal balance. In this paper, we, for the first time, address both concerns by proposing multimodal Data Remixing, including decoupling multimodal data and filtering hard samples for each modality to mitigate modality imbalance; and then batch-level reassembling to align the gradient directions and avoid cross-modal interference, thus enhancing unimodal learning sufficiency. Experimental results demonstrate that our method can be seamlessly integrated with existing approaches, improving accuracy by approximately **6.50\%$\uparrow$** on CREMA-D and **3.41\%$\uparrow$** on Kinetics-Sounds, without training set expansion or additional computational overhead during inference. The source code will be released soon.
Paperid:1088
Authors:Gabriel Thompson · Kai Yue · Chau-Wai Wong · Huaiyu Dai
Abstract:
Decentralized federated learning (DFL) is a collaborative machine learning framework for training a model across participants without a central server or raw data exchange. DFL faces challenges due to statistical heterogeneity, as participants often possess different data distributions that reflect local environments and user behaviors. Recent work has shown that the neural tangent kernel (NTK) approach, when applied to federated learning in a centralized framework, can lead to improved performance. We propose an approach leveraging the NTK to train client models in the decentralized setting, while introducing a synergy between NTK-based evolution and model averaging. This synergy exploits inter-model variance and improves both accuracy and convergence in heterogeneous settings. Empirical results demonstrate that our approach consistently achieves higher accuracy than baselines in highly heterogeneous settings, where other approaches often underperform. Additionally, it reaches target performance in 4.6 times fewer communication rounds. We evaluate our approach across multiple datasets, network topologies, and heterogeneity settings to ensure robustness and generalizability.
Paperid:1089
Authors:Xinhao Zheng · Huiqi Deng · Quanshi Zhang
Abstract:
This paper focuses on the fundamental challenge of partitioning input variables in attribution methods for Explainable AI, particularly in Shapley value-based approaches. Previous methods always compute attributions given a predefined partition but lack theoretical guidance on how to form meaningful variable partitions. We identify that attribution conflicts arise when the attribution of a coalition differs from the sum of its individual variables' attributions. To address this, we analyze the numerical effects of AND-OR interactions in AI models and extend the Shapley value to a new attribution metric for variable coalitions. Our theoretical findings reveal that specific interactions cause attribution conflicts, and we propose three metrics to evaluate coalition faithfulness. Experiments on synthetic data, NLP, image classification, and the game of Go validate our approach, demonstrating consistency with human intuition and practical applicability.
Paperid:1090
Authors:Sahil Goyal · Debapriya Tula · Gagan Jain · Pradeep Shenoy · Prateek Jain · Sujoy Paul
Abstract:
Recent advances in visual generation have made significant strides in producing content of exceptional quality. However, most methods suffer from a fundamental problem: a bottleneck in inference computational efficiency. Most of these algorithms involve multiple passes over a transformer model to generate tokens or denoise inputs. However, the model size is kept consistent throughout all iterations, which makes it computationally expensive. In this work, we aim to address this issue primarily through two key ideas - (a) not all parts of the generation process need equal compute, and we design a decode-time model scaling schedule to utilize compute effectively, and (b) we can cache and reuse some of the computation. Combining these two ideas leads to using smaller models to process more tokens while large models process fewer tokens. These different-sized models do not increase the parameter size, as they share parameters. We rigorously experiment with ImageNet 256$\times$256, UCF101, and Kinetics-600 to showcase the efficacy of the proposed method for image/video generation and frame prediction. Our experiments show that with almost $3\times$ less compute than baseline, our model obtains competitive performance.
Paperid:1091
Authors:Yannis Montreuil · Axel Carlier · Lai Xing Ng · Wei Tsang Ooi
Abstract:
Learning-to-Defer (L2D) facilitates optimal task allocation between AI systems and decision-makers. Despite its potential, we show that current two-stage L2D frameworks are highly vulnerable to adversarial attacks, which can misdirect queries or overwhelm decision agents, significantly degrading system performance. This paper conducts the first comprehensive analysis of adversarial robustness in _two-stage_ L2D frameworks. We introduce two novel attack strategies—_untargeted_ and _targeted_—that exploit inherent structural vulnerabilities in these systems. To mitigate these threats, we propose SARD, a robust, convex deferral algorithm rooted in Bayes and $(\mathcal{R},\mathcal{G})$-consistency. Our approach guarantees optimal task allocation under adversarial perturbations for all surrogates in the cross-entropy family. Extensive experiments on classification, regression, and multi-task benchmarks validate the robustness of SARD.
Paperid:1092
Authors:Zelin Xu · Yupu Zhang · Tingsong Xiao · Maitane Lizaso · Jose Gonzalez-Ondina · Zibo Liu · Shigang Chen · Zhe Jiang
Abstract:
Over 40\% of the global population lives within 100 kilometers of the coast, which contributes more than 8 trillion dollars annually to the global economy. Unfortunately, coastal ecosystems are increasingly vulnerable to more frequent and intense extreme weather events and rising sea levels due to climate change. Coastal scientists develop numerical models to simulate complex physical processes for forecasting water quality and storm surges. Such numerical models are slow and expensive. In recent years, deep learning has become a promising alternative to reduce the cost of physical models. However, progress has been hindered by the lack of a large-scale, high-resolution coastal simulation dataset to train and evaluate deep learning models. Existing studies often focus on relatively small datasets and simple processes. To fill this gap, we introduce a decade-long, high-resolution ($<$100m) coastal circulation modeling dataset on a real-world 3D mesh with around 6 million cells. The dataset contains key oceanography variables (e.g., current velocities, free surface level, temperature, salinity) alongside external atmospheric and river forcings. We evaluated a customized ViT model that takes initial and boundary conditions and external forcings and predicts ocean variables at varying lead times. The dataset provides an opportunity to benchmark novel deep learning models for large-scale, high-resolution regional simulations in coastal sciences.
Paperid:1093
Authors:Quan Wei · Chung-Yiu Yau · Hoi To Wai · Yang Zhao · Dongyeop Kang · Youngsuk Park · Mingyi Hong
Abstract:
Supervised fine-tuning is a standard method for adapting pre-trained large language models (LLMs) to downstream tasks. Quantization has been recently studied as a post-training technique for efficient LLM deployment. To obtain quantized fine-tuned LLMs, conventional pipelines would first fine-tune the pre-trained models, followed by post-training quantization. This often yields suboptimal performance as it fails to leverage the synergy between fine-tuning and quantization. To effectively realize low-bit quantization of weights, activations, and KV caches in LLMs, we propose an algorithm named Rotated Straight-Through-Estimator (RoSTE), which combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy that identifies an effective rotation configuration to reduce activation outliers. We provide theoretical insights on RoSTE by analyzing its prediction error when applied to an overparameterized least square quantized training problem. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration. Experiments on Pythia and Llama models of different sizes demonstrate the effectiveness of RoSTE. Compared to existing post-SFT quantization baselines, our method consistently achieves superior performance across various tasks and different LLM architectures.
Paperid:1094
Authors:Yupeng Hou · Jianmo Ni · Zhankui He · Noveen Sachdeva · Wang-Cheng Kang · Ed Chi · Julian McAuley · Derek Cheng
Abstract:
Generative recommendation (GR) is an emerging paradigm where user actions are tokenized into discrete token patterns and autoregressively generated as predictions. However, existing GR models tokenize each action independently, assigning the same fixed tokens to identical actions across all sequences without considering contextual relationships. This lack of context-awareness can lead to suboptimal performance, as the same action may hold different meanings depending on its surrounding context. To address this issue, we propose ActionPiece to explicitly incorporate context when tokenizing action sequences. In ActionPiece, each action is represented as a *set* of item features, which serve as the initial tokens. Given the action sequence corpora, we construct the vocabulary by merging feature patterns as new tokens, based on their co-occurrence frequency both within individual sets and across adjacent sets. Considering the unordered nature of feature sets, we further introduce set permutation regularization, which produces multiple segmentations of action sequences with the same semantics. Experiments on public datasets demonstrate that ActionPiece consistently outperforms existing action tokenization methods, improving NDCG@$10$ by $6.00$% to $12.82$%.
Paperid:1095
Authors:Tosca Lechner · Alex Bie · Gautam Kamath
Abstract:
We consider the question of learnability of distribution classes in the presence of adaptive adversaries, that is, adversaries capable of intercepting the samples requested by a learner and applying manipulations with full knowledge of the samples before passing them on to the learner. This stands in contrast to oblivious adversaries, who can only modify the underlying distribution the samples come from but not their i.i.d.\ nature. We formulate a general notion of learnability with respect to adaptive adversaries, taking into account the budget of the adversary. We show that learnability with respect to additive adaptive adversaries is a strictly stronger condition than learnability with respect to additive oblivious adversaries.
Paperid:1096
Authors:Yaxin Hou · Yuheng Jia
Abstract:
This paper studies long-tailed semi-supervised learning (LTSSL) with distribution mismatch, where the class distribution of the labeled training data follows a long-tailed distribution and mismatches with that of the unlabeled training data. Most existing methods introduce auxiliary classifiers (experts) to model various unlabeled data distributions and produce pseudo-labels, but the expertise of the various experts is not fully utilized. We observe that different experts are good at predicting different intervals of samples, e.g., the long-tailed expert is skilled in samples located in the head interval and the uniform expert excels in samples located in the medium interval. Therefore, we propose a dynamic expert assignment module that can estimate the class membership (i.e., head, medium, or tail class) of samples, and dynamically assigns the suitable expert to each sample based on the estimated membership to produce high-quality pseudo-labels in the training phase and predictions in the testing phase. We also theoretically reveal that integrating different experts' strengths will lead to a smaller generalization error bound. Moreover, we find that the deeper features are more biased toward the head class but with more discriminative ability, while the shallower features are less biased but also with less discriminative ability. We, therefore, propose a multi-depth feature fusion module to utilize different depth features to mitigate the model bias. Our method demonstrates its effectiveness through comprehensive experiments on the CIFAR-10-LT, STL-10-LT, and SVHN-LT datasets across various settings. The code has been submitted to the Supplementary Material.
Paperid:1097
Authors:Xinghe Fu · Zhiyuan Yan · Zheng Yang · Taiping Yao · Yandan Zhao · Shouhong Ding · Xi Li
Abstract:
Fake images, created by recently advanced generative models, have become increasingly indistinguishable from real ones, making their detection crucial, urgent, and challenging. This paper introduces PiD (Pixelwise Decomposition Residuals), a novel detection method that focuses on residual signals within images. Generative models are designed to optimize high-level semantic content (principal components), often overlooking low-level signals (residual components). PiD leverages this observation by disentangling residual components from raw images, encouraging the model to uncover more underlying and general forgery clues independent of semantic content. Compared to prior approaches that rely on reconstruction techniques or high-frequency information, PiD is computationally efficient and does not rely on any generative models for reconstruction. Specifically, PiD operates at the pixel level, mapping the pixel vector to another color space (e.g., YUV) and then quantizing the vector. The pixel vector is mapped back to the RGB space and the quantization loss is taken as the residual for AIGC detection. Our experimental results are striking and highly surprising: PiD achieves 98% accuracy on the widely used GenImage benchmark, highlighting its effectiveness and generalization performance.
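A minimal sketch of the pixel-level residual computation described above, assuming a standard BT.601 RGB-to-YUV transform and a uniform quantizer with a hypothetical step size; the paper's exact color space and quantizer may differ:

```python
import numpy as np

# BT.601 RGB -> YUV transform; its inverse maps YUV back to RGB.
RGB2YUV = np.array([[0.299, 0.587, 0.114],
                    [-0.14713, -0.28886, 0.436],
                    [0.615, -0.51499, -0.10001]])

def pixel_quantization_residual(img, step=8.0):
    """Map each RGB pixel to YUV, quantize on a uniform grid,
    map back to RGB, and return the quantization loss. This
    residual isolates low-level signal that generative models,
    which optimize high-level semantics, tend to reproduce poorly.
    The step size here is an illustrative choice."""
    yuv = img @ RGB2YUV.T                     # (..., 3) RGB -> YUV
    yuv_q = np.round(yuv / step) * step       # uniform quantization
    recon = yuv_q @ np.linalg.inv(RGB2YUV).T  # YUV -> RGB
    return img - recon
```

The residual is bounded by the quantization step (scaled by the inverse transform), so it is a small, largely semantics-free signal suitable as detector input.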
Paperid:1098
Authors:Shilong Tao · Zhe Feng · Haonan Sun · Zhanxing Zhu · Yunhuai Liu
Abstract:
Multi-solid systems are foundational to a wide range of real-world applications, yet modeling their complex interactions remains challenging. Existing deep learning methods predominantly rely on implicit modeling, where the factors influencing solid deformation are not explicitly represented but are instead indirectly learned. However, as the number of solids increases, these methods struggle to accurately capture intricate physical interactions. In this paper, we introduce a novel explicit modeling paradigm that incorporates factors influencing solid deformation through structured modules. Specifically, we present Unisoma, a unified and flexible Transformer-based model capable of handling variable numbers of solids. Unisoma directly captures physical interactions using contact modules and an adaptive interaction allocation mechanism, and learns the deformation through a triplet relationship. Compared to implicit modeling techniques, explicit modeling is better suited for multi-solid systems with diverse coupling patterns, as it enables detailed treatment of each solid while preventing information blending and confusion. Experimentally, Unisoma achieves consistent state-of-the-art performance across seven well-established datasets and two complex multi-solid tasks.
Paperid:1099
Authors:Tianyu Liu · kai sun · Fuchun Sun · Yu Luo · Yuanlong Zhang
Abstract:
Temporal sequences, even after stationarization, often exhibit \textbf{leptokurtic distributions} with fat tails and persistent distribution shifts. These properties destabilize feature dynamics, amplify model variance, and hinder convergence in time series forecasting. To address this, we propose MorphingFlow (MoF), a framework that combines a spline-based transform layer (Flow) and a test-time-trained method (Morph), which adaptively normalizes non-stationary, fat-tailed distributions while preserving critical extremal features. MoF ensures that inputs remain within a network’s effective activation space—a structured, normal-like distribution—even under distributional drift. Experiments across eight datasets show that MoF achieves state-of-the-art performance: with a simple linear backbone, it matches the performance of SOTA models on datasets such as Electricity and ETTh2. When paired with a patch-based Mamba architecture, MoF outperforms the closest competitor by 6.3\% on average and reduces forecasting errors in fat-tail datasets such as Exchange by 21.7\%. Notably, MoF acts as a plug-and-play module, boosting performance in existing models without architectural changes.
Paperid:1100
Authors:Aayushya Agarwal · Lawrence Pileggi · Gauri Joshi
Abstract:
Federated learning harnesses the power of distributed optimization to train a unified machine learning model across separate clients. However, heterogeneous data distributions and computational workloads can lead to inconsistent updates and limit model performance. This work tackles these challenges by proposing FedECADO, a new algorithm inspired by a dynamical system representation of the federated learning process. FedECADO addresses non-IID data distribution through an aggregate sensitivity model that reflects the amount of data processed by each client. To tackle heterogeneous computing, we design a multi-rate integration method with adaptive step-size selections that synchronizes active client updates in continuous time. Compared to prominent techniques, including FedProx, FedExp, and FedNova, FedECADO achieves higher classification accuracies in numerous heterogeneous scenarios.
Paperid:1101
Authors:Zhixin Li · Yuheng Jia · Hui LIU · Junhui Hou
Abstract:
Although deep clustering is an unsupervised approach without relying on labels, it still requires artificially designed supervision signals to train models suitable for clustering. Previous deep clustering methods have investigated various supervision signals like simple similarity and pseudo labels, but most of them lack an exploration into the training progress of each sample. In this work, we observe that the sample stability in unsupervised training progress is closely associated with clustering prediction and network memorization at the level of individual samples. Specifically, we discover that samples with unstable representations across different epochs are more likely to be mispredicted. Meanwhile, these samples are more difficult to memorize than regular samples and appear to be atypical. Based on these observations, we leverage sample stability to construct supervision signals for the first time. We propose a novel deep clustering method that learns from sample stability, termed LFSS. Experiments on various widely used benchmark datasets demonstrate that our method outperforms multiple state-of-the-art methods, verifying the effectiveness of our approach. Additionally, LFSS can be used as a plug-in to improve the performance of various deep clustering methods. The code is included in the Supplementary Material.
Paperid:1102
Authors:Kaijie Zhu · Xianjun Yang · Jindong Wang · Wenbo Guo · William Wang
Abstract:
Recent research has shown that LLM agents are vulnerable to indirect prompt injection (IPI) attacks, where malicious tasks embedded in tool-retrieved information can redirect the agent to take unauthorized actions. Existing defenses against IPI have significant limitations: they either require substantial model training resources, lack effectiveness against sophisticated attacks, or harm normal utility. We present MELON (Masked re-Execution and TooL comparisON), a novel IPI defense. Our approach builds on the observation that under a successful attack, the agent’s next action becomes less dependent on user tasks and more on malicious tasks. Following this, we design MELON to detect attacks by re-executing the agent’s trajectory with a masked user prompt modified through a masking function. We identify an attack if the actions generated in the original and masked executions are similar. We also include three key designs to reduce potential false positives and false negatives. Extensive evaluation on the IPI benchmark AgentDojo demonstrates that MELON outperforms SOTA defenses in both attack prevention and utility preservation. Moreover, we show that combining MELON with a SOTA prompt augmentation defense (denoted MELON-Aug) further improves its performance. We also conduct a detailed ablation study to validate our key designs.
Paperid:1103
Authors:Corinna Cortes · Anqi Mao · Mehryar Mohri · Yutao Zhong
Abstract:
Class imbalance remains a major challenge in machine learning, especially in multi-class problems with long-tailed distributions. Existing methods, such as data resampling, cost-sensitive techniques, and logistic loss modifications, though popular and often effective, lack solid theoretical foundations. As an example, we demonstrate that cost-sensitive methods are not Bayes consistent. This paper introduces a novel theoretical framework for analyzing generalization in imbalanced classification. We propose a new class-imbalanced margin loss function for both binary and multi-class settings, prove its strong $H$-consistency, and derive corresponding learning guarantees based on empirical loss and a new notion of class-sensitive Rademacher complexity. Leveraging these theoretical results, we devise novel and general learning algorithms, IMMAX (*Imbalanced Margin Maximization*), which incorporate confidence margins and are applicable to various hypothesis sets. While our focus is theoretical, we also present extensive empirical results demonstrating the effectiveness of our algorithms compared to existing baselines.
Paperid:1104
Authors:Letian Chen · Nina Moorman · Matthew Gombolay
Abstract:
Reinforcement learning (RL) has demonstrated compelling performance in robotic tasks, but its success often hinges on the design of complex, ad hoc reward functions. Researchers have explored how Large Language Models (LLMs) could enable non-expert users to specify reward functions more easily. However, LLMs struggle to balance the importance of different features, generalize poorly to out-of-distribution robotic tasks, and cannot represent the problem properly with only text-based descriptions. To address these challenges, we propose ELEMENTAL (intEractive LEarning froM dEmoNstraTion And Language), a novel framework that combines natural language guidance with visual user demonstrations to better align robot behavior with user intentions. By incorporating visual inputs, ELEMENTAL overcomes the limitations of text-only task specifications, while leveraging inverse reinforcement learning (IRL) to balance feature weights and match the demonstrated behaviors optimally. ELEMENTAL also introduces an iterative feedback loop through self-reflection to improve feature, reward, and policy learning. Our experimental results demonstrate that ELEMENTAL outperforms prior work by 42.3% on task success, and achieves 41.3% better generalization in out-of-distribution tasks, highlighting its robustness in LfD.
Paperid:1105
Authors:Gabriel Tseng · Anthony Fuller · Marlena Reil · Henry Herzog · Patrick Beukema · Favyen Bastani · James Green · Evan Shelhamer · Hannah Kerner · David Rolnick
Abstract:
From crop mapping to flood detection, machine learning in remote sensing has a wide range of societally beneficial applications. The commonalities between remote sensing data in these applications present an opportunity for pretrained machine learning models tailored to remote sensing to reduce the labeled data and effort required to solve individual tasks. However, such models must be: (i) flexible enough to ingest input data of varying sensor modalities and shapes (i.e., of varying spatial and temporal dimensions), and (ii) able to model Earth surface phenomena of varying scales and types. To fill this gap, we present Galileo, a family of pretrained remote sensing models designed to flexibly process multimodal remote sensing data. We also introduce a novel and highly effective self-supervised learning approach to learn both large- and small-scale features, a challenge not addressed by previous models. Our Galileo models obtain state-of-the-art results across diverse remote sensing tasks.
Paperid:1106
Authors:Yinxuan Huang · KE LIANG · Jingao Xu · Zhuofan Dong · Xiaodong Qu · Yue Han · Wang Tianxiang · Ye Wang · Bin Zhou
Abstract:
Graph-based social recommendation (SR) models suffer from various noises in social graphs, which hinder their recommendation performance. Both graph-level redundancy and graph-level missing edges alter the social graph structure, which in turn affects the message propagation of graph neural networks (GNNs). Generative models, especially diffusion-based models, are commonly used to reconstruct higher-quality data from noisy originals. Motivated by this, a few works have applied them to social recommendation. However, they can only handle isotropic Gaussian noise and fail to leverage anisotropic noise. Since anisotropic relational structures are common in social graphs, existing models cannot sufficiently utilize the graph structure, which constrains their capacity for noise removal and their recommendation performance. Compared to the diffusion strategy, the flow matching strategy shows better ability to handle data with anisotropic noise, since it better preserves data structure during learning. Inspired by this, we propose RecFlow, the first flow-matching-based SR model. Concretely, RecFlow performs flow matching on the structure representations of social graphs. A conditional learning procedure is then designed for optimization. Extensive experiments demonstrate the promising performance of RecFlow from six aspects: superiority, effectiveness, robustness, sensitivity, convergence, and visualization.
Paperid:1107
Authors:Matteo Saponati · Pascal J. Sager · Pau Vilimelis Aceituno · Thilo Stadelmann · Benjamin F. Grewe
Abstract:
Self-attention is essential to Transformer architectures, yet how information is embedded in the self-attention matrices and how different objective functions impact this process remains unclear. We present a mathematical framework to analyze self-attention matrices by deriving the structures governing their weight updates. Using this framework, we demonstrate that bidirectional training induces symmetry in the weight matrices, while autoregressive training results in directionality and column dominance. Our theoretical findings are validated across multiple Transformer models — including ModernBERT, GPT, LLaMA3, and Mistral — and input modalities like text, vision, and audio. Finally, we apply these insights by showing that symmetric initialization improves the performance of encoder-only models on language tasks. This mathematical analysis offers a novel theoretical perspective on how information is embedded through self-attention, thereby improving the interpretability of Transformer models.
Paperid:1108
Authors:Michal Wilinski · Mononito Goswami · Nina Żukowska · Willa Potosnak · Artur Dubrawski
Abstract:
Time series foundation models (TSFMs) promise to be powerful tools for a wide range of applications. However, their internal representations and learned concepts are still not well understood. In this study, we investigate the structure and redundancy of representations across various TSFMs, examining the self-similarity of model layers within and across different model sizes. This analysis reveals block-like redundancy in the representations, which can be utilized for informed pruning to improve inference speed and efficiency. Additionally, we explore the concepts learned by these models—such as periodicity and trends—and how these can be manipulated through latent space steering to influence model behavior. Our experiments show that steering interventions can introduce new features, e.g., adding periodicity or trends to signals that initially lacked them. These findings underscore the value of representational analysis for optimizing models and demonstrate how conceptual steering offers new possibilities for more controlled and efficient time series analysis with TSFMs.
Paperid:1109
Authors:Franck TALLA · Edouard Grave · Herve Jegou
Abstract:
We address the problem of extending a pretrained large language model to a new domain that was not seen during training. Standard techniques, such as fine-tuning or low-rank adaptation (LoRA), are successful at domain adaptation, but do not formally add capacity to the model. This often leads to a trade-off between performing well on the new domain and degrading performance on the original domain. Here, we propose to revisit and improve adapters to extend LLMs. Our paper analyzes this extension problem from three angles: data, architecture and training procedure, which are advantageously considered jointly. The resulting method, called neutral residues, modifies adapters in a way that leads each new residual block to output near-zero values on the original domain. This solution leads to strong results when adapting a state-of-the-art model originally trained on English to a new language. Neutral residues significantly outperform competing approaches such as fine-tuning, LoRA or vanilla adapters in terms of the trade-off between learning the new language and not forgetting English.
Paperid:1110
Authors:Kentaro Kanamori · Ken Kobayashi · Satoshi Hara · Takuya Takagi
Abstract:
Algorithmic recourse aims to provide a recourse action for altering an unfavorable prediction given by a model into a favorable one (e.g., loan approval). In practice, it is also desirable to ensure that an action makes the real-world outcome better (e.g., loan repayment). We call this requirement improvement. Unfortunately, existing methods cannot ensure improvement unless we know the true oracle. To address this issue, we propose a framework for suggesting improvement-oriented actions from the long-term perspective. Specifically, we introduce a new online learning task of assigning actions to a given sequence of instances. We assume that we can observe delayed feedback on whether the past suggested action achieved improvement. Using the feedback, we estimate an action that can achieve improvement for each instance. To solve this task, we propose two approaches based on contextual linear bandit and contextual Bayesian optimization. Experimental results demonstrated that our approaches could assign improvement-oriented actions to more instances than the existing methods.
Paperid:1111
Authors:Seungbeom Lee · Munsun Jo · Jungseul Ok · Dongwoo Kim
Abstract:
Deep learning-based structure-based drug design aims to generate ligand molecules with desirable properties for protein targets. While existing models have demonstrated competitive performance in generating ligand molecules, they primarily focus on learning the chemical distribution of training datasets, often lacking effective steerability to ensure the desired chemical quality of generated molecules. To address this issue, we propose a multi-reward optimization framework that fine-tunes generative models for attributes, such as binding affinity, validity, and drug-likeness, together. Specifically, we derive direct preference optimization for a Bayesian flow network, used as a backbone for molecule generation, and integrate a reward normalization scheme to adopt multiple objectives. Experimental results show that our method generates more realistic ligands than baseline models while achieving higher binding affinity, expanding the Pareto front empirically observed in previous studies.
Paperid:1112
Authors:Zhaolu Tim Liu · Robert Peach · Mauricio Barahona
Abstract:
Kernel-based hypothesis tests offer a flexible, non-parametric tool to detect high-order interactions in multivariate data, beyond pairwise relationships. Yet, their scalability is limited due to the computationally demanding permutation schemes used to generate null approximations. Here we introduce a family of permutation-free high-order tests for joint independence and partial factorisations of $d$ variables. Our tests eliminate the need for permutation-based approximations by leveraging V-statistics and a novel cross-centring technique to yield test statistics with a standard normal limiting distribution under the null. We present implementations of the tests and showcase their efficacy and scalability through synthetic datasets. We also show applications inspired by causal discovery and feature selection, which highlight both the importance of high-order interactions in data and the need for efficient computational methods.
Paperid:1113
Authors:Damian Smith · Harvey Mannering · Antonia Marcu
Abstract:
A common belief in the representational comparison literature is that if two representations can be functionally aligned, they must capture similar information. In this paper we focus on model stitching and show that models can be functionally aligned, but represent very different information. Firstly, we show that discriminative models with very different biases can be stitched together. We then show that models trained to solve entirely different tasks on different data modalities, and even clustered random noise, can be successfully stitched into MNIST- or ImageNet-trained models. We end with a discussion of the wider impact of our results on the community's current beliefs. Overall, our paper draws attention to the need to correctly interpret the results of such functional similarity measures and highlights the need for approaches that capture informational similarity.
Paperid:1114
Authors:Andrei Polubarov · Nikita Lyubaykin · Alexander Derevyagin · Ilya Zisman · Denis Tarasov · Alexander Nikulin · Vladislav Kurenkov
Abstract:
In-Context Reinforcement Learning (ICRL) represents a promising paradigm for developing generalist agents that learn at inference time through trial-and-error interactions, analogous to how large language models adapt contextually, but with a focus on reward maximization. However, the scalability of ICRL beyond toy tasks and single-domain settings remains an open challenge. In this work, we present the first steps toward scaling ICRL by introducing a fixed, cross-domain model capable of learning behaviors through in-context reinforcement learning. Our results demonstrate that Algorithm Distillation, a framework designed to facilitate ICRL, offers a compelling and competitive alternative to expert distillation to construct versatile action models. These findings highlight the potential of ICRL as a scalable approach for generalist decision-making systems.
Paperid:1115
Authors:Yunlong Hou · Fengzhuo Zhang · Cunxiao Du · Xuan Zhang · Jiachun Pan · Tianyu Pang · Chao Du · Vincent Tan · Zhuoran Yang
Abstract:
Speculative decoding has emerged as a popular method to accelerate the inference of Large Language Models (LLMs) while retaining their superior text generation performance. Previous methods either adopt a fixed speculative decoding configuration regardless of the prefix tokens, or train draft models in an offline or online manner to align them with the context. This paper proposes a training-free online learning framework to adaptively choose the configuration of the hyperparameters for speculative decoding as text is being generated. We first formulate this hyperparameter selection problem as a Multi-Armed Bandit problem and provide a general speculative decoding framework BanditSpec. Furthermore, two bandit-based hyperparameter selection algorithms, UCBSpec and EXP3Spec, are designed and analyzed in terms of a novel quantity, the stopping time regret. We upper bound this regret under both stochastic and adversarial reward settings. By deriving an information-theoretic impossibility result, it is shown that the regret performance of UCBSpec is optimal up to universal constants. Finally, extensive empirical experiments with LLaMA3 and Qwen2 demonstrate that our algorithms are effective compared to existing methods, and the throughput is close to the oracle best hyperparameter in simulated real-life LLM serving scenarios with diverse input prompts.
Paperid:1116
Authors:Xin Wang · Shengfei Lyu · Luo Chi · Xiren Zhou · Huanhuan Chen
Abstract:
A key challenge in personalized healthcare is identifying optimal intervention sequences to guide temporal systems toward target outcomes, a novel problem we formalize as counterfactual target achievement. In addressing this problem, directly adopting counterfactual estimation methods faces compounding errors due to the unobservability of counterfactuals. To overcome this, we propose Variational Counterfactual Intervention Planning (VCIP), which reformulates the problem by modeling the conditional likelihood of achieving target outcomes, implemented through variational inference. By leveraging the g-formula to bridge the gap between interventional and observational log-likelihoods, VCIP enables reliable training from observational data. Experiments on both synthetic and real-world datasets show that VCIP significantly outperforms existing methods in target achievement accuracy.
Paperid:1117
Authors:Gengshuo Chang · Wei Zhang · Lehan Zhang
Abstract:
Matrix completion aims to recover missing entries in a data matrix using a subset of observed entries. Previous studies show that side information can greatly enhance completion accuracy, but most assume perfect side information, which is rarely available in practice. In this paper, we propose an orthogonal complement matrix completion (OCMC) model to address the challenge of matrix completion with incomplete side information. The model leverages the orthogonal complement projection derived from the available side information, generalizing traditional perfect-side-information matrix completion to scenarios with incomplete side information. Moreover, using probably approximately correct (PAC) learning theory, we show that the sample complexity of the OCMC model decreases quadratically with the completeness level. To efficiently solve the OCMC model, a linearized Lagrangian algorithm is developed with convergence guarantees. Experimental results show that the proposed OCMC model outperforms state-of-the-art methods on both synthetic data and real-world applications.
Paperid:1118
Authors:Ruocheng Wang · Zhuo Xia · Ge Yan · Junchi Yan
Abstract:
Differential equations are essential and popular in science and engineering. Recently, learning-based methods, including neural operators, have emerged as a promising paradigm. Along this direction, we explore its quantum counterpart and propose QuanONet, a quantum neural operator. Such quantum methods have not been well studied in the literature compared with their counterparts in other machine learning areas, and we try to develop an in-depth understanding of their pros and cons. Specifically, we design the architecture as a hardware-efficient and light-weight ansatz without impractical assumptions. Its circuit is purely quantum and thus more readily runs on quantum computers than classic-quantum hybrid circuits. By grounding itself in the operator approximation theorem for its quantum counterpart, as proved in this paper, QuanONet can fit differential equation operators to achieve general utility. In particular, we propose its improved version TF-QuanONet, which overcomes the difficulty of setting coefficients and improves robustness. Empirical results on various problems, including antiderivative operators and diffusion-reaction systems, demonstrate that QuanONet not only has strong approximation capabilities but also offers significant improvements in generalization ability compared with other quantum methods.
Paperid:1119
Authors:Genghan Zhang · Weixin Liang · Olivia Hsu · Kunle Olukotun
Abstract:
ML libraries, often written in architecture-specific programming languages (ASPLs) that target domain-specific architectures, are key to efficient ML systems. However, writing these high-performance ML libraries is challenging because it requires expert knowledge of ML algorithms and the ASPL. Large language models (LLMs), on the other hand, have shown general coding capabilities. However, challenges remain when using LLMs for generating ML libraries using ASPLs because 1) this task is complicated even for experienced human programmers and 2) there are limited code examples because of the esoteric and evolving nature of ASPLs. Therefore, LLMs need complex reasoning with limited data in order to complete this task. To address these challenges, we introduce an adaptive self-improvement agentic system. In order to evaluate the effectiveness of our system, we construct a benchmark of a typical ML library and generate ASPL code with both open and closed-source LLMs on this benchmark. Our results show improvements of up to $3.9\times$ over a baseline single LLM.
Paperid:1120
Authors:Zhaohong Huang · Yuxin Zhang · JingJing Xie · Fei Chao · Rongrong Ji
Abstract:
Recent advances in test-time adaptation (TTA) for Vision-Language Models (VLMs) have garnered increasing attention, particularly through the use of multiple augmented views of a single image to boost zero-shot generalization. Unfortunately, existing methods fail to strike a satisfactory balance between performance and efficiency, either due to excessive overhead of tuning text prompts or unstable benefits from handcrafted, training-free visual feature enhancement. In this paper, we present Global-Spatial Bias Learner (GS-Bias), an efficient and effective TTA paradigm that incorporates two learnable biases during TTA, unfolded as the global bias and spatial bias. Particularly, the global bias captures the global semantic features of a test image by learning consistency across augmented views, while spatial bias learns the semantic coherence between regions in the image’s spatial visual representation. It is worth highlighting that these two sets of biases are directly added to the logits output by the pretrained VLMs, which circumvents the full backpropagation through the VLM that hinders the efficiency of existing TTA methods. This endows GS-Bias with extremely high efficiency while achieving state-of-the-art performance on 15 benchmark datasets. For example, it achieves a 2.23% improvement over TPT in cross-dataset generalization and a 2.72% improvement in domain generalization, while requiring only 6.5% of TPT's memory usage on ImageNet.
Paperid:1121
Authors:Georgy Noarov · Riccardo Fogliato · Martin A Bertran · Aaron Roth
Abstract:
We study the design of adaptive, sequential experiments for unbiased average treatment effect (ATE) estimation in the design-based potential outcomes setting. Our goal is to develop adaptive designs offering *sublinear Neyman regret*, meaning their efficiency must approach that of the hindsight-optimal nonadaptive design. Recent work [Dai et al, 2023] introduced ClipOGD, the first method achieving $\widetilde{O}(\sqrt{T})$ expected Neyman regret under mild conditions. In this work, we propose adaptive designs with substantially stronger Neyman regret guarantees. In particular, we modify ClipOGD to obtain anytime $\widetilde{O}(\log T)$ Neyman regret under natural boundedness assumptions. Further, in the setting where experimental units have pre-treatment covariates, we introduce and study a class of contextual ``multigroup'' Neyman regret guarantees: Given a set of possibly overlapping groups based on the covariates, the adaptive design outperforms each group's best non-adaptive design. In particular, we develop a contextual adaptive design with $\widetilde{O}(\sqrt{T})$ anytime multigroup Neyman regret. We empirically validate the proposed designs through an array of experiments.
Paperid:1122
Authors:Hongzhi Huang · Defa Zhu · Banggu Wu · Zeng · Ya Wang · Qiyang Min · zhou Xun
Abstract:
Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance remains underexplored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples input and output vocabularies to improve language modeling performance. Specifically, our approach expands input vocabularies to leverage multi-gram tokens. Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Our findings establish tokenization as a critical factor in scaling laws and offer new insights into tokenizer design, paving the way for more efficient and powerful LLMs.
Paperid:1123
Authors:Yunshu Dai · Jianwei Fei · Fangjun Huang · Chip Hong Chang
Abstract:
As AI generative models evolve, face swap technology has become increasingly accessible, raising concerns over potential misuse. Celebrities may be manipulated without consent, and ordinary individuals may fall victim to identity fraud. To address these threats, we propose Secure Swap, a method that protects persons of interest (POI) from face-swapping abuse and embeds a unique, invisible watermark into non-POI swapped images for traceability. By introducing an ID Passport layer, Secure Swap redacts POI faces and generates watermarked outputs for non-POI. A detachable watermark encoder and decoder are trained with the model to ensure provenance tracing. Experimental results demonstrate that Secure Swap not only preserves face swap functionality but also effectively prevents unauthorized swaps of POI and detects different embedded models' watermarks with high accuracy. Specifically, our method achieves a 100% success rate in protecting POI and over 99% watermark extraction accuracy for non-POI. Besides fidelity and effectiveness, the robustness of protected models against image-level and model-level attacks in both online and offline application scenarios is also experimentally demonstrated.
Paperid:1124
Authors:Anh Nhu · Sanghyun Son · Ming Lin
Abstract:
In this work, we introduce Time-Aware World Model, a model-based approach designed to explicitly incorporate the temporal dynamics of environments. By conditioning on the time step size, $\Delta t$, and training over a diverse range of $\Delta t$ values - rather than relying on a fixed time step size - our model enables learning of both high- and low-frequency task dynamics in real-world control problems. Inspired by the information-theoretic principle that the optimal sampling rate varies depending on the underlying dynamics of different physical systems, our time-aware model enhances both performance and learning efficiency. Empirical evaluations demonstrate that our model consistently outperforms baseline approaches across different observation rates in various control tasks, using the same number of training samples and iterations. We will release our source code on GitHub upon the acceptance of the paper.
Paperid:1125
Authors:Jun Chen · Hong Chen · Yonghua Yu · Yiming Ying
Abstract:
In recent years, contrastive learning has achieved state-of-the-art performance in the territory of self-supervised representation learning. Many previous works have attempted to provide the theoretical understanding underlying the success of contrastive learning. Almost all of them rely on a default assumption, i.e., the label consistency assumption, which may not hold in practice (the probability of failure is called labeling error) due to the strength and randomness of common augmentation strategies, such as random resized crop (RRC). This paper investigates the theoretical impact of labeling error on the downstream classification performance of contrastive learning. We first reveal several significant negative impacts of labeling error on downstream classification risk. To mitigate these impacts, a data dimensionality reduction method (e.g., singular value decomposition, SVD) is applied to the original data to reduce false positive samples, and we establish both theoretical and empirical evaluations. Moreover, it is also found that SVD acts as a double-edged sword, which may lead to the deterioration of downstream classification accuracy due to the reduced connectivity of the augmentation graph. Based on the above observations, we suggest using a moderate embedding dimension (such as $512$ or $1024$ in our experiments), data inflation, weak augmentation, and SVD to ensure large graph connectivity and small labeling error, thereby improving model performance.
Paperid:1126
Authors:Marta Gentiloni Silveri · Antonio Ocello
Abstract:
Score-based Generative Models (SGMs) aim to sample from a target distribution by learning score functions using samples perturbed by Gaussian noise. Existing convergence bounds for SGMs in the $\mathcal{W}_2$-distance rely on stringent assumptions about the data distribution. In this work, we present a novel framework for analyzing $\mathcal{W}_2$-convergence in SGMs, significantly relaxing traditional assumptions such as log-concavity and score regularity. Leveraging the regularization properties of the Ornstein-Uhlenbeck (OU) process, we show that weak log-concavity of the data distribution evolves into log-concavity over time. This transition is rigorously quantified through a PDE-based analysis of the Hamilton-Jacobi-Bellman equation governing the log-density of the forward process. Moreover, we establish that the drift of the time-reversed OU process alternates between contractive and non-contractive regimes, reflecting the dynamics of concavity. Our approach circumvents the need for stringent regularity conditions on the score function and its estimators, relying instead on milder, more practical assumptions. We demonstrate the wide applicability of this framework through explicit computations on Gaussian mixture models, illustrating its versatility and potential for broader classes of data distributions.
Paperid:1127
Authors:Johnny Xi · Hugh Dance · Peter Orbanz · Benjamin Bloem-Reddy
Abstract:
Bivariate structural causal models (SCM) are often used to infer causal direction by examining their goodness-of-fit under restricted model classes. In this paper, we describe a parametrization of bivariate SCMs in terms of a causal velocity, by viewing the cause variable as time in a dynamical system. The velocity implicitly defines counterfactual curves via the solution of initial value problems where the observation specifies the initial condition. Using tools from measure transport, we obtain a unique correspondence between SCMs and the score function of the generated distribution via its causal velocity. Based on this, we derive an objective function that directly regresses the velocity against the score function, the latter of which can be estimated non-parametrically from observational data. We use this to develop a method for bivariate causal discovery that extends beyond known model classes such as additive or location-scale noise, and that requires no assumptions on the noise distributions. When the score is estimated well, the objective is also useful for detecting model non-identifiability and misspecification. We present positive results in simulation and benchmark experiments where many existing methods fail, and perform ablation studies to examine the method's sensitivity to accurate score estimation.
Paperid:1128
Authors:Chenchen Gu · Xiang Li · Rohith Kuditipudi · Percy Liang · Tatsunori Hashimoto
Abstract:
Prompt caching in large language models (LLMs) results in data-dependent timing variations: cached prompts are processed faster than non-cached prompts. These timing differences introduce the risk of side-channel timing attacks. For example, if the cache is shared across users, an attacker could identify cached prompts from fast API response times to learn information about other users' prompts. Because prompt caching may cause privacy leakage, transparency around the caching policies of API providers is important. To this end, we develop and conduct statistical audits to detect prompt caching in real-world LLM API providers. We detect global cache sharing across users in seven API providers, including OpenAI, resulting in potential privacy leakage about users' prompts. Timing variations due to prompt caching can also result in leakage of information about model architecture. Namely, we find evidence that OpenAI's embedding model is a decoder-only Transformer, which was previously not publicly known.
Paperid:1129
Authors:Mei Li · Yuxiang Lu · Qinyan Dai · Suizhi Huang · Yue Ding · Hongtao Lu
Abstract:
Continual Learning (CL) strives to learn incrementally across tasks while mitigating catastrophic forgetting. A key challenge in CL is balancing stability (retaining prior knowledge) and plasticity (learning new tasks). While representative gradient projection methods ensure stability, they often limit plasticity. Model merging techniques offer promising solutions, but prior methods typically rely on empirical assumptions and carefully selected hyperparameters. In this paper, we explore the potential of model merging to enhance the stability-plasticity trade-off, providing theoretical insights that underscore its benefits. Specifically, we reformulate the merging mechanism using Bayesian continual learning principles and derive a closed-form solution for the optimal merging coefficient that adapts to the diverse characteristics of tasks. To validate our approach, we introduce a two-stage framework named BECAME, which synergizes the expertise of gradient projection and adaptive merging. Extensive experiments show that our approach outperforms state-of-the-art CL methods and existing merging strategies.
Paperid:1130
Authors:Zhenqiao Song · Tianxiao Li · Lei Li · Martin Min
Abstract:
Designing protein-binding proteins with high affinity is critical in biomedical research and biotechnology. Despite recent advancements targeting specific proteins, the ability to create high-affinity binders for arbitrary protein targets on demand, without extensive rounds of wet-lab testing, remains a significant challenge. Here, we introduce PPDiff, a diffusion model to jointly design the sequence and structure of binders for arbitrary protein targets in a non-autoregressive manner. PPDiff builds upon our developed Sequence Structure Interleaving Network with Causal attention layers (SSINC), which integrates interleaved self-attention layers to capture global amino acid correlations, $k$-nearest neighbor ($k$NN) equivariant graph convolutional layers to model local interactions in three-dimensional (3D) space, and causal attention layers to simplify the intricate interdependencies within the protein sequence. To assess PPDiff, we curate PPBench, a general protein-protein complex dataset comprising 706,360 complexes from the Protein Data Bank (PDB). The model is pretrained on PPBench and finetuned on two real-world applications: target-protein mini-binder complex design and antigen-antibody complex design. PPDiff consistently surpasses baseline methods, achieving success rates of 50.00\%, 23.16\%, and 16.89\% for the pretraining task and the two downstream applications, respectively.
Paperid:1131
Authors:Yizhou Zhang · Yian Ma · Eric Mazumdar
Abstract:
We consider the problem of learning to exploit learning algorithms through repeated interactions in games. Specifically, we focus on the case of repeated two player, finite-action games, in which an optimizer aims to steer a no-regret learner to a Stackelberg equilibrium without knowledge of its payoffs. We first show that this is impossible if the optimizer only knows that the learner is using an algorithm from the general class of no-regret algorithms. This suggests that the optimizer requires more information about the learner's objectives or algorithm to successfully exploit them. Building on this intuition, we reduce the problem for the optimizer to that of recovering the learner's payoff structure. We demonstrate the effectiveness of this approach if the learner’s algorithm is drawn from a smaller class by analyzing two examples: one where the learner uses an ascent algorithm, and another where the learner uses stochastic mirror ascent with known regularizer and step sizes.
Paperid:1132
Authors:Qinggang Zhang · Hao Chen · Junnan Dong · Shengyuan Chen · Feiran Huang · Xiao Huang
Abstract:
Recent advancements in large language models (LLMs) have shown promise in bridging the gap between natural language queries and database management systems, enabling users to interact with databases without a background in SQL. However, LLMs often struggle to fully exploit and comprehend the user intention and complex structures of databases. Decomposition-based methods have been proposed to enhance the performance of LLMs on complex tasks, but decomposing SQL generation into subtasks is non-trivial due to the declarative structure of SQL syntax and the intricate connections between query concepts and database elements. In this paper, we propose a novel Structure GUided text-to-SQL framework (SGU-SQL) that incorporates syntax-based prompting to enhance the SQL generation capabilities of LLMs. Specifically, SGU-SQL establishes structure-aware links between user queries and database schema and recursively decomposes the complex generation task using syntax-based prompting to guide LLMs in incrementally constructing target SQLs. Extensive experiments on two benchmark datasets demonstrate that SGU-SQL consistently outperforms state-of-the-art text-to-SQL baselines.
Paperid:1133
Authors:Yik Siu Chan · Narutatsu Ri · Yuxin Xiao · Marzyeh Ghassemi
Abstract:
Despite extensive safety alignment efforts, large language models (LLMs) remain vulnerable to jailbreak attacks that elicit harmful behavior. While existing studies predominantly focus on attack methods that require technical expertise, two critical questions remain underexplored: (1) Are jailbroken responses truly useful in enabling average users to carry out harmful actions? (2) Do safety vulnerabilities exist in more common, simple human-LLM interactions? In this paper, we demonstrate that LLM responses most effectively facilitate harmful actions when they are both *actionable* and *informative*---two attributes easily elicited in multi-step, multilingual interactions. Using this insight, we propose HarmScore, a jailbreak metric that measures how effectively an LLM response enables harmful actions, and Speak Easy, a simple multi-step, multilingual attack framework. Notably, by incorporating Speak Easy into direct request and jailbreak baselines, we see an average absolute increase of $0.319$ in Attack Success Rate and $0.426$ in HarmScore in both open-source and proprietary LLMs across four safety benchmarks. Our work reveals a critical yet often overlooked vulnerability: Malicious users can easily exploit common interaction patterns for harmful intentions.
Paperid:1134
Authors:Huayu Chen · Kai Jiang · Kaiwen Zheng · Jianfei Chen · Hang Su · Jun Zhu
Abstract:
Classifier-Free Guidance (CFG) has been a default technique in various visual generative models, yet it requires inference from both conditional and unconditional models during sampling. We propose to build visual models that are free from guided sampling. The resulting algorithm, Guidance-Free Training (GFT), matches the performance of CFG while reducing sampling to a single model, halving the computational cost. Unlike previous distillation-based approaches that rely on pretrained CFG networks, GFT enables training directly from scratch. GFT is simple to implement. It retains the same maximum likelihood objective as CFG and differs mainly in the parameterization of conditional models. Implementing GFT requires only minimal modifications to existing codebases, as most design choices and hyperparameters are directly inherited from CFG. Our extensive experiments across five distinct visual models demonstrate the effectiveness and versatility of GFT. Across domains of diffusion, autoregressive, and masked-prediction modeling, GFT consistently achieves comparable or even lower FID scores, with similar diversity-fidelity trade-offs compared with CFG baselines, all while being guidance-free.
Paperid:1135
Authors:Chi-Ning Chou · Hang Le · Yichen Wang · SueYeon Chung
Abstract:
The ability to integrate task-relevant information into neural representations is a fundamental aspect of both human and machine intelligence. Recent theoretical studies have focused on investigating whether a network has learned task-relevant features (i.e., the rich regime) or not (i.e., the lazy regime, where the trained network is equivalent to a random feature model). However, the simple lazy-versus-rich dichotomy overlooks the potential for various subtypes of feature learning driven by variations in learning algorithms, network architectures, and data properties. In this work, we present a framework to study feature learning through the representational geometry of neural manifolds. Instead of directly analyzing the learned features, we focus on characterizing how task-relevant manifolds evolve during feature learning. In particular, we find in both theoretical and empirical settings that task-relevant manifolds untangle when a network learns features, and the changes in manifold geometry uncover distinct learning stages and strategies beyond the lazy-versus-rich dichotomy. Our approach provides novel insights into feature learning across neuroscience and machine learning, including structural inductive biases in neural circuits and out-of-distribution generalization.
Paperid:1136
Authors:Rachith Aiyappa · Xin Wang · Munjung Kim · Ozgur Can Seckin · Yong-Yeol Ahn · Sadamori Kojaku
Abstract:
Link prediction---a task of distinguishing actual hidden edges from random unconnected node pairs---is one of the quintessential tasks in graph machine learning. Despite being widely accepted as a universal benchmark and a downstream task for representation learning, the link prediction benchmark's validity has rarely been questioned. Here, we show that the common edge sampling procedure in the link prediction task has an implicit bias toward high-degree nodes. This produces a highly skewed evaluation that favors methods overly dependent on node degree. In fact, a ``null'' link prediction method based solely on node degree can yield nearly optimal performance in this setting. We propose a degree-corrected link prediction benchmark that offers a more reasonable assessment and better aligns with the performance on the recommendation task. Finally, we demonstrate that the degree-corrected benchmark can more effectively train graph machine-learning models by reducing overfitting to node degrees and facilitating the learning of relevant structures in graphs.
Paperid:1137
Authors:Xinxu Wei · Kanhao Zhao · Yong Jiao · Hua Xie · Lifang He · Yu Zhang
Abstract:
Effectively utilizing extensive unlabeled high-density EEG data to improve performance in scenarios with limited labeled low-density EEG data presents a significant challenge. In this paper, we address this challenge by formulating it as a graph transfer learning and knowledge distillation problem. We propose a Unified Pre-trained Graph Contrastive Masked Autoencoder Distiller, named EEG-DisGCMAE, to bridge the gap between unlabeled and labeled as well as high- and low-density EEG data. Our approach introduces a novel unified graph self-supervised pre-training paradigm, which seamlessly integrates the graph contrastive pre-training with the graph masked autoencoder pre-training. Furthermore, we propose a graph topology distillation loss function, allowing a lightweight student model trained on low-density data to learn from a teacher model trained on high-density data during pre-training and fine-tuning. This method effectively handles missing electrodes through contrastive distillation. We validate the effectiveness of EEG-DisGCMAE across four classification tasks using two clinical EEG datasets with abundant data.
Paperid:1138
Authors:Philipp Höllmer · Thomas Egg · Maya Martirossyan · Eric Fuemmeler · Amit Gupta · Zeren Shui · Pawan Prakash · Adrian Roitberg · Mingjie Liu · George Karypis · Mark Transtrum · Richard Hennig · Ellad Tadmor · Stefano Martiniani
Abstract:
The discovery of new materials is essential for enabling technological advancements. Computational approaches for predicting novel materials must effectively learn the manifold of stable crystal structures within an infinite design space. We introduce Open Materials Generation (OMG), a state-of-the-art framework for the generative design and discovery of inorganic crystalline materials. OMG employs stochastic interpolants (SI) to bridge an arbitrary base distribution to the target distribution of inorganic crystals via a broad class of tunable stochastic processes, encompassing both diffusion models and flow matching as special cases. In this work, we adapt the SI framework by integrating an equivariant graph representation of crystal structures and extending it to account for periodic boundary conditions in unit cell representations. Additionally, we couple the SI flow over spatial coordinates and lattice vectors with discrete flow matching for atomic species. We benchmark OMG's performance on two tasks: Crystal Structure Prediction (CSP) for specified compositions and ‘de novo’ generation (DNG) aimed at discovering stable, novel, and unique structures. We validate our generated structures by performing structural relaxation with machine-learned force fields. In our ground-up implementation of OMG, we refine and extend both CSP and DNG metrics compared to previous works. OMG establishes a new state-of-the-art in generative modeling for materials discovery, outperforming purely flow-based and diffusion-based implementations. These results underscore the importance of designing flexible deep learning frameworks to accelerate progress in materials science.
Paperid:1139
Authors:Puning Zhao · Rongfei Fan · Shaowei Wang · Li Shen · Qixin Zhang · Zong Ke · Tianhang Zheng
Abstract:
Nonparametric contextual bandit is an important model of sequential decision making problems. Under the $\alpha$-Tsybakov margin condition, existing research has established a regret bound of $\tilde{O}\left(T^{1-\frac{\alpha+1}{d+2}}\right)$ for bounded supports. However, the optimal regret with unbounded contexts has not been analyzed. The challenge of solving contextual bandit problems with unbounded support is to achieve both exploration-exploitation tradeoff and bias-variance tradeoff simultaneously. In this paper, we solve the nonparametric contextual bandit problem with unbounded contexts. We propose two nearest neighbor methods combined with UCB exploration. The first method uses a fixed $k$. Our analysis shows that this method achieves minimax optimal regret under a weak margin condition and relatively light-tailed context distributions. The second method uses adaptive $k$. By a proper data-driven selection of $k$, this method achieves an expected regret of $\tilde{O}\left(T^{1-\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta}}+T^{1-\beta}\right)$, in which $\beta$ is a parameter describing the tail strength. This bound matches the minimax lower bound up to logarithm factors, indicating that the second method is approximately optimal.
Paperid:1140
Authors:Thomas Lee · William Toner · Rajkarn Singh · Artjom Joosen · Martin Asenov
Abstract:
Foundation models (FMs) have emerged as a promising approach for time series forecasting. While effective, FMs typically remain fixed during deployment due to the high computational costs of learning them online. Consequently, deployed FMs fail to adapt their forecasts to current data characteristics, despite the availability of online feedback from newly arriving data. This raises the question of whether FM performance can be enhanced by the efficient usage of this feedback. We propose AdapTS to answer this question. AdapTS is a lightweight mechanism for the online adaptation of FM forecasts in response to online feedback. AdapTS consists of two parts: a) the AdapTS-Forecaster, which is used to learn the current data distribution; and b) the AdapTS-Weighter, which is used to combine the forecasts of the FM and the AdapTS-Forecaster. We evaluate the performance of AdapTS in conjunction with several recent FMs across a suite of standard time series datasets. In all of our experiments we find that using AdapTS improves performance. This work demonstrates how efficient usage of online feedback can be used to improve FM forecasts.
Paperid:1141
Authors:Aoran Wang · Xinnan Dai · Jun Pang
Abstract:
Existing methods for inferring latent relational structures struggle to integrate partial prior knowledge, such as known edges, node-degree constraints, and global sparsity, without destabilizing training or conflicting with probabilistic assumptions. We propose Soft-Gated Structural Inference (SGSI), a VAE framework that seamlessly incorporates domain constraints via (1) soft gating with learnable edge masks to preserve gradients, (2) cloning-clamping of deterministic edges to avoid distributional conflicts, and (3) adaptive regularization to balance data-driven learning with domain constraints. By excluding known edges from stochastic inference, SGSI reallocates capacity to uncertain interactions, optimizing the information bottleneck trade-off. Experiments on 16 datasets show SGSI improves edge recovery by up to $9$\% AUROC over baselines, scales to larger graphs ($94.2$\% AUROC), and maintains stable training. SGSI bridges domain expertise with data-driven learning, enabling interpretable and robust structural discovery in dynamical systems.
Paperid:1142
Authors:Dejia Xu · Yifan Jiang · Chen Huang · Liangchen Song · Thorsten Gernoth · Liangliang Cao · Zhangyang “Atlas” Wang · Hao Tang
Abstract:
In recent years there have been remarkable breakthroughs in image-to-video generation. However, the 3D consistency and camera controllability of generated frames have remained unsolved. Recent studies have attempted to incorporate camera control into the generation process, but their results are often limited to simple trajectories or lack the ability to generate consistent videos from multiple distinct camera paths for the same scene. To address these limitations, we introduce Cavia, a novel framework for camera-controllable, multi-view video generation, capable of converting an input image into multiple spatiotemporally consistent videos. Our framework extends the spatial and temporal attention modules into view-integrated attention modules, improving both viewpoint and temporal consistency. This flexible design allows for joint training with diverse curated data sources, including scene-level static videos, object-level synthetic multi-view dynamic videos, and real-world monocular dynamic videos. To the best of our knowledge, Cavia is the first framework that enables users to generate multiple videos of the same scene with precise control over camera motion, while simultaneously preserving object motion. Extensive experiments demonstrate that Cavia surpasses state-of-the-art methods in terms of geometric consistency and perceptual quality.
Paperid:1143
Authors:Shuang Zeng · Yunwen Lei
Abstract:
Decentralized SGD (D-SGD) is a popular optimization method to train large-scale machine learning models. In this paper, we study the generalization behavior of D-SGD for both smooth and nonsmooth problems by leveraging the algorithm stability. For convex and smooth problems, we develop stability bounds involving the training errors to show the benefit of optimization in generalization. This improves the existing results by removing the Lipschitzness assumption and implying fast rates in a low-noise condition. We also develop the first optimal stability-based generalization bounds for D-SGD applied to nonsmooth problems. We further develop optimization error bounds which imply minimax optimal excess risk rates. Our novelty in the analysis consists of an error decomposition to use the co-coercivity of functions as well as the control of a neighboring-consensus error.
Paperid:1144
Authors:HamidReza Imani · Jiaxin Peng · Peiman Mohseni · Abdolah Amirany · Tarek El-Ghazawi
Abstract:
The deployment of mixture-of-experts (MoE) large language models (LLMs) presents significant challenges due to their high memory demands and computational complexity. These challenges are further exacerbated in multi-tenant environments, where shared resources must accommodate multiple models, limiting the effectiveness of conventional virtualization techniques. This paper addresses the problem of efficiently serving multiple fine-tuned MoE-LLMs on a single GPU. We propose a novel serving system that employs \textit{similarity-based expert consolidation} to reduce memory overhead by sharing similar experts across models. To ensure output quality, we introduce \textit{runtime partial reconfiguration}, dynamically replacing non-expert layers when processing requests from different models. As a result, our approach achieves a competitive output quality while maintaining throughput comparable to serving a single model while incurring only a negligible increase in time-to-first-token (TTFT). Experiments on a server with a single NVIDIA A100 GPU (80GB) using Mixtral-8x7B models demonstrate an 85\% average reduction in turnaround time compared to NVIDIA’s multi-instance GPU (MIG), highlighting the efficiency of our approach.
Paperid:1145
Authors:Tian Bai · Yue Zhao · Xiang Yu · Archer Yang
Abstract:
Selecting high-quality candidates from large datasets is critical in applications such as drug discovery, precision medicine, and alignment of large language models (LLMs). While Conformal Selection (CS) provides rigorous uncertainty quantification, it is limited to univariate responses and scalar criteria. To address this, we propose Multivariate Conformal Selection (mCS), a generalization of CS designed for multivariate response settings. Our method introduces regional monotonicity and employs multivariate nonconformity scores to construct conformal $p$-values, enabling finite-sample False Discovery Rate (FDR) control. We present two variants: $\texttt{mCS-dist}$, using distance-based scores, and $\texttt{mCS-learn}$, which learns optimal scores via differentiable optimization. Experiments on simulated and real-world datasets demonstrate that mCS significantly improves selection power while maintaining FDR control, establishing it as a robust framework for multivariate selection tasks.
Paperid:1146
Authors:Yusong Wang · Shiyin Tan · Jialun Shen · Yicheng Xu · Haobo Song · Qi Xu · Prayag Tiwari · Mingkun Xu
Abstract:
Graph Contrastive Learning (GCL) improves Graph Neural Network (GNN)-based protein representation learning by enhancing its generalization and robustness. Existing GCL approaches for protein representation learning rely on 2D topology, where graph augmentation is solely based on topological features, ignoring the intrinsic biological properties of proteins. Besides, 3D structure-based protein graph augmentation remains unexplored, despite proteins inherently exhibiting 3D structures. To bridge this gap, we propose novel biology-aware graph augmentation strategies from the perspective of invariance and integrate them into the protein GCL framework. Specifically, we introduce Functional Community Invariance (FCI)-based graph augmentation, which employs spectral constraints to preserve topology-driven community structures while incorporating residue-level chemical similarity as edge weights to guide edge sampling and maintain functional communities. Furthermore, we propose 3D Protein Structure Invariance (3-PSI)-based graph augmentation, leveraging dihedral angle perturbations and secondary structure rotations to retain critical 3D structural information of proteins while diversifying graph views. Extensive experiments on four different protein-related tasks demonstrate the superiority of our proposed GCL protein representation learning framework.
Paperid:1147
Authors:Benjamin Feuer · Chinmay Hegde
Abstract:
Language model (LLM) post-training can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data generating models and LLM judges. To close this gap, we introduce WildChat-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating Re-Wild, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples.
Paperid:1148
Authors:Sebastian Farquhar · Vikrant Varma · David Lindner · David Elson · Caleb Biddulph · Ian Goodfellow · Rohin Shah
Abstract:
Future advanced AI systems may learn sophisticated strategies through reinforcement learning (RL) that humans cannot understand well enough to safely evaluate. We propose a training method which avoids agents learning undesired multi-step plans that receive high reward (multi-step "reward hacks") even if humans are not able to detect that the behavior is undesired. The method, Myopic Optimization with Non-myopic Approval (MONA), works by combining short-sighted optimization with far-sighted reward. We demonstrate that MONA can prevent multi-step reward hacking that ordinary RL causes, even without being able to detect the reward hacking and without any extra information that ordinary RL does not get access to. We study MONA empirically in three settings which model different misalignment failure modes, including 2-step environments with LLMs representing delegated oversight and encoded reasoning, and longer-horizon gridworld environments representing sensor tampering.
Paperid:1149
Authors:Taiyu Ban · Changxin Rong · Xiangyu Wang · Lyuzhou Chen · Xin Wang · Derui Lyu · Qinrui Zhu · Huanhuan Chen
Abstract:
Differentiable structure learning of causal directed acyclic graphs (DAGs) is an emerging field in causal discovery, leveraging powerful neural learners. However, the incorporation of ancestral constraints, essential for representing abstract prior causal knowledge, remains an open research challenge. This paper addresses this gap by introducing a generalized framework for integrating ancestral constraints. Specifically, we identify two key issues: the non-equivalence of relaxed characterizations for representing path existence and order violations among paths during optimization. In response, we propose a binary-masked characterization method and an order-guided optimization strategy, tailored to address these challenges. We provide theoretical justification for the correctness of our approach, complemented by experimental evaluations on both synthetic and real-world datasets.
Paperid:1150
Authors:Rui Yao · Constantinos Daskalakis · Vardis Kandiros
Abstract:
We study the problem of learning the topology of a directed Gaussian Graphical Model under the equal-variance assumption, where the graph has $n$ nodes and maximum in-degree $d$. Prior work has established that $O(d \log n)$ samples are sufficient for this task. However, an important factor that is often overlooked in these analyses is the dependence on the condition number of the covariance matrix of the model. Indeed, all algorithms from prior work require a number of samples that grows polynomially with this condition number. In many cases this is unsatisfactory, since the condition number could grow polynomially with $n$, rendering these prior approaches impractical in high-dimensional settings. In this work, we provide an algorithm that recovers the underlying graph and prove that the number of samples required is independent of the condition number. Furthermore, we establish lower bounds that nearly match the upper bound up to a $d$-factor, thus providing an almost tight characterization of the true sample complexity of the problem. Moreover, under a further assumption that all the variances of the variables are bounded, we design a polynomial-time algorithm that recovers the underlying graph, at the cost of an additional polynomial dependence of the sample complexity on $d$. We complement our theoretical findings with simulations on synthetic datasets that confirm our predictions.
Paperid:1151
Authors:Mykhailo Uss · Ruslan Yermolenko · Oleksii Shashko · Olena Kolodiazhna · Ivan Safonov · Volodymyr Savin · Yoonjae Yeo · Seowon Ji · Jaeyun Jeong
Abstract:
Dense depth prediction deep neural networks (DNNs) have achieved impressive results for both monocular and binocular data, but they are limited by high computational complexity, restricting their use on low-end devices. For better on-device efficiency and hardware utilization, weights and activations of the DNN should be converted to low-bit precision. However, this precision is not sufficient for representing high dynamic range depth. In this paper, we aim to overcome this limitation and restore high-precision depth from low-bit precision predictions. To achieve this, we propose to represent high dynamic range depth as two low dynamic range components of a Hilbert curve, and to train the full precision DNN to directly predict the latter. For on-device deployment, we use standard quantization methods and add a post-processing step that reconstructs depth from the Hilbert curve components predicted in low-bit precision. Extensive experiments demonstrate that our method increases bit precision of predicted depth by up to three bits with little computational overhead. We also observe a positive side effect of quantization error reduction by up to five times. Our method enables effective and accurate depth prediction with DNN weights and activations quantized to eight-bit precision.
Paperid:1152
Authors:Yeyun Chen · Jiangming Shi
Abstract:
The progress in materials science and drug discovery is impeded by the limited availability of labeled data and the high costs of manual annotation, driving the need for efficient strategies to capture molecular representations and enable accurate predictions. Pretrained Graph Neural Networks have shown promise in capturing universal molecular representations, but adapting them to task-specific applications remains challenging. In this paper, we propose Multi-level Informed Prompt-Tuning (MIPT), a novel framework for effectively tailoring pretrained models to molecule-related tasks. MIPT utilizes a lightweight, multi-level prompt learning module to capture node-level and graph-level task-specific knowledge, ensuring adaptable and efficient tuning. Additionally, a noise penalty mechanism is introduced to address mismatches between pretrained representations and downstream tasks, reducing irrelevant or noisy information. Experimental results show that MIPT surpasses all baselines, aligning graph space and task space while achieving significant improvements in molecule-related tasks, demonstrating its scalability and versatility for molecular tasks.
Paperid:1153
Authors:Yunli Wang · Zhen Zhang · Zhiqiang Wang · Zixuan Yang · Yu Li · Jian Yang · Shiyang Wen · Peng Jiang · Kun Gai
Abstract:
Cascade Ranking is a prevalent architecture in large-scale top-k selection systems like recommendation and advertising platforms. Traditional training methods focus on single-stage optimization, neglecting interactions between stages. Recent advances such as RankFlow and FS-LTR have introduced interaction-aware training paradigms but still struggle to align training objectives across the entire cascade and ensure robust performance for models of varying capacities. To address these challenges, we propose LCRON, which introduces a novel surrogate loss function derived from the lower bound probability that ground truth items are selected by cascade ranking, ensuring alignment with the overall objective of the system. According to the properties of the derived bound, we further design an auxiliary loss for each stage to drive the reduction of this bound, leading to a more robust and effective top-k selection. LCRON enables end-to-end training of the entire cascade ranking system as a unified network. Experimental results demonstrate that LCRON achieves significant improvement over existing methods on public benchmarks and industrial applications, addressing key limitations in cascade ranking training and significantly enhancing system performance.
Paperid:1154
Authors:Mingquan Feng · Weixin Liao · Yixin Huang · Yifan Fu · Qifu Zheng · Junchi Yan
Abstract:
Neural operators have emerged as a promising approach for solving high-dimensional partial differential equations (PDEs). However, existing neural operators often have difficulty in dealing with constrained PDEs, where the solution must satisfy additional equality or inequality constraints beyond the governing equations. To close this gap, we propose a novel neural operator, Hyper Extended Adaptive PDHG (HEAP) for constrained high-dim PDEs, where the learned operator evolves in the parameter space of PDEs. We first show that the evolution operator learning can be formulated as a quadratic programming (QP) problem, then unroll the adaptive primal-dual hybrid gradient (APDHG) algorithm as the QP-solver into the neural operator architecture. This allows us to improve efficiency while retaining theoretical guarantees of the constrained optimization. Empirical results on a variety of high-dim PDEs show that HEAP outperforms the state-of-the-art neural operator model.
Paperid:1155
Authors:Ben Lonnqvist · Elsa Scialom · Abdulkadir Gokce · Zehra Merchant · Michael Herzog · Martin Schrimpf
Abstract:
Despite the tremendous success of deep learning in computer vision, models still fall behind humans in generalizing to new input distributions. Existing benchmarks do not investigate the specific failure points of models by analyzing performance under many controlled conditions. Our study systematically dissects where and why models struggle with contour integration -- a hallmark of human vision -- by designing an experiment that tests object recognition under various levels of object fragmentation. Humans (n=50) perform at high accuracy, even with few object contours present. This is in contrast to models, which exhibit substantially lower sensitivity to increasing object contours, with most of the over 1,000 models we tested barely performing above chance. Only at very large scales ($\sim5B$ training dataset size) do models begin to approach human performance. Importantly, humans exhibit an integration bias -- a preference towards recognizing objects made up of directional fragments over directionless fragments. We find that not only do models that share this property perform better at our task, but that this bias also increases with model training dataset size, and training models to exhibit contour integration leads to high shape bias. Taken together, our results suggest that contour integration is a hallmark of object vision that underlies object recognition performance, and may be a mechanism learned from data at scale.
Paperid:1156
Authors:Ajay Jaiswal · Yifan Wang · Lu Yin · Shiwei Liu · Runjin Chen · Jiawei Zhao · Ananth Grama · Yuandong Tian · Zhangyang “Atlas” Wang
Abstract:
The weight matrices of Large Language Models (LLMs) can often be expressed in low-rank format, with the potential to relax memory and compute resource requirements. Unlike previous works which pivot around developing novel matrix decomposition algorithms, in this work we focus on studying the emerging non-uniform low-rank properties across weight matrices in LLMs through the lens of stabilizing gradient subspace. \textit{Firstly,} we provide a theoretical framework to understand the stabilization of gradient subspaces through Hessian analysis. \textit{Secondly,} we empirically establish a consequential relationship between the gradient dynamics and low-rank expressiveness of weight matrices. Our findings reveal that different LLM components exhibit varying levels of converged low-rank structure, necessitating a non-uniform rank reduction across them to minimize performance drop due to compression. In view of that, we present \textit{Weight Low-Rank Projection} \textbf{(WeLore)} that unifies weight compression and memory-efficient fine-tuning as ONE, in a data-agnostic and one-shot way. Going beyond only as a compression technique, WeLore categorizes weight matrices into Low-rank Components (LRCs) and Non-Low-rank Components (N-LRCs) based on their ability to express themselves as low-rank. Our gradient dynamics perspective illustrates that \textit{LRCs tend to have better finetuning capabilities} and their standalone finetuning can closely mimic (sometimes outperform) the training loss trajectory and performance of full-finetuning with notable memory and compute footprint reduction. All codes and checkpoints will be released.
Paperid:1157
Authors:Mislav Balunovic · Jasper Dekoninck · Nikola Jovanović · Ivo Petrov · Martin Vechev
Abstract:
While Large Language Models (LLMs) demonstrate impressive performance in mathematics, existing math benchmarks come with significant limitations. Many focus on problems with fixed ground-truth answers, and are often saturated due to problem simplicity or the viability of guessing or memorization. Crucially, they capture only a narrow subset of relevant math problems. To address this research gap, we introduce MathConstruct, a new benchmark of 127 challenging problems sourced from various math competitions, which targets constructive proofs, a widely encountered problem type requiring the construction of mathematical objects with specific properties. These proofs are particularly suitable for LLM evaluation, as solution correctness can be easily verified. Our automated verifiers also enable MathConstruct to generate problem variations, used to evaluate robustness. State-of-the-art LLMs solve only 41\% of MathConstruct problems, highlighting its complexity and importance for LLM evaluation.
Paperid:1158
Authors:Zicheng Lin · Tian Liang · Jiahao Xu · Qiuzhi Liu · Xing Wang · Ruilin Luo · Chufan Shi · Siheng Li · Yujiu Yang · Zhaopeng Tu
Abstract:
Mathematical reasoning tasks pose significant challenges for large language models (LLMs) because they require precise logical deduction and sequence analysis. In this work, we introduce the concept of critical tokens: elements within reasoning trajectories that significantly influence incorrect outcomes. We present a novel framework for identifying these tokens through rollout sampling and demonstrate their substantial divergence from traditional error tokens. Through extensive experiments on datasets such as GSM8K and MATH500, we show that identifying and replacing critical tokens significantly improves model accuracy. We propose an efficient methodology for pinpointing these tokens in large-scale datasets using contrastive estimation and extend this framework to enhance model training processes with direct preference optimization (DPO). Experimental results on GSM8K and MATH500 benchmarks with the widely used models Llama-3 (8B and 70B) and Deepseek-math (7B) demonstrate the effectiveness of the proposed approach, cDPO. Our results underscore the potential of leveraging critical tokens to reduce errors in reasoning tasks, advancing the development of AI systems capable of robust logical deduction.
Paperid:1159
Authors:Haozhao Wang · Shengyu Wang · Jiaming Li · Hao Ren · Xingshuo Han · Wenchao Xu · Shangwei Guo · Tianwei Zhang · Ruixuan Li
Abstract:
Semi-supervised Federated Learning (SSFL) is a promising approach that allows clients to collaboratively train a global model in the absence of local data labels. The key step of SSFL is re-labeling, where each client adopts two types of available models, namely global and local models, to re-label the local data. While various techniques, such as using the global model or the average of the two models, have been proposed to conduct the re-labeling step, little literature delves deeply into the performance dominance and limitations of the two models. In this paper, we first theoretically and empirically demonstrate that the local model achieves higher re-labeling accuracy over local data while the global model can progressively improve the re-labeling performance by introducing the extra data knowledge of other clients. Based on these findings, we propose BSemiFL, which re-labels the local data through collaboration between the local and global models in a Bayesian manner. Specifically, to re-label any given local sample, BSemiFL first uses Bayesian inference to assess the closeness of the local/global model to the sample. Then, it applies a weighted combination of their pseudo labels, using the closeness as the weights. Theoretical analysis shows that the labeling error of our method is smaller than that of simply using the global model, the local model, or their simple average. Experimental results show that BSemiFL improves performance by up to $9.8\%$ compared to state-of-the-art methods.
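The closeness-weighted combination step described above can be sketched generically. The closeness scores and their normalization below are illustrative assumptions, not BSemiFL's exact Bayesian inference:

```python
import numpy as np

def combine_pseudo_labels(p_local, p_global, closeness_local, closeness_global):
    """Convex combination of local/global pseudo-label distributions.

    closeness_* are nonnegative scores for how "close" each model is
    judged to be to the sample; here they are simply normalized into
    convex weights (an assumption for illustration).
    """
    w_local = closeness_local / (closeness_local + closeness_global)
    w_global = 1.0 - w_local
    return w_local * np.asarray(p_local) + w_global * np.asarray(p_global)

# Toy example: the local model is judged closer for this sample,
# so its pseudo label dominates the combination.
p = combine_pseudo_labels([0.9, 0.1], [0.4, 0.6],
                          closeness_local=3.0, closeness_global=1.0)
```

The combined vector remains a valid probability distribution because the weights are convex.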
Paperid:1160
Authors:Euijin You · Hyang-Won Lee
Abstract:
Fast adversarial training (FAT) aims to enhance the robustness of models against adversarial attacks with reduced training time. However, FAT often suffers from compromised robustness due to insufficient exploration of the adversarial space. In this paper, we develop a loss function to mitigate the problem of degraded robustness under FAT. Specifically, we derive a quadratic upper bound (QUB) on the adversarial training (AT) loss function and propose to utilize the bound with existing FAT methods. Our experimental results show that applying the QUB loss to the existing methods yields significant improvement of robustness. Furthermore, using various metrics, we demonstrate that this improvement is likely to result from the smoothed loss landscape of the resulting model.
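A quadratic bound of this shape can in general be obtained from smoothness alone; the display below is an illustrative second-order surrogate under an assumed $K$-smoothness of the loss in the input, not necessarily the paper's exact QUB:

```latex
% If \ell(\cdot) is K-smooth in its input, then for a perturbation \delta:
\ell(x+\delta) \;\le\; \ell(x) + \nabla_x \ell(x)^\top \delta + \tfrac{K}{2}\,\lVert \delta \rVert_2^2
% Minimizing the right-hand side upper-bounds the adversarial loss
% without extra inner-maximization steps.
```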
Paperid:1161
Authors:Jinyang Li · Nan Huo · Yan Gao · Jiayi Shi · Yingxiu Zhao · Qu Ge · Bowen Qin · Yurong Wu · Xiaodong Li · Chenhao Ma · Jian-Guang Lou · Reynold Cheng
Abstract:
Conversational Tabular Data Analysis, a collaboration between humans and machines, enables real-time data exploration for informed decision-making. The challenges and costs of collecting realistic conversational logs for tabular data analysis hinder comprehensive quantitative evaluation of Large Language Models (LLMs) in this task. To mitigate this issue, we introduce CoTA, a new benchmark to evaluate LLMs on conversational tabular data analysis. CoTA contains 1013 conversations, covering 4 practical scenarios: Normal, Action, Private, and Private Action. Notably, CoTA is constructed by an economical multi-agent environment, Decision Company, with little human effort. This environment ensures the efficiency and scalability of generating new conversational data. Our comprehensive study, conducted by data analysis experts, demonstrates that Decision Company is capable of producing diverse and high-quality data, laying the groundwork for efficient data annotation. We evaluate popular and advanced LLMs on CoTA, which highlights the challenges of conversational tabular data analysis. Furthermore, we propose Adaptive Conversation Reflection (ACR), a self-generated reflection strategy that guides LLMs to learn from successful histories. Experiments demonstrate that ACR can evolve LLMs into effective conversational data analysis agents, achieving a relative performance improvement of up to 35.14%.
Paperid:1162
Authors:Chris Hays · Manish Raghavan
Abstract:
Researchers and practitioners often wish to conduct causal inference in settings where units interfere via markets and recommendation systems. In these settings, units are affected by certain shared state variables, like prices, algorithmic recommendations or social signals. We formalize this structure, calling it shared-state interference, and argue that our formulation captures many relevant applied settings. Our key modeling assumption is that individuals' potential outcomes are independent conditional on the shared state. We then prove extensions of double machine learning (DML) theorems providing conditions for achieving efficient inference under shared-state interference. We also instantiate our general theorem in several models of interest where it is possible to efficiently estimate the average direct effect (ADE) or global average treatment effect (GATE).
Paperid:1163
Authors:Fei Long · Xiaoou Li · jiaming Lv · Yang Haoyuan · Xianjun Cheng · Peihua Li
Abstract:
Bridging contrastive language-image pre-training (CLIP) to video action recognition has recently attracted significant interest. Human actions are inherently rich in spatial and temporal contexts, involving dynamic interactions among people, objects, and the environment. Accurately recognizing actions requires effectively capturing these fine-grained elements in videos and modeling their relationships with language. However, most existing methods rely on cosine similarity between global tokens for video-language alignment, which has limited ability to model complex relations while overlooking local tokens that encode fine-grained spatio-temporal cues. To address these downsides, we propose a novel framework called BDC-CLIP, whose key idea is Brownian Distance Covariance (BDC) for aligning video and language. BDC-CLIP leverages all visual and textual tokens for modeling both linear and non-linear relations in multimodal embedding space, thereby capturing rich contexts in space, time and language. Our BDC-CLIP achieves state-of-the-art performance across zero-shot, few-shot, base-to-novel, and fully-supervised action recognition, demonstrating its effectiveness and versatility.
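The empirical (squared) distance covariance underlying BDC has a standard closed form. The sketch below computes it for paired sample embeddings and is a generic illustration of the statistic, not the BDC-CLIP alignment head itself:

```python
import numpy as np

def bdc(x, y):
    """Empirical Brownian distance covariance between paired samples.

    x: (n, d1), y: (n, d2). Double-centers the pairwise Euclidean
    distance matrices and averages their elementwise product. The
    statistic captures non-linear as well as linear dependence and is
    zero (in the population limit) iff x and y are independent.
    """
    def centered(z):
        z = np.asarray(z, dtype=float)
        d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
        # Subtract row means, column means; add back the grand mean.
        return d - d.mean(axis=0) - d.mean(axis=1, keepdims=True) + d.mean()
    return (centered(x) * centered(y)).mean()
```

For example, `bdc(v, v)` is strictly positive for any non-degenerate sample, while a constant second argument yields exactly zero (its distance matrix is all zeros).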
Paperid:1164
Authors:Haojun Chen · Minghao Liu · Xiaojian Ma · Zailin Ma · Huimin Wu · Chengdong Ma · Yuanpei Chen · Yifan Zhong · Mingzhi Wang · Qing Li · Yaodong Yang
Abstract:
Diffusion policies are widely adopted in complex visuomotor tasks for their ability to capture multimodal action distributions. However, the multiple sampling steps required for action generation significantly harm real-time inference efficiency, which limits their applicability in long-horizon tasks and real-time decision-making scenarios. Existing acceleration techniques reduce sampling steps by approximating the original denoising process but inevitably introduce unacceptable performance loss. Here we propose Falcon, which mitigates this trade-off and achieves further acceleration. The core insight is that visuomotor tasks exhibit sequential dependencies between actions at consecutive time steps. Falcon leverages this property to avoid denoising from a standard normal distribution at each decision step. Instead, it starts denoising from partially denoised actions derived from a history buffer, significantly reducing the number of denoising steps, while incorporating current observations to achieve performance-preserving acceleration of action generation. Importantly, Falcon is a training-free algorithm that can be applied as a plug-in to further improve decision efficiency on top of existing acceleration techniques. We validate Falcon in 46 simulated environments, demonstrating a 2-7x speedup with negligible performance degradation, offering a promising direction for efficient visuomotor policy design.
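The buffer-initialized start of the reverse process can be sketched in DDPM-style notation. The schedule handling and names below are illustrative assumptions, not Falcon's implementation:

```python
import numpy as np

def partial_denoise_start(prev_action, t_start, alphas_cumprod, rng):
    """Re-noise the previous action to diffusion step t_start, so the
    reverse process starts there rather than from N(0, I) at the final
    step. With t_start < T, fewer denoising steps run per decision.
    """
    a_bar = alphas_cumprod[t_start]  # cumulative product of alphas at t_start
    noise = rng.standard_normal(np.shape(prev_action))
    # Standard forward-process re-noising of a clean sample.
    return np.sqrt(a_bar) * np.asarray(prev_action) + np.sqrt(1.0 - a_bar) * noise
```

At `a_bar = 1` this returns the previous action unchanged; at `a_bar = 0` it degenerates to pure noise, recovering the usual initialization.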
Paperid:1165
Authors:Milad Khademi Nori · Il-Min Kim · Guanghui Wang
Abstract:
In class-incremental learning (CIL), effective incremental learning strategies are essential to mitigate task confusion and catastrophic forgetting, especially as the number of tasks $t$ increases. Current exemplar replay strategies impose $\mathcal{O}(t)$ memory/compute complexities. We propose an autoencoder-based hybrid replay (AHR) strategy that leverages our new hybrid autoencoder (HAE) to function as a compressor to alleviate the requirement for large memory, achieving $\mathcal{O}(0.1 t)$ in the worst case with a computing complexity of $\mathcal{O}(t)$ while accomplishing state-of-the-art performance. The decoder later recovers the exemplar data stored in the latent space, rather than in raw format. Additionally, HAE is designed for both discriminative and generative modeling, enabling classification and replay capabilities, respectively. HAE adopts the charged particle system energy minimization equations and repulsive force algorithm for the incremental embedding and distribution of new class centroids in its latent space. Our results demonstrate that AHR consistently outperforms recent baselines across multiple benchmarks while operating with the same memory/compute budgets. The source code is included in the supplementary material and will be open-sourced upon publication.
Paperid:1166
Authors:Luis Scoccola · Uzu Lim · Heather Harrington
Abstract:
Classical unsupervised learning methods like clustering and linear dimensionality reduction parametrize large-scale geometry when it is discrete or linear, while more modern methods from manifold learning find low-dimensional representations or infer local geometry by constructing a graph on the input data. More recently, topological data analysis popularized the use of simplicial complexes to represent data topology with two main methodologies: topological inference with geometric complexes and large-scale topology representation with Mapper graphs -- central to these is the nerve construction from topology, which builds a simplicial complex given any cover of a space by subsets. While successful, these have limitations: geometric complexes scale poorly with data size, and Mapper graphs can be hard to tune and only contain low dimensional information. In this paper, we propose to study the problem of learning covers in its own right, and from the perspective of optimization. We describe a method to learn topologically-faithful covers of geometric datasets, and show that the simplicial complexes thus obtained can outperform standard topological inference approaches in terms of size, and Mapper-type algorithms in terms of representation of large-scale topology.
Paperid:1167
Authors:Tom Labiausse · Laurent Mazaré · Edouard Grave · Alexandre Défossez · Neil Zeghidour
Abstract:
We introduce Hibiki, a decoder-only model for simultaneous speech translation. Hibiki leverages a multistream language model to synchronously process source and target speech, and jointly produces text and audio tokens to perform speech-to-text and speech-to-speech translation. We furthermore address the fundamental challenge of simultaneous interpretation, which unlike its consecutive counterpart --where one waits for the end of the source utterance to start translating-- adapts its flow to accumulate just enough context to produce a correct translation in real-time, chunk by chunk. To do so, we introduce a weakly-supervised method that leverages the perplexity of an off-the-shelf text translation system to identify optimal delays on a per-word basis and create aligned synthetic data. After supervised training, Hibiki performs adaptive, simultaneous speech translation with vanilla temperature sampling. On a French-English simultaneous speech translation task, Hibiki demonstrates state-of-the-art performance in translation quality, speaker fidelity and naturalness. Moreover, the simplicity of its inference process makes it compatible with batched translation and even real-time on-device deployment. We provide examples on https://hibiki-s2st.github.io.
Paperid:1168
Authors:Claas Voelcker · Anastasiia Pedan · Arash Ahmadian · Romina Abachi · Igor Gilitschenski · Amir-massoud Farahmand
Abstract:
The idea of value-aware model learning, that models should produce accurate value estimates, has gained prominence in model-based reinforcement learning. The MuZero loss, which penalizes a model's value function prediction compared to the ground-truth value function, has been utilized in several prominent empirical works in the literature. However, theoretical investigation into its strengths and weaknesses is limited. In this paper, we analyze the family of value-aware model learning losses, which includes the popular MuZero loss. We show that these losses, as normally used, are uncalibrated surrogate losses, which means that they do not always recover the correct model and value function. Building on this insight, we propose corrections to solve this issue. Furthermore, we investigate the interplay between loss calibration, latent model architectures, and auxiliary losses that are commonly employed when training MuZero-style agents. We show that while deterministic models can be sufficient to predict accurate values, learning calibrated stochastic models is still advantageous.
Paperid:1169
Authors:Junlong Li · Daya Guo · Dejian Yang · Runxin Xu · Yu Wu · Junxian He
Abstract:
Reasoning is a fundamental capability of Large Language Models. While prior research predominantly focuses on enhancing narrow skills like math or code generation, improving performance on many other reasoning tasks remains challenging due to sparse and fragmented training data. To address this issue, we propose CodeI/O, a novel approach that systematically condenses diverse reasoning patterns inherently embedded in contextually grounded code, through transforming the original code into a code input-output prediction format. By training models to predict inputs/outputs given code and test cases entirely in natural language as Chain-of-Thought (CoT) rationales, we expose them to universal reasoning primitives—like logic flow planning, state-space searching, decision tree traversal, and modular decomposition—while decoupling structured reasoning from code-specific syntax and preserving procedural rigor. Experimental results demonstrate CodeI/O leads to consistent improvements across symbolic, scientific, logic, math & numerical, and commonsense reasoning tasks. By matching the existing ground-truth outputs or re-executing the code with predicted inputs, we can verify each prediction and further enhance the CoTs through multi-turn revision, resulting in CodeI/O++ and achieving higher performance. Our data and models will be publicly available.
Paperid:1170
Authors:Yuguang Yan · Zongyu Li · Haolin Yang · Zeqin Yang · Hao Zhou · Ruichu Cai · Zhifeng Hao
Abstract:
Causal inference seeks to estimate the effect given a treatment such as a medicine or the dosage of a medication. To reduce the confounding bias caused by the non-randomized treatment assignment, most existing methods reduce the shift between subpopulations receiving different treatments. However, these methods split limited training samples into smaller groups, which cuts down the number of samples in each group, while precise distribution estimation and alignment highly rely on a sufficient number of training samples. In this paper, we propose a distribution alignment paradigm without data splitting, which can be naturally applied in the settings of binary and continuous treatments. To this end, we characterize the confounding bias by considering different probability measures of the same set including all the training samples, and exploit the optimal transport theory to analyze the confounding bias and outcome estimation error. Based on this, we propose to learn balanced representations by reducing the bias between the marginal distribution and the conditional distribution of a treatment. As a result, data reduction caused by splitting is avoided, and the outcome prediction model trained on one treatment group can be generalized to the entire population. The experiments on both binary and continuous treatment settings demonstrate the effectiveness of our method.
Paperid:1171
Authors:Minghao Xu · Jiaze Song · Wu Keming · Xiangxin Zhou · Bin Cui · Wentao Zhang
Abstract:
Understanding the various properties of glycans with machine learning has shown some preliminary promise. However, previous methods mainly focused on modeling the backbone structure of glycans as graphs of monosaccharides (i.e., sugar units), while neglecting the atomic structures underlying each monosaccharide, which are actually important indicators of glycan properties. In this work, we fill this gap by introducing the GlycanAA model for All-Atom-wise Glycan modeling. GlycanAA models a glycan as a heterogeneous graph with monosaccharide nodes representing its global backbone structure and atom nodes representing its local atomic-level structures. Based on such a graph, GlycanAA performs hierarchical message passing to capture interactions hierarchically, from local atomic-level interactions to global monosaccharide-level interactions. To further enhance the model capability, we pre-train GlycanAA on a high-quality unlabeled glycan dataset in a self-supervised way, deriving the PreGlycanAA model. Specifically, we design a multi-scale mask prediction algorithm to endow the model with knowledge about different levels of dependencies in a glycan. Extensive benchmark results show the superiority of GlycanAA over existing glycan encoders and verify the further improvements achieved by PreGlycanAA.
Paperid:1172
Authors:Yang Xu · Wenbin Lu · Rui Song
Abstract:
Interference, a key concept in causal inference, extends the reward modeling process by accounting for the impact of one unit's actions on the rewards of others. In contextual bandit (CB) settings where multiple units are present in the same round, interference can significantly affect the estimation of expected rewards for different arms, thereby influencing the decision-making process. Although some prior work has explored multi-agent and adversarial bandits in interference-aware settings, how to model interference in CB remains significantly underexplored. In this paper, we introduce a systematic framework to address interference in Linear CB (LinCB), bridging the gap between causal inference and online decision-making. We propose a series of algorithms that explicitly quantify the interference effect in the reward modeling process and provide comprehensive theoretical guarantees, including sublinear regret bounds, finite sample upper bounds, and asymptotic properties. The effectiveness of our approach is demonstrated through simulations and synthetic data generated based on MovieLens data.
Paperid:1173
Authors:Prakash Palanivelu Rajmohan · Fred Roosta
Abstract:
While norm-based and leverage-score-based methods have been extensively studied for identifying "important" data points in linear models, analogous tools for nonlinear models remain significantly underdeveloped. By introducing the concept of the adjoint operator of a nonlinear map, we address this gap and generalize norm-based and leverage-score-based importance sampling to nonlinear settings. We demonstrate that sampling based on these generalized notions of norm and leverage scores provides approximation guarantees for the underlying nonlinear mapping, similar to linear subspace embeddings. As direct applications, these nonlinear scores not only reduce the computational complexity of training nonlinear models by enabling efficient sampling over large datasets but also offer a novel mechanism for model explainability and outlier detection. Our contributions are supported by both theoretical analyses and experimental results across a variety of supervised learning scenarios.
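For reference, the classical linear-model leverage scores that this line of work generalizes are the diagonal of the hat matrix; a standard, stable way to compute them uses a thin QR factorization:

```python
import numpy as np

def leverage_scores(X):
    """Statistical leverage scores of the rows of X (full column rank):
    diag(X (X^T X)^{-1} X^T), computed as squared row norms of the
    orthonormal factor Q from a thin QR of X.
    """
    Q, _ = np.linalg.qr(np.asarray(X, dtype=float))
    return (Q ** 2).sum(axis=1)
```

Each score lies in [0, 1] and the scores sum to the rank of X, which is what makes them usable (after normalization) as importance-sampling probabilities.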
Paperid:1174
Authors:Dianwen Ng · Kun Zhou · Yi-Wen Chao · Zhiwei Xiong · Bin Ma · EngSiong Chng
Abstract:
Achieving high-fidelity audio compression while preserving perceptual quality across diverse audio types remains a significant challenge in Neural Audio Coding (NAC). This paper introduces MUFFIN, a fully convolutional NAC framework that leverages psychoacoustically guided multi-band frequency reconstruction. Central to MUFFIN is the Multi-Band Spectral Residual Vector Quantization (MBS-RVQ) mechanism, which quantizes latent speech across different frequency bands. This approach optimizes bitrate allocation and enhances fidelity based on psychoacoustic studies, achieving efficient compression with unique perceptual features that separate content from speaker attributes through distinct codebooks. MUFFIN integrates a transformer-inspired convolutional architecture with proposed modified snake activation functions to capture fine frequency details with greater precision. Extensive evaluations on diverse datasets (LibriTTS, IEMOCAP, GTZAN, BBC) demonstrate MUFFIN’s ability to consistently surpass existing performance in audio reconstruction across various domains. Notably, a high-compression variant achieves an impressive SOTA 12.5 kHz rate while preserving reconstruction quality. Furthermore, MUFFIN excels in downstream generative tasks, demonstrating its potential as a robust token representation for integration with large language models. These results establish MUFFIN as a groundbreaking advancement in NAC and as the first neural psychoacoustic coding system. Speech demos are available and codes \textbf{will be released}.\footnote{\url{https://demos46.github.io/muffin/}}.
Paperid:1175
Authors:Eric Elmoznino · Thomas Jiralerspong · Yoshua Bengio · Guillaume Lajoie
Abstract:
Compositionality is believed to be fundamental to intelligence. In humans, it underlies the structure of thought and language. In AI, it enables a powerful form of out-of-distribution generalization, in which a model systematically adapts to novel combinations of known concepts. However, while we have strong intuitions about what compositionality is, there is currently no formal definition for it. Here, we propose such a definition called representational compositionality that is conceptually simple, quantitative, and grounded in algorithmic information theory. Intuitively, representational compositionality states that a compositional representation is both expressive and describable as a simple function of parts. We validate our definition on both real and synthetic data, and show how it unifies disparate intuitions from across the literature in both AI and cognitive science. Our definition has the potential to inspire the design of novel, theoretically-driven models that better capture the mechanisms of compositional thought.
Paperid:1176
Authors:Mohammadsadegh Khorasani · Saber Salehkaleybar · Negar Kiyavash · Matthias Grossglauser
Abstract:
Hierarchical reinforcement learning (HRL) improves the efficiency of long-horizon reinforcement-learning tasks with sparse rewards by decomposing the task into a hierarchy of subgoals. The main challenge of HRL is efficient discovery of the hierarchical structure among subgoals and utilizing this structure to achieve the final goal. We address this challenge by modeling the subgoal structure as a causal graph and propose a causal discovery algorithm to learn it. Additionally, rather than intervening on the subgoals at random during exploration, we harness the discovered causal model to prioritize subgoal interventions based on their importance in attaining the final goal. These targeted interventions result in a significantly more efficient policy in terms of the training cost. Unlike previous work on causal HRL, which lacked theoretical analysis, we provide a formal analysis of the problem. Specifically, for tree structures and for a variant of Erdős-Rényi random graphs, our approach results in remarkable improvements. Our experimental results on HRL tasks also illustrate that our proposed framework outperforms existing work in terms of training cost.
Paperid:1177
Authors:Hanzhang Wang · Qingyuan Ma
Abstract:
Typographic attacks are often attributed to multimodal pretrained visual models' ability to fuse textual semantics into visual representations, but how and where such interference occurs remains unclear. We investigate whether these models encode textual semantics through representation complexity, and identify the mechanisms by which text disrupts visual understanding. To decouple orthography from semantics, we introduce the ToT dataset, containing minimal pairs of words that either share semantics with distinct visual forms (synonyms) or match visual forms with conflicting semantics (paronyms). By analyzing layer-wise Intrinsic Dimension (ID), we reveal that early layers exhibit competing dynamics between orthographic features and semantics, while later layers improve semantic accuracy primarily through orthographic disambiguation. Crucially, semantically driven representations emerge only in the final block, challenging the assumption of progressive semantic understanding. These findings reveal how multimodal models simulate semantic representations through visual patterns, prompting a deeper reconsideration of the gap between visual perception and semantic understanding.
Paperid:1178
Authors:Fuying Wang · Jiacheng Xu · Lequan Yu
Abstract:
Electrocardiograms (ECGs) play a vital role in monitoring cardiac health and diagnosing heart diseases. However, traditional deep learning approaches for ECG analysis rely heavily on large-scale manual annotations, which are both time-consuming and resource-intensive to obtain. To overcome this limitation, self-supervised learning (SSL) has emerged as a promising alternative, enabling the extraction of robust ECG representations that can be efficiently transferred to various downstream tasks. While previous studies have explored SSL for ECG pretraining and multi-modal ECG-language alignment, they often fail to capture the multi-scale nature of ECG signals. As a result, these methods struggle to learn generalized representations due to their inability to model the hierarchical structure of ECG data. To address this gap, we introduce MELP, a novel Multi-scale ECG-Language Pretraining model that fully leverages hierarchical supervision from ECG-text pairs. MELP first pretrains a cardiology-specific language model to enhance its understanding of clinical text. It then applies three levels of cross-modal supervision—at the token, beat, and rhythm levels—to align ECG signals with textual reports, capturing structured information across different time scales. We evaluate MELP on three public ECG datasets across multiple tasks, including zero-shot ECG classification, linear probing, and transfer learning. Experimental results demonstrate that MELP outperforms existing SSL methods, underscoring its effectiveness and adaptability across diverse clinical applications. Our code is anonymously available at https://anonymous.4open.science/r/MM-ECG-FM-0A9B.
Paperid:1179
Authors:Anna Hedström · Salim I. Amoukou · Tom Bewley · Saumitra Mishra · Manuela Veloso
Abstract:
We propose mechanistic error reduction with abstention (MERA), a principled steering framework for error mitigation in language models (LMs) that performs selective and adaptive interventions guided by error estimators. Existing steering methods construct steering vectors heuristically, such as by contrasting activations from examples with and without a target property (e.g., toxic vs. non-toxic outputs). Adding these vectors to the model’s activations has shown promise for amplifying or suppressing specific behaviours. However, they typically rely on fixed intervention strengths, often leading to understeering or oversteering, without principles to balance safety and effective model corrections. MERA addresses these limitations by (i) optimising the intervention direction and (ii) calibrating when and how much to steer using user-defined constraints. Experiments across diverse datasets and LM families demonstrate notable improvements in task accuracy, establishing MERA as a general-purpose, safe and efficient method for mechanistic steering.
Paperid:1180
Authors:Haoran Zhang · Aparna Balagopalan · Nassim Oufattole · Hyewon Jeong · Yan Wu · Jiacheng Zhu · Marzyeh Ghassemi
Abstract:
Large repositories of image-caption pairs are essential for the development of vision-language models. However, these datasets are often extracted from noisy data scraped from the web, and contain many mislabeled instances. In order to improve the reliability of downstream models, it is important to identify and filter images with incorrect captions. However, beyond filtering based on image-caption embedding similarity, no prior works have proposed other methods to filter noisy multimodal data, or concretely assessed the impact of noisy captioning data on downstream training. In this work, we propose, theoretically justify, and empirically validate LEMoN, a method to automatically identify label errors in image-caption datasets. Our method leverages the multimodal neighborhood of image-caption pairs in the latent space of contrastively pretrained multimodal models to automatically identify label errors. Through empirical evaluations across eight datasets and ten baselines, we find that LEMoN outperforms the baselines by over 3% in label error detection, and that training on datasets filtered using our method improves downstream captioning performance by 2 BLEU points.
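A minimal neighborhood-based mismatch score in this spirit (an illustrative heuristic, not LEMoN's exact scoring rule) combines direct image-caption similarity with agreement against the captions of each image's nearest image-neighbors:

```python
import numpy as np

def label_error_scores(img_emb, txt_emb, k=5):
    """Score image-caption pairs; lower scores suggest a mismatched
    caption. Combines (i) image<->own-caption cosine similarity with
    (ii) mean similarity between the caption and captions of the
    image's k nearest image-neighbors.
    """
    def normalize(z):
        z = np.asarray(z, dtype=float)
        return z / np.linalg.norm(z, axis=1, keepdims=True)
    I, T = normalize(img_emb), normalize(txt_emb)
    direct = (I * T).sum(axis=1)                 # image<->own caption
    img_sim = I @ I.T
    np.fill_diagonal(img_sim, -np.inf)           # exclude self-matches
    nbrs = np.argsort(-img_sim, axis=1)[:, :k]   # k nearest images
    nbr_agree = np.take_along_axis(T @ T.T, nbrs, axis=1).mean(axis=1)
    return direct + nbr_agree
```

On a toy set where one image carries the wrong caption, that pair receives the lowest score, since its caption disagrees both with its own image and with its neighbors' captions.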
Paperid:1181
Authors:Yingzhen Yang
Abstract:
Sharp generalization bounds for neural networks trained by gradient descent (GD) are of central interest in statistical learning theory and deep learning. In this paper, we consider nonparametric regression by an over-parameterized two-layer NN trained by GD. We show that, if the neural network is trained by GD with early stopping, then the trained network renders a sharp rate of the nonparametric regression risk of $\mathcal{O}(\epsilon_n^2)$, which is the same rate as that for classical kernel regression trained by GD with early stopping, where $\epsilon_n$ is the critical population rate of the Neural Tangent Kernel (NTK) associated with the network and $n$ is the size of the training data. It is remarked that our result does not require distributional assumptions on the covariate as long as the covariate lies on the unit sphere, in strong contrast with many existing results which rely on specific distributions such as the spherical uniform data distribution or distributions satisfying certain restrictive conditions. As a special case of our general result, when the eigenvalues of the associated NTK decay at a rate of $\lambda_j \asymp j^{-\frac{d}{d-1}}$ for $j \ge 1$, which happens under certain distributional assumptions such as the training features following the spherical uniform distribution, we immediately obtain the minimax optimal rate of $\mathcal{O}(n^{-\frac{d}{2d-1}})$, which is the major result of several existing works in this direction. The neural network width in our general result is lower bounded by a function of only $d$ and $\epsilon_n$, and such width does not depend on the minimum eigenvalue of the empirical NTK matrix, whose lower bound usually requires additional assumptions on the training data. Our results are built upon two significant technical results which are of independent interest. First, uniform convergence to the NTK is established during the training process by GD, so that we can have a nice decomposition of the neural network function at any step of GD into a function in the Reproducing Kernel Hilbert Space associated with the NTK and an error function with a small $L^{\infty}$-norm. Second, local Rademacher complexity is employed to tightly bound the Rademacher complexity of the function class comprising all the possible neural network functions obtained by GD. Our result formally fills the gap between training a classical kernel regression model and training an over-parameterized but finite-width neural network by GD for nonparametric regression without distributional assumptions on the spherical covariate.
Paperid:1182
Authors:Haidong Kang
Abstract:
Neural Architecture Search (NAS) has recently outperformed hand-designed networks in various artificial intelligence areas. However, previous works only target a pre-defined task. For a new task in few-shot learning (FSL) scenarios, the architecture is either searched from scratch, which is neither efficient nor flexible, or borrowed from architectures obtained on other tasks, which may lead to sub-optimal results. Can we select the best neural architectures without involving any training and eliminate a significant portion of the search cost for new tasks in FSL? In this work, we provide an affirmative answer by proposing a novel information bottleneck (IB) theory driven \textit{Few-shot Neural Architecture Search} (dubbed IBFS) framework to address this issue. We first derive that the global convergence of Model-Agnostic Meta-Learning (MAML) can be guaranteed by only considering the first-order loss landscape. Moreover, motivated by the observation that IB provides a unified view toward understanding machine learning models, we propose a novel Zero-Cost method tailored for FSL to rank and select architectures based on their \textit{expressivity} obtained by IB mechanisms. Extensive experiments show that IBFS achieves state-of-the-art performance in FSL without training, which demonstrates the effectiveness of IBFS.
Paperid:1183
Authors:Mingzhen He · Ruikai Yang · Hanling Tian · Youmei Qiu · Xiaolin Huang
Abstract:
Graph Transformers (GTs) have emerged as a promising approach for graph representation learning. Despite their successes, the quadratic complexity of GTs limits scalability on large graphs due to their pairwise computations. To fundamentally reduce the computational burden of GTs, we propose a primal-dual framework that interprets the self-attention mechanism on graphs as a dual representation. Based on this framework, we develop Primphormer, an efficient GT that leverages a primal representation with linear complexity. Theoretical analysis reveals that Primphormer serves as a universal approximator for functions on both sequences and graphs, while also retaining its expressive power for distinguishing non-isomorphic graphs. Extensive experiments on various graph benchmarks demonstrate that Primphormer achieves competitive empirical results while maintaining more favorable memory and computational costs.
Paperid:1184
Authors:Jesse He · Helen Jenne · Herman Chau · Davis Brown · Mark Raugas · Sara Billey · Henry Kvinge
Abstract:
Machine learning is becoming an increasingly valuable tool in mathematics, enabling one to identify subtle patterns across collections of examples so vast that they would be impossible for a single researcher to feasibly review and analyze. In this work, we use graph neural networks to investigate quiver mutation---an operation that transforms one quiver (or directed multigraph) into another---which is central to the theory of cluster algebras, with deep connections to geometry, topology, and physics. In the study of cluster algebras, the question of mutation equivalence is of fundamental concern: given two quivers, can one efficiently determine if one quiver can be transformed into the other through a sequence of mutations? In this paper, we use graph neural networks and AI explainability techniques to independently discover mutation equivalence criteria for quivers of type $\tilde{D}$. Along the way, we also show that even without explicit training to do so, our model captures structure within its hidden representation that allows us to reconstruct known criteria from type $D$, adding to the growing evidence that modern machine learning models are capable of learning abstract and general rules from mathematical data.
Paperid:1185
Authors:Lea Bogensperger · Dominik Narnhofer · Ahmed Allam · Konrad Schindler · Michael Krauthammer
Abstract:
The goal of protein fitness optimization is to discover new protein variants with enhanced fitness for a given use. The vast search space and the sparsely populated fitness landscape, along with the discrete nature of protein sequences, pose significant challenges when trying to determine the gradient towards configurations with higher fitness. We introduce Variational Latent Generative Protein Optimization (VLGPO), a variational perspective on fitness optimization. Our method embeds protein sequences in a continuous latent space to enable efficient sampling from the fitness distribution, and combines a (learned) flow matching prior over sequence mutations with a fitness predictor to guide optimization towards sequences with high fitness. VLGPO achieves state-of-the-art results on two different protein benchmarks of varying complexity. Moreover, the variational design with explicit prior and likelihood functions offers a flexible plug-and-play framework that can be easily customized to suit various protein design tasks.
Paperid:1186
Authors:Shanda Li · Shinjae Yoo · Yiming Yang
Abstract:
Fourier Neural Operators (FNOs) offer a principled approach for solving complex partial differential equations (PDEs). However, scaling them to handle more intricate PDEs requires increasing the number of Fourier modes, which significantly expands the number of model parameters and makes hyperparameter tuning computationally impractical. To address this, we introduce $\mu$Transfer-FNO, a zero-shot hyperparameter transfer technique that enables optimal configurations, tuned on smaller FNOs, to be directly applied to billion-parameter FNOs without additional tuning. Building on the Maximum Update Parametrization ($\mu$P) framework, we mathematically derive a parametrization scheme that facilitates the transfer of optimal hyperparameters across models with different numbers of Fourier modes in FNOs, which is validated through extensive experiments on various PDEs. Our empirical study shows that $\mu$Transfer-FNO reduces the computational cost of tuning hyperparameters on large FNOs while maintaining or improving accuracy.
Paperid:1187
Authors:Chuanhui Liu · Xiao Wang
Abstract:
This paper provides a comprehensive analysis of variational inference in latent variable models for survival analysis, emphasizing the distinctive challenges associated with applying variational methods to survival data. We identify a critical weakness in the existing methodology, demonstrating how a poorly designed variational distribution may hinder the objective of survival analysis tasks: modeling time-to-event distributions. We prove that the optimal variational distribution, which perfectly bounds the log-likelihood, may depend on the censoring mechanism. To address this issue, we propose censor-dependent variational inference (CDVI), tailored for latent variable models in survival analysis. More practically, we introduce CD-CVAE, a V-structure Variational Autoencoder (VAE) designed for the scalable implementation of CDVI. Further discussion extends some existing theories and training techniques to survival analysis. Extensive experiments validate our analysis and demonstrate significant improvements in the estimation of individual survival distributions.
Paperid:1188
Authors:Jason Qin · Hans-Hermann Wessels · Carlos Fernandez-Granda · Yuhan Hao
Abstract:
Combinatorial CRISPR screening enables large-scale identification of synergistic gene pairs for combination therapies, but exhaustive experimentation is infeasible. We introduce NAIAD (https://anonymous.4open.science/r/NAIAD-4CA6/), an active learning framework that efficiently discovers optimal gene pairs by leveraging single-gene perturbation effects and adaptive gene embeddings that scale with the training data size, mitigating overfitting in small-sample learning while capturing complex gene interactions as more data is collected. Evaluated on four CRISPR datasets with over 350,000 interactions, NAIAD trained on small datasets outperforms existing models by up to 40\%. Its recommendation system prioritizes gene pairs with the maximum predicted effects, accelerating discovery with fewer experiments. We also extend NAIAD to optimal drug combination identification among 2,000 candidates. Overall, NAIAD enhances combinatorial perturbation design and drives advances in genomics research and therapeutic development in combination therapy.
Paperid:1189
Authors:Jiahui Zhu · Kihyun Yu · Dabeen Lee · Xin Liu · Honghao Wei
Abstract:
Online safe reinforcement learning (RL) plays a key role in dynamic environments, with applications in autonomous driving, robotics, and cybersecurity. The objective is to learn optimal policies that maximize rewards while satisfying safety constraints modeled by constrained Markov decision processes (CMDPs). Existing methods achieve sublinear regret under stochastic constraints but often fail in adversarial settings, where constraints are unknown, time-varying, and potentially adversarially designed. In this paper, we propose the Optimistic Mirror Descent Primal-Dual (OMDPD) algorithm, the first to address online CMDPs with anytime adversarial constraints. OMDPD achieves optimal regret $\tilde{\mathcal{O}}(\sqrt{K})$ and strong constraint violation $\tilde{\mathcal{O}}(\sqrt{K})$ without relying on Slater’s condition or the existence of a strictly known safe policy. We further show that access to accurate estimates of rewards and transitions can improve these bounds. Our results offer practical guarantees for safe decision-making in adversarial environments.
Paperid:1190
Authors:Laixi Shi · Jingchu Gai · Eric Mazumdar · Yuejie Chi · Adam Wierman
Abstract:
Standard multiagent reinforcement learning (MARL) algorithms are vulnerable to sim-to-real gaps. To address this, distributionally robust Markov games (RMGs) have been proposed to enhance robustness in MARL by optimizing the worst-case performance when game dynamics shift within a prescribed uncertainty set. RMGs remain under-explored, from reasonable problem formulation to the development of sample-efficient algorithms. Two notorious and open challenges are the formulation of the uncertainty set and whether the corresponding RMGs can overcome the curse of multiagency, where the sample complexity scales exponentially with the number of agents. In this work, we propose a natural class of RMGs inspired by behavioral economics, where each agent's uncertainty set is shaped by both the environment and the integrated behavior of other agents. We first establish the well-posedness of this class of RMGs by proving the existence of game-theoretic solutions such as robust Nash equilibria and coarse correlated equilibria (CCE). Assuming access to a generative model, we then introduce a sample-efficient algorithm for learning the CCE whose sample complexity scales polynomially with all relevant parameters. To the best of our knowledge, this is the first algorithm to break the curse of multiagency for RMGs, regardless of the uncertainty set formulation.
Paperid:1191
Authors:Michael Munn · Susan Wei
Abstract:
Recent advances in artificial intelligence have been fueled by the development of foundation models such as BERT, GPT, T5, and Vision Transformers. These models are first pretrained on vast and diverse datasets and then adapted to specific downstream tasks, often with significantly less data. However, the mechanisms behind the success of this ubiquitous pretrain-then-adapt paradigm remain underexplored, particularly the characteristics of pretraining checkpoints that enhance downstream adaptation. We introduce a Bayesian model selection criterion, called the downstream free energy, which quantifies a checkpoint’s adaptability by measuring the concentration of nearby favorable parameters for a downstream task. We demonstrate that this Bayesian model selection criterion can be effectively implemented without access to the downstream data or prior knowledge of the downstream task. Furthermore, we provide empirical evidence that the criterion reliably correlates with improved fine-tuning performance, offering a principled approach to predicting model adaptability.
Paperid:1192
Authors:Yixin Cheng · Hongcheng Guo · Yangming Li · Leonid Sigal
Abstract:
Text watermarking aims to subtly embed statistical signals into text by controlling the Large Language Model (LLM)'s sampling process, enabling watermark detectors to verify that the output was generated by the specified model. The robustness of these watermarking algorithms has become a key factor in evaluating their effectiveness. Current text watermarking algorithms embed watermarks in high-entropy tokens to ensure text quality. In this paper, we reveal that this seemingly benign design can be exploited by attackers, posing a significant risk to the robustness of the watermark. We introduce a generic and efficient paraphrasing attack, the Self-Information Rewrite Attack (SIRA), which leverages this vulnerability by calculating the self-information of each token to identify potential pattern tokens and perform targeted attacks. Our work exposes a widely prevalent vulnerability in current watermarking algorithms. The experimental results show SIRA achieves nearly 100\% attack success rates on seven recent watermarking methods at a cost of only \$0.88 per million tokens. Our approach does not require any access to the watermark algorithms or the watermarked LLM, and can seamlessly transfer to any LLM as the attack model, even mobile-level models. Our findings highlight the urgent need for more robust watermarking.
Paperid:1193
Authors:Wen Wang · Ruibing Hou · Hong Chang · Shiguang Shan · Xilin Chen
Abstract:
Large audio-language models (LALMs), built upon powerful Large Language Models (LLMs), have exhibited remarkable audio comprehension and reasoning capabilities. However, the training of LALMs demands a large corpus of audio-language pairs, which requires substantial costs in both data collection and training resources. In this paper, we propose MATS, an audio-language multimodal LLM designed to handle Multiple Audio tasks using solely Text-only Supervision. By leveraging pre-trained audio-language alignment models such as CLAP, we develop a text-only training strategy that projects the shared audio-language latent space into the LLM latent space, endowing the LLM with audio comprehension capabilities without relying on audio data during training. To further bridge the modality gap between audio and language embeddings within CLAP, we propose the Strongly-related noisy text with audio (Santa) mechanism. Santa maps audio embeddings into the CLAP language embedding space while preserving essential information from the audio input. Extensive experiments demonstrate that MATS, despite being trained exclusively on text data, achieves competitive performance compared to recent LALMs trained on large-scale audio-language pairs.
Paperid:1194
Authors:Haibo Chen · Xin Wang · Zeyang Zhang · Haoyang Li · Ling Feng · Wenwu Zhu
Abstract:
Graph Foundation Models (GFMs) aim to share graph knowledge across diverse graph domains and tasks to boost graph machine learning. However, existing GFMs rely on hand-designed and fixed Graph Neural Network (GNN) architectures, failing to utilize optimal graph neural architectures \textit{w.r.t.} specific domains and tasks, inevitably leading to suboptimal performance in diverse graph domains and tasks. In this paper, we explore automated graph neural architecture search (GNAS) for GFMs, which suffers from the \textit{architecture inconsistency} problem, i.e., the optimal architectures for different tasks and domains vary. We tackle this problem by discovering an invariant graph-architecture relationship across domains and tasks. However, this imposes three challenges: i) how to capture invariant and variant patterns; ii) how to customize graph neural architectures to adapt to diverse domains and tasks; iii) how to mitigate the data domination phenomenon during the architecture search process. To address these challenges, we propose a novel Automated Graph Foundation Model with Adaptive Graph Neural Architecture Customization (\textbf{GFA}). Specifically, we first propose a disentangled contrastive graph encoder to learn invariant and variant patterns from graph data. Then, we design an invariant-guided architecture customization strategy to customize graph neural architectures for data from diverse domains and tasks. Finally, we propose a curriculum architecture customization mechanism to mitigate the phenomenon of some particular data dominating the search process. We further provide theoretical analysis demonstrating the limitation of existing GNAS methods under the \textit{architecture inconsistency} problem. Extensive experiments demonstrate that our GFA model outperforms baselines, achieving state-of-the-art performance. To the best of our knowledge, this is the first work to explore the problem of graph neural architecture search for GFMs.
Paperid:1195
Authors:Konstantin Donhauser · Kristina Ulicna · Gemma Moran · Aditya Ravuri · Kian Kenyon-Dean · Cian Eastwood · Jason Hartford
Abstract:
Sparse dictionary learning (DL) has become a popular tool for interpreting trained large language models (LLMs). Specifically, sparse DL approaches can extract semantically meaningful concepts from the internals of LLMs trained on text. In this work, we ask if DL can also be used to extract concepts from less human-interpretable scientific data, such as cell microscopy foundation models trained on cell imaging data, where limited prior knowledge exists regarding which high-level concepts should arise. We propose a novel combination of a DL algorithm, Iterative Codebook Feature Learning~(ICFL), with a pre-processing step using PCA whitening from a control dataset. We demonstrate that our approach successfully retrieves biologically meaningful concepts, such as cell type and genetic perturbations. Moreover, we demonstrate how our approach can help better understand morphological changes in cells induced by genetic perturbations.
Paperid:1196
Authors:Fan Qi · Daxu Shi · Chuokun Xu · Changsheng Xu · Shuai Li
Abstract:
Federated Distillation (FedKD) relies on lightweight knowledge carriers like logits for efficient client-server communication. Although logit-based methods have demonstrated promise in addressing statistical and architectural heterogeneity in federated learning (FL), current approaches remain constrained by suboptimal temperature calibration during knowledge fusion. To address these limitations, we propose ReT-FHD, a framework featuring: 1) Multi-level Elastic Temperature, which dynamically adjusts distillation intensities across model layers, achieving optimized knowledge transfer between heterogeneous local models; 2) Category-Aware Global Temperature Scaling, which implements class-specific temperature calibration based on confidence distributions in global logits, enabling personalized distillation policies; 3) Z-Score Guard, a blockchain-verified validation mechanism mitigating 44\% of label-flipping and model poisoning attacks. Evaluations across diverse benchmarks with varying model/data heterogeneity demonstrate that ReT-FHD achieves significant accuracy improvements over baseline methods while substantially reducing communication costs compared to existing approaches. Our work establishes that properly calibrated logits can serve as self-sufficient carriers for building scalable and secure heterogeneous FL systems. The code is available in the additional material.
Paperid:1197
Authors:Ye Liu · Yuntian Chen
Abstract:
Automotive drag coefficient ($C_d$) is pivotal to energy efficiency, fuel consumption, and aerodynamic performance. However, costly computational fluid dynamics (CFD) simulations and wind tunnel tests struggle to meet the rapid-iteration demands of automotive design. We present DragSolver, a Transformer-based framework for rapid and accurate $C_d$ estimation from large-scale, diverse 3D vehicle models. DragSolver tackles four key real-world challenges: (1) multi-scale feature extraction to capture both global shape and fine local geometry; (2) heterogeneous scale normalization to handle meshes with varying sizes and densities; (3) surface-guided gating to suppress internal structures irrelevant to external aerodynamics; and (4) epistemic uncertainty estimation via Monte Carlo dropout for risk-aware design. Extensive evaluations on three industrial-scale datasets (DrivaerNet, DrivaerNet++, and DrivaerML) show that DragSolver outperforms existing approaches in accuracy and generalization, achieving an average reduction in relative $L_2$ error of 58.7% across real-world datasets. Crucially, DragSolver is the first to achieve reliable, real-time $C_d$ inference on production-level automotive geometries. We will release our code upon acceptance to encourage broader adoption of learning-based aerodynamic analytics.
Paperid:1198
Authors:Jaehyeon Kim · Taehong Moon · Keon Lee · Jaewoong Cho
Abstract:
We introduce ResGen, an efficient Residual Vector Quantization (RVQ)-based generative model for high-fidelity generation with fast sampling. RVQ improves data fidelity by increasing the number of quantization steps, referred to as depth, but deeper quantization typically increases inference steps in generative models. To address this, ResGen directly predicts the vector embedding of collective tokens rather than individual ones, ensuring that inference steps remain independent of RVQ depth. Additionally, we formulate token masking and multi-token prediction within a probabilistic framework using discrete diffusion and variational inference. We validate the efficacy and generalizability of the proposed method on two challenging tasks across different modalities: conditional image generation on ImageNet 256×256 and zero-shot text-to-speech synthesis. Experimental results demonstrate that ResGen outperforms autoregressive counterparts in both tasks, delivering superior performance without compromising sampling speed. Furthermore, as we scale the depth of RVQ, our generative models exhibit enhanced generation fidelity or faster sampling speeds compared to similarly sized baseline models.
Paperid:1199
Authors:Neha Sangwan · Arya Mazumdar
Abstract:
We consider the problem of *exact* recovery of a $k$-sparse binary vector from generalized linear measurements (such as *logistic regression*). We analyze the *linear estimation* algorithm (Plan, Vershynin, Yudovina, 2017), and also show information-theoretic lower bounds on the number of required measurements. As a consequence of our results, for noisy one-bit quantized linear measurements ($\mathsf{1bCS}$), we obtain a sample complexity of $O((k+\sigma^2)\log{n})$, where $\sigma^2$ is the noise variance. This is shown to be optimal due to the information-theoretic lower bound. We also obtain a tight sample complexity characterization for logistic regression. Since $\mathsf{1bCS}$ is a strictly harder problem than noisy linear measurements ($\mathsf{SparseLinearReg}$) because of the added quantization, the same sample complexity is achievable for $\mathsf{SparseLinearReg}$. While this sample complexity can be obtained via the popular lasso algorithm, linear estimation is computationally more efficient. Our lower bound holds for any set of measurements for $\mathsf{SparseLinearReg}$ (a similar bound was known for Gaussian measurement matrices) and is closely matched by the maximum-likelihood upper bound. For $\mathsf{SparseLinearReg}$, it was conjectured in Gamarnik and Zadik, 2017 that there is a statistical-computational gap and the number of measurements should be at least $(2k+\sigma^2)\log{n}$ for efficient algorithms to exist. It is worth noting that our results imply that there is no such statistical-computational gap for $\mathsf{1bCS}$ and logistic regression.
Paperid:1200
Authors:Zongyu Lin · Yao Tang · Xingcheng Yao · Da Yin · ziniu hu · Yizhou Sun · Kai-Wei Chang
Abstract:
Language agents have become a promising solution to complex interactive tasks. One of the key ingredients to the success of language agents is the reward model on the trajectory of the agentic workflow, which provides valuable guidance during training or inference. However, due to the lack of annotations of intermediate interactions, most existing works use an outcome reward model to optimize policies across entire trajectories. This may lead to suboptimal policies and hinder the overall performance. To address this, we propose QLASS (Q-guided Language Agent Stepwise Search) to automatically generate annotations by estimating Q-values in a stepwise manner for open language agents. By introducing a reasoning tree and performing process reward modeling, QLASS provides effective intermediate guidance for each step. With the stepwise guidance, we propose a Q-guided generation strategy to enable language agents to better adapt to long-term value, resulting in significant performance improvement during model inference on complex interactive agent tasks. Notably, even with almost half the annotated data, QLASS retains strong performance, demonstrating its efficiency in handling limited supervision. We also empirically demonstrate that QLASS can lead to more effective decision making through qualitative analysis. We will release our code and data.
Paperid:1201
Authors:Yu-Yang Qian · Yuan-Ze Xu · Zhen-Yu Zhang · Peng Zhao · Zhi-Hua Zhou
Abstract:
Many real-world applications collect data in a streaming environment, where learning tasks are encountered sequentially. This necessitates continual learning (CL) to update models online, enabling adaptation to new tasks while preserving past knowledge to prevent catastrophic forgetting. Nowadays, with the flourishing of large pre-trained models (LPMs), efficiency has become increasingly critical for CL, due to their substantial computational demands and growing parameter sizes. In this paper, we introduce TreeLoRA (K-D Tree of Low-Rank Adapters), a novel approach that constructs layer-wise adapters by leveraging hierarchical gradient similarity to enable efficient CL, particularly for LPMs. To reduce the computational burden of task similarity estimation, we employ bandit techniques to develop an algorithm based on lower confidence bounds to efficiently explore the task structure. Furthermore, we use sparse gradient updates to facilitate parameter optimization, making the approach better suited for LPMs. Theoretical analysis is provided to justify the rationale behind our approach, and experiments on both vision transformers (ViTs) and large language models (LLMs) demonstrate the effectiveness and efficiency of our approach across various domains, including vision and natural language processing tasks.
Paperid:1202
Authors:Anna Thomas · Adam Yee · Andrew Mayne · Maya Mathur · Dan Jurafsky · Kristina Gligoric
Abstract:
Food systems are responsible for a third of human-caused greenhouse gas emissions. We investigate what Large Language Models (LLMs) can contribute to reducing the environmental impacts of food production. We define a typology of design and prediction tasks based on the sustainable food literature and collaboration with domain experts, and evaluate six LLMs on four tasks in our typology. For example, for a sustainable protein design task, food science experts estimated that collaboration with an LLM can reduce time spent by 45% on average, compared to 22% for collaboration with another expert human food scientist. However, for a sustainable menu design task, LLMs produce suboptimal solutions when instructed to consider both human satisfaction and climate impacts. We propose a general framework for integrating LLMs with combinatorial optimization to improve reasoning capabilities. Our approach decreases emissions of food choices by 79% in a hypothetical restaurant while maintaining participants' satisfaction with their set of choices. Our results demonstrate LLMs' potential, supported by optimization techniques, to accelerate sustainable food development and adoption.
Paperid:1203
Authors:Mateusz Piotrowski · Paul Riechers · Daniel Filan · Adam Shai
Abstract:
What computational structures emerge in transformers trained on next-token prediction? In this work, we provide evidence that transformers implement constrained Bayesian belief updating---a parallelized version of partial Bayesian inference shaped by architectural constraints. To do this, we integrate the model-agnostic theory of optimal prediction with mechanistic interpretability to analyze transformers trained on a tractable family of hidden Markov models that generate rich geometric patterns in neural activations. We find that attention heads carry out an algorithm with a natural interpretation in the probability simplex, and create representations with distinctive geometric structure. We show how both the algorithmic behavior and the underlying geometry of these representations can be theoretically predicted in detail---including the attention pattern, OV-vectors, and embedding vectors---by modifying the equations for optimal future token predictions to account for the architectural constraints of attention. Our approach provides a principled lens on how gradient descent resolves the tension between optimal prediction and architectural design.
Paperid:1204
Authors:Zhengchao Wan · Qingsong Wang · Gal Mishne · Yusu Wang
Abstract:
Flow matching (FM) models extend ODE-sampler-based diffusion models into a general framework, significantly reducing sampling steps through learned vector fields. However, the theoretical understanding of FM models, particularly how their sample trajectories interact with the underlying data geometry, remains underexplored. A rigorous theoretical analysis of the FM ODE is essential for sample quality, stability, and broader applicability. In this paper, we advance the theory of FM models through a comprehensive analysis of sample trajectories. Central to our theory is the discovery that the denoiser, a key component of FM models, guides ODE dynamics through attracting and absorbing behaviors that adapt to data geometry. Our results identify and analyze the three stages of ODE evolution: in the initial and intermediate stages, trajectories move toward the mean and local clusters of the data. At the terminal stage, we rigorously establish the convergence of the FM ODE under weak assumptions, addressing scenarios where the data lie on a low-dimensional submanifold---cases that previous results could not handle. Our terminal-stage analysis offers insights into the memorization phenomenon and establishes equivariance properties of FM ODEs. These findings bridge critical gaps in understanding flow matching models, with practical implications for optimizing sampling strategies and architectures guided by the intrinsic geometry of data.
Paperid:1205
Authors:Tyler Kastner · Mark Rowland · Yunhao Tang · Murat Erdogdu · Amir-massoud Farahmand
Abstract:
We study the problem of distributional reinforcement learning using categorical parameterisations and a KL divergence loss. Previous work analyzing categorical distributional RL has done so using a Cramér distance-based loss, simplifying the analysis but creating a theory-practice gap. We introduce a preconditioned version of the algorithm, and prove that it is guaranteed to converge. We further derive the asymptotic variance of the categorical estimates under different learning rate regimes, compare it to that of classical reinforcement learning, and analyze how these updates are affected in the linear function approximation setting. Finally, we empirically validate our results, investigate the relative strengths of using KL losses, and derive a number of actionable insights for practitioners.
Paperid:1206
Authors:Qiuhao Wang · Yuqi Zha · Chin Pang Ho · Marek Petrik
Abstract:
Robust Markov Decision Processes (MDPs) offer a promising framework for computing reliable policies under model uncertainty. While policy gradient methods have gained increasing popularity in robust discounted MDPs, their application to the average-reward criterion remains largely unexplored. This paper proposes Robust Projected Policy Gradient (RP2G), the first generic policy gradient method for robust average-reward MDPs (RAMDPs) that is applicable beyond the typical rectangularity assumption on transition ambiguity. In contrast to existing robust policy gradient algorithms, RP2G incorporates an adaptive decreasing tolerance mechanism for efficient policy updates at each iteration. We also present a comprehensive convergence analysis of RP2G for solving ergodic tabular RAMDPs. Furthermore, we present the first study of the inner worst-case transition evaluation problem in RAMDPs, proposing two gradient-based algorithms tailored for rectangular and general ambiguity sets, each with provable convergence guarantees. Numerical experiments confirm the global convergence of our new algorithm and demonstrate its superior performance.
Paperid:1207
Authors:Matthieu Meeus · Lukas Wutschitz · Santiago Zanella-Beguelin · Shruti Tople · Reza Shokri
Abstract:
How much information about training samples can be gleaned from synthetic data generated by Large Language Models (LLMs)? Overlooking the subtleties of information flow in synthetic data generation pipelines can lead to a false sense of privacy. In this paper, we design membership inference attacks (MIAs) that target data used to fine-tune pre-trained LLMs that are then used to synthesize data, particularly when the adversary does not have access to the fine-tuned model but only to the synthetic data. We show that such data-based MIAs do significantly better than a random guess, meaning that synthetic data leaks information about the training data. Further, we find that canaries crafted to maximize vulnerability to model-based MIAs are sub-optimal for privacy auditing when only synthetic data is released. Such out-of-distribution canaries have limited influence on the model’s output when prompted to generate useful, in-distribution synthetic data, which drastically reduces their vulnerability. To tackle this problem, we leverage the mechanics of auto-regressive models to design canaries with an in-distribution prefix and a high-perplexity suffix that leave detectable traces in synthetic data. This enhances the power of data-based MIAs and provides a better assessment of the privacy risks of releasing synthetic data generated by LLMs.
Paperid:1208
Authors:Meera Hahn · Wenjun Zeng · Nithish Kannen · Rich Galt · Kartikeya Badola · Been Kim · Zi Wang
Abstract:
User prompts for generative AI models are often underspecified, leading to suboptimal responses. This problem is particularly evident in text-to-image (T2I) generation, where users commonly struggle to articulate their precise intent. This disconnect between the user's vision and the model's interpretation often forces users to painstakingly and repeatedly refine their prompts. To address this, we propose a design for proactive T2I agents equipped with an interface to (1) actively ask clarification questions when uncertain, and (2) present their understanding of user intent as an understandable belief graph that a user can edit. We build simple prototypes for such agents and verify their effectiveness through both human studies and automated evaluation. We observed that at least 90% of human subjects found these agents and their belief graphs helpful for their T2I workflow. Moreover, we develop a scalable automated evaluation approach using two agents, one holding a ground-truth image and the other trying to ask as few questions as possible to align with it. On DesignBench, a benchmark we created for artists and designers, the COCO dataset (Lin et al., 2014), and ImageInWords (Garg et al., 2024), we observed that these T2I agents were able to ask informative questions and elicit crucial information to achieve successful alignment with at least 2 times higher VQAScore (Lin et al., 2024) than the standard single-turn T2I generation. We plan to release our code and dataset: https://anonymous.4open.science/r/proactivet2iagents-6F68/README.md (anonymized code).
Paperid:1209
Authors:Angxiao Yue · Zichong Wang · Hongteng Xu
Abstract:
Protein backbone generation plays a central role in de novo protein design and is significant for many biological and medical applications. Although diffusion- and flow-based generative models provide potential solutions to this challenging task, they often generate proteins with undesired designability and suffer from computational inefficiency. In this study, we propose a novel rectified quaternion flow (ReQFlow) matching method for fast and high-quality protein backbone generation. In particular, our method generates a local translation and a 3D rotation from random noise for each residue in a protein chain, representing each 3D rotation as a unit quaternion and constructing its flow by spherical linear interpolation (SLERP) in an exponential format. We train the model by quaternion flow (QFlow) matching with guaranteed numerical stability and rectify the QFlow model to accelerate its inference and improve the designability of generated protein backbones, leading to the proposed ReQFlow model. Experiments show that ReQFlow achieves state-of-the-art performance in protein backbone generation while requiring much fewer sampling steps and significantly less inference time (e.g., being 37$\times$ faster than RFDiffusion and 62$\times$ faster than Genie2 when generating a backbone of length 300), demonstrating its effectiveness and efficiency.
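The quaternion flow construction mentioned in this abstract relies on spherical linear interpolation between unit quaternions. A minimal, self-contained SLERP sketch (an illustrative reconstruction, not ReQFlow's implementation) looks like this:

```python
import numpy as np

# Minimal SLERP sketch for unit quaternions, the interpolation named in the
# abstract; this is an illustrative reconstruction, not ReQFlow's code.

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions q0 and q1."""
    dot = float(np.dot(q0, q1))
    if dot < 0.0:                  # take the shorter great-circle arc
        q1, dot = -q1, -dot
    theta = np.arccos(min(dot, 1.0))
    if theta < 1e-8:               # nearly identical: linear blend is safe
        q = (1.0 - t) * q0 + t * q1
    else:
        q = (np.sin((1.0 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)
    return q / np.linalg.norm(q)   # re-normalize against round-off

identity = np.array([1.0, 0.0, 0.0, 0.0])
rot_z_90 = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
halfway = slerp(identity, rot_z_90, 0.5)   # a 45-degree rotation about z
```

Interpolating on the unit-quaternion sphere keeps every intermediate point a valid rotation, which is what makes SLERP a natural building block for rotation flows.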
Paperid:1210
Authors:Batiste Le Bars · Pierre Humbert
Abstract:
We study the question of volume optimality in split conformal regression, a topic still poorly understood in comparison to coverage control. Using the fact that the calibration step can be seen as an empirical volume minimization problem, we first derive a finite-sample upper bound on the excess volume loss of the interval returned by the classical split method. This important quantity measures the difference in length between the interval obtained with the split method and the shortest oracle prediction interval. Then, we introduce EffOrt, a methodology that modifies the learning step so that the base prediction function is selected in order to minimize the length of the returned intervals. In particular, our theoretical analysis of the excess volume loss of the prediction sets produced by EffOrt reveals the links between the learning and calibration steps, and notably the impact of the choice of the function class of the base predictor. We also introduce Ad-EffOrt, an extension of the previous method, which produces intervals whose size adapts to the value of the covariate. Finally, we evaluate the empirical performance and the robustness of our methodologies.
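For context, the calibration step that this abstract reinterprets as empirical volume minimization is, in the classical split method, a quantile computation over nonconformity scores. A hedged sketch follows, where the linear base predictor and synthetic data are illustrative assumptions:

```python
import numpy as np

# Hedged sketch of classical split conformal calibration: the interval
# half-width is an empirical quantile of absolute residuals. The linear
# base predictor and synthetic data are illustrative assumptions.

def split_conformal_interval(predict, X_cal, y_cal, x_new, alpha=0.1):
    """Prediction interval with a (1 - alpha) marginal coverage guarantee."""
    scores = np.abs(y_cal - predict(X_cal))             # nonconformity scores
    k = int(np.ceil((len(scores) + 1) * (1 - alpha)))   # conformal rank
    q = np.sort(scores)[min(k, len(scores)) - 1]        # calibrated half-width
    center = predict(np.asarray([x_new]))[0]
    return center - q, center + q

rng = np.random.default_rng(0)
X_cal = rng.uniform(-1.0, 1.0, 500)
y_cal = 2.0 * X_cal + rng.normal(0.0, 0.1, 500)         # noisy linear data
lo, hi = split_conformal_interval(lambda X: 2.0 * X, X_cal, y_cal, 0.5)
```

Note that the interval length is entirely driven by the residuals of the base predictor on the calibration set, which is exactly the lever the learning-step modification in the abstract targets.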
Paperid:1211
Authors:Alberto Bernacchia
Abstract:
Second-order optimization methods, which leverage the local curvature of the loss function, have the potential to dramatically accelerate the training of machine learning models. However, these methods are often hindered by the computational burden of constructing and inverting large curvature matrices with $\mathcal{O}(p^2)$ elements, where $p$ is the number of parameters. In this work, we present a theoretical framework for exact computation of the global curvature by leveraging the intrinsic symmetries of neural networks, such as invariance under parameter permutations. For Multi-Layer Perceptrons (MLPs), our approach reveals that the global curvature can be expressed in terms of $\mathcal{O}(d^2 + L^2)$ independent values, where $d$ is the number of input/output dimensions and $L$ is the number of layers, significantly reducing the computational burden compared to the $\mathcal{O}(p^2)$ elements of the full matrix. These values can be estimated efficiently, enabling precise curvature computations. We demonstrate that the curvature structure is influenced by the choice of activation function, with ReLU activations resulting in more complex curvature than Tanh or linear activations. To evaluate the practical implications of our framework, we apply exact second-order optimization to synthetic data, achieving markedly faster convergence compared to traditional optimization methods. Our findings pave the way for a better understanding of the loss landscape of neural networks, and for designing more efficient training methodologies in deep learning.
Paperid:1212
Authors:Yongchao Chen · Yilun Hao · Yueying Liu · Yang Zhang · Chuchu Fan
Abstract:
Existing methods fail to effectively steer Large Language Models (LLMs) between textual reasoning and code generation, leaving symbolic computing capabilities underutilized. We introduce CodeSteer, an effective method for guiding LLM code/text generation. We construct a comprehensive benchmark SymBench comprising 37 symbolic tasks with adjustable complexity and also synthesize datasets of 12k multi-round guidance/generation trajectories and 5.5k guidance comparison pairs. We fine-tune the Llama-3-8B model with a newly designed multi-round supervised fine-tuning (SFT) and direct preference optimization (DPO). The resulting model, CodeSteerLLM, augmented with the proposed symbolic and self-answer checkers, effectively guides the code/text generation of larger models. Augmenting GPT-4o with CodeSteer raises its average performance score from 53.3 to 86.4, even outperforming the strongest existing LLMs: OpenAI o1 (82.7), o1-preview (74.8), and DeepSeek R1 (76.8) across all 37 tasks (28 seen, 9 unseen). Trained for GPT-4o, CodeSteer demonstrates superior generalizability, providing an average 41.8 performance boost on Claude, Mistral, and GPT-3.5. CodeSteer-guided LLMs fully harness symbolic computing to maintain strong performance on highly complex tasks.
Paperid:1213
Authors:Haotian Si · Changhua Pei · Dan Pei · Gaogang Xie · Jianhui LI
Abstract:
Recent advances in lightweight time series forecasting models suggest the inherent simplicity of time series forecasting tasks. In this paper, we present CMoS, a super-lightweight time series forecasting model. Instead of learning the embedding of the shapes, CMoS directly models the spatial correlations between different time series chunks. Additionally, we introduce a Correlation Mixing technique that enables the model to capture diverse spatial correlations with minimal parameters, and an optional Periodicity Injection technique to ensure faster convergence. Despite utilizing as little as 1% of the parameter count of the lightweight model DLinear, experimental results demonstrate that CMoS outperforms existing state-of-the-art models across multiple datasets. Furthermore, the learned weights of CMoS exhibit great interpretability, providing practitioners with valuable insights into temporal structures within specific application scenarios.
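The idea of directly modeling correlations between time-series chunks can be illustrated with a deliberately simple linear sketch. The chunk/horizon sizes, least-squares fit, and toy sinusoid below are assumptions for illustration, not the CMoS architecture:

```python
import numpy as np

# Deliberately simple sketch of forecasting by modeling correlations between
# time-series chunks. The chunk/horizon sizes, least-squares fit, and toy
# sinusoid are illustrative assumptions, not the CMoS architecture.

def fit_chunk_mapping(series, chunk, horizon):
    """Fit a linear map from each history chunk to the values that follow it."""
    X, Y = [], []
    for t in range(len(series) - chunk - horizon + 1):
        X.append(series[t:t + chunk])
        Y.append(series[t + chunk:t + chunk + horizon])
    W, *_ = np.linalg.lstsq(np.asarray(X), np.asarray(Y), rcond=None)
    return W                                   # shape (chunk, horizon)

t = np.arange(400)
series = np.sin(2 * np.pi * t / 24)            # toy signal with period 24
W = fit_chunk_mapping(series, chunk=48, horizon=24)
forecast = series[-48:] @ W                    # predict the next 24 steps
```

On a purely periodic signal, the learned chunk-to-future weights recover the continuation almost exactly, which hints at why such chunk-level linear structure can carry a surprising amount of forecasting power.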
Paperid:1214
Authors:Florian Peter Busch · ROSHNI KAMATH · Rupert Mitchell · Wolfgang Stammer · Kristian Kersting · Martin Mundt
Abstract:
A dataset is confounded if it is most easily solved via a spurious correlation which fails to generalize to new data. In this work, we show that, in a continual learning setting where confounders may vary in time across tasks, the challenge of mitigating the effect of confounders far exceeds the standard forgetting problem normally considered. In particular, we provide a formal description of such continual confounders and identify that, in general, spurious correlations are easily ignored when training for all tasks jointly, but it is harder to avoid confounding when they are considered sequentially. These descriptions serve as a basis for constructing a novel CLEVR-based continually confounded dataset, which we term the ConCon dataset. Our evaluations demonstrate that standard continual learning methods fail to ignore the dataset's confounders. Overall, our work highlights the challenges of confounding factors, particularly in continual learning settings, and demonstrates the need for developing continual learning methods that robustly tackle these challenges.
Paperid:1215
Authors:Kwangjun Ahn · Gagik Magakyan · Ashok Cutkosky
Abstract:
This work investigates the effectiveness of schedule-free methods, developed by A. Defazio et al. (NeurIPS 2024), in nonconvex optimization settings, inspired by their remarkable empirical success in training neural networks. Specifically, we show that schedule-free SGD achieves optimal iteration complexity for nonsmooth, non-convex optimization problems. Our proof begins with the development of a general framework for online-to-nonconvex conversion, which converts a given online learning algorithm into an optimization algorithm for nonconvex losses. Our general framework not only recovers existing conversions but also leads to two novel conversion schemes. Notably, one of these new conversions corresponds directly to schedule-free SGD, allowing us to establish its optimality. Additionally, our analysis provides valuable insights into the parameter choices for schedule-free SGD, addressing a theoretical gap that the convex theory cannot explain.
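For readers unfamiliar with schedule-free SGD, a hedged sketch of the iteration of Defazio et al. on a toy quadratic follows; the objective, learning rate, interpolation constant, and step count are illustrative assumptions, not settings from either paper:

```python
import numpy as np

# Hedged sketch of the schedule-free SGD iteration of Defazio et al. on a toy
# quadratic; the objective, learning rate, interpolation constant, and step
# count are illustrative assumptions, not settings from either paper.

def schedule_free_sgd(grad, z0, lr=0.1, beta=0.9, steps=1000):
    z = z0.copy()                          # base SGD sequence
    x = z0.copy()                          # running average (returned solution)
    for t in range(1, steps + 1):
        y = (1.0 - beta) * z + beta * x    # gradients are evaluated here
        z = z - lr * grad(y)
        x = x + (z - x) / (t + 1)          # online average of the z iterates
    return x

# Minimize f(w) = (w - 3)^2, whose gradient is 2 (w - 3).
x_star = schedule_free_sgd(lambda w: 2.0 * (w - 3.0), np.zeros(1))
```

The distinguishing feature is that no learning-rate schedule appears: the averaging of the `z` iterates plays the role a decaying schedule normally would.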
Paperid:1216
Authors:Aaditya Singh · Ted Moskovitz · Sara Dragutinović · Felix Hill · Stephanie Chan · Andrew Saxe
Abstract:
In-context learning (ICL) is a powerful ability that emerges in transformer models, enabling them to learn from context without weight updates. Recent work has established emergent ICL as a transient phenomenon that can sometimes disappear after long training times. In this work, we sought a mechanistic understanding of these transient dynamics. Firstly, we find that—after the disappearance of ICL—the asymptotic strategy is a remarkable hybrid between in-weights and in-context learning, which we term “context-constrained in-weights learning” (CIWL). CIWL is in competition with ICL, and eventually replaces it as the dominant strategy of the model (thus leading to ICL transience). However, we also find that the two competing strategies actually share sub-circuits, which gives rise to cooperative dynamics as well. For example, in our setup, ICL is unable to emerge quickly on its own, and can only be enabled through the simultaneous slow development of asymptotic CIWL. CIWL thus both cooperates and competes with ICL, a phenomenon we term “strategy coopetition”. We propose a minimal mathematical model that reproduces these key dynamics and interactions. Informed by this model, we were able to identify a setup where ICL is truly emergent and persistent.
Paperid:1217
Authors:Dong-Hee Shin · Young-Han Son · Hyun Jung Lee · Deok-Joong Lee · Tae-Eui Kam
Abstract:
Molecular discovery has attracted significant attention in scientific fields for its ability to generate novel molecules with desirable properties. Although numerous methods have been developed to tackle this problem, most rely on an online setting that requires repeated online evaluation of candidate molecules using the oracle. However, in real-world molecular discovery, the oracle is often represented by wet-lab experiments, making this online setting impractical due to the significant time and resource demands. To fill this gap, we propose the Molecular Stitching (MolStitch) framework, which utilizes a fixed offline dataset to explore and optimize molecules without the need for repeated oracle evaluations. Specifically, MolStitch leverages existing molecules from the offline dataset to generate novel `stitched molecules' that combine their desirable properties. These stitched molecules are then used as training samples to fine-tune the generative model using preference optimization techniques. Experimental results on various offline multi-objective molecular optimization problems validate the effectiveness of MolStitch. The source code is available online.
Paperid:1218
Authors:Varshita Kolipaka · Akshit Sinha · Debangan Mishra · Sumit Kumar · Arvindh Arun · Shashwat Goel · Ponnurangam Kumaraguru
Abstract:
Graph Neural Networks (GNNs) are increasingly being used for a variety of ML applications on graph data. Because graph data does not satisfy the independently and identically distributed (*i.i.d.*) assumption, adversarial manipulations or incorrect data can propagate to other data points through message passing, which deteriorates the model's performance. To allow model developers to remove the adverse effects of manipulated entities from a trained GNN, we study the recently formulated problem of *Corrective Unlearning*. We find that current graph unlearning methods fail to unlearn the effect of manipulations even when the whole manipulated set is known. We introduce a new graph unlearning method, **Cognac**, which can unlearn the effect of the manipulation set even when only $5$% of it is identified. It recovers most of the performance of a strong oracle with fully corrected training data, even beating retraining from scratch without the deletion set, and is 8x more efficient while also scaling to large datasets. We hope our work assists GNN developers in mitigating harmful effects caused by issues in real-world data, post-training.
Paperid:1219
Authors:Xiaoyu Li · Zhao Song · Shenghao Xie
Abstract:
The Fourier transform is a fundamental tool in theoretical computer science and signal processing. In particular, when the signal is sparse in the frequency domain---having only $k$ distinct frequencies---sparse Fourier transform (SFT) algorithms can recover the signal in sublinear time (proportional to the sparsity $k$). Most prior research focused on SFT for discrete signals, designing both randomized and deterministic algorithms for one-dimensional and high-dimensional discrete signals. However, SFT for continuous signals (i.e., $x^*(t)=\sum_{j=1}^k v_j e^{2\pi \mathbf{i} f_j t}$ for $t\in [0,T]$) is a more challenging task. Note that the discrete SFT algorithms are not directly applicable to continuous signals due to the sparsity blow-up from the discretization. In seminal work, Price and Song (FOCS 2015) proposed the first randomized algorithm that achieves an $\ell_2$ recovery guarantee in $O(k\cdot \mathrm{polylog}(FT/\eta))$ time, where $F$ is the band-limit of the frequencies and $\eta$ is the frequency gap. Nevertheless, whether we can solve this problem without using randomness remains open. In this work, we address this gap and introduce the first sublinear-time deterministic sparse Fourier transform algorithm in the continuous setting. Specifically, our algorithm uses $O(k^2 \cdot \mathrm{polylog}(FT/\eta))$ samples and $O(k^2 \cdot \mathrm{polylog}(FT/\eta))$ time to reconstruct the on-grid signal with arbitrary noise in the $\ell_1/\ell_2$ mixed regime. This is the optimal recovery guarantee that can be achieved by any deterministic approach.
Paperid:1220
Authors:Srinath Dama · Kevin L Course · Prasanth B Nair
Abstract:
We present an operator-theoretic framework for time-series forecasting that involves learning a continuous time-shift operator associated with temporal and spatio-temporal problems. The time-shift operator learning paradigm offers a continuous relaxation of the discrete lag factor used in traditional autoregressive models, enabling the history of a function up to a given time to be mapped to its future values. To parametrize the operator learning problem, we propose Khatri-Rao neural operators -- a new architecture for defining non-stationary integral transforms which achieves almost linear cost on spatial and spatio-temporal problems. The proposed KRNO integral transform layer can be viewed as a continuous analog of upsampling and downsampling layers used in convolutional neural networks. From a practical perspective, the advancements made in this work allow us to forecast at super-resolution in both space and time, while handling observations that are sampled with a moderate level of irregularity. Detailed numerical studies across a wide range of temporal and spatio-temporal benchmark problems suggest that the proposed approach is highly scalable and compares favourably with state-of-the-art methods.
Paperid:1221
Authors:Xin Zou · Yizhou WANG · Yibo Yan · Yuanhuiyi Lyu · Kening Zheng · Sirui Huang · Junkai Chen · Peijie Jiang · Jia Liu · Chang Tang · Xuming Hu
Abstract:
Despite their impressive capabilities, Multimodal Large Language Models (MLLMs) are prone to hallucinations, i.e., generated content that is nonsensical or unfaithful to input sources. Unlike in LLMs, hallucinations in MLLMs often stem from the sensitivity of the text decoder to visual tokens, leading to a phenomenon akin to "amnesia" about visual information. To address this issue, we propose MemVR, a novel decoding paradigm inspired by common cognition: when the memory of an image seen the moment before is forgotten, people will look at it again for factual answers. Following this principle, we treat visual tokens as supplementary evidence, reinjecting them into the MLLM through the Feed Forward Network (FFN) as “key-value memory” at the middle trigger layer. This look-twice mechanism occurs when the model exhibits high uncertainty during inference, effectively enhancing factual alignment. Comprehensive experimental evaluations demonstrate that MemVR significantly mitigates hallucination across various MLLMs and excels in general benchmarks without incurring additional time overhead. Importantly, MemVR is a plug-and-play and task-agnostic method compatible with any MLLM, without extra training, emphasizing its widespread applicability.
Paperid:1222
Authors:Edoardo Urettini · Antonio Carta
Abstract:
Online Continual Learning (OCL) models continuously adapt to non-stationary data streams, usually without task information. These settings are complex and many traditional CL methods fail, while online methods (mainly replay-based) suffer from instabilities after the task shift. To address this issue, we formalize replay-based OCL as a second-order online joint optimization with explicit KL-divergence constraints on replay data. We propose Online Curvature-Aware Replay (OCAR) to solve the problem: a method that leverages second-order information of the loss using a K-FAC approximation of the Fisher Information Matrix (FIM) to precondition the gradient. The FIM acts as a stabilizer to prevent forgetting while also accelerating the optimization in non-interfering directions. We show how to adapt the estimation of the FIM to a continual setting, stabilizing second-order optimization for non-iid data and uncovering the role of the Tikhonov regularization in the stability-plasticity tradeoff. Empirical results show that OCAR outperforms state-of-the-art methods on continual metrics, achieving higher average accuracy throughout the training process in three different benchmarks.
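The K-FAC preconditioning with Tikhonov damping that OCAR builds on can be sketched as follows for a single linear layer; the pi-scaled damping split and the toy matrices are illustrative assumptions, not OCAR's exact update:

```python
import numpy as np

# Illustrative sketch of K-FAC-style preconditioning with Tikhonov damping for
# one linear layer. The pi-scaled damping split and the toy matrices are
# assumptions for demonstration, not OCAR's exact update.

def kfac_precondition(grad_W, A, G, damping=1e-2):
    """Approximately apply the damped inverse Fisher to a weight gradient.

    A: covariance of layer inputs, G: covariance of output gradients; the
    Fisher block is approximated by their Kronecker product, so the damped
    inverse is applied factor-wise as G_d^{-1} grad_W A_d^{-1}.
    """
    # Split the damping between the two factors, balanced by average traces.
    pi = np.sqrt((np.trace(A) / A.shape[0]) / (np.trace(G) / G.shape[0]))
    A_d = A + pi * np.sqrt(damping) * np.eye(A.shape[0])
    G_d = G + (np.sqrt(damping) / pi) * np.eye(G.shape[0])
    # Two linear solves instead of inverting the full Kronecker product.
    return np.linalg.solve(G_d, np.linalg.solve(A_d.T, grad_W.T).T)

grad_W = np.arange(6.0).reshape(2, 3)      # toy (out=2, in=3) gradient
precond = kfac_precondition(grad_W, np.eye(3), np.eye(2), damping=0.04)
```

The Tikhonov term keeps both factor inverses well-conditioned, which is precisely the knob the abstract identifies in the stability-plasticity tradeoff.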
Paperid:1223
Authors:Chunyu Xie · Bin Wang · Fanjing Kong · Jincheng Li · Dawei Liang · Gengshen Zhang · Dawei Leng · Yuhui Yin
Abstract:
Contrastive Language-Image Pre-training (CLIP) excels in multimodal tasks such as image-text retrieval and zero-shot classification but struggles with fine-grained understanding due to its focus on coarse-grained short captions. To address this, we propose Fine-Grained CLIP (FG-CLIP), which enhances fine-grained understanding through three key innovations. First, we leverage large multimodal models to generate 1.6 billion long caption-image pairs for capturing global-level semantic details. Second, a high-quality dataset is constructed with 12 million images and 40 million region-specific bounding boxes aligned with detailed captions to ensure precise, context-rich representations. Third, 10 million hard fine-grained negative samples are incorporated to improve the model's ability to distinguish subtle semantic differences. Corresponding training methods are meticulously designed for these data. Extensive experiments demonstrate that FG-CLIP outperforms the original CLIP and other state-of-the-art methods across various downstream tasks, including fine-grained understanding, open-vocabulary object detection, image-text retrieval, and general multimodal benchmarks. These results highlight FG-CLIP's effectiveness in capturing fine-grained image details and improving overall model performance. The related data, code, and models will be publicly released.
Paperid:1224
Authors:Xiangheng Wang · Ziquan Fang · Chenglong Huang · Danlei Hu · Lu Chen · Yunjun Gao
Abstract:
Trajectory representation learning aims to transform raw trajectory data into compact and low-dimensional vectors that are suitable for downstream analysis. However, most existing methods adopt either a free-space view or a road-network view during the learning process, which limits their ability to capture the complex, multi-view spatiotemporal features inherent in trajectory data. Moreover, these approaches rely on task-specific model training, restricting their generalizability and effectiveness for diverse analysis tasks. To this end, we propose GTR, a general, multi-view, and dynamic Trajectory Representation framework built on a pre-train and fine-tune architecture. Specifically, GTR introduces a multi-view encoder that captures the intrinsic multi-view spatiotemporal features. Based on the pre-train and fine-tune architecture, we provide spatio-temporal fusion pre-training with a spatio-temporal mixture of experts to dynamically combine spatial and temporal features, enabling seamless adaptation to diverse trajectory analysis tasks. Furthermore, we propose an online frozen-hot updating strategy to efficiently update the representation model, accommodating the dynamic nature of trajectory data. Extensive experiments on two real-world datasets demonstrate that GTR consistently outperforms 12 state-of-the-art methods across 6 mainstream trajectory analysis tasks. All source code and data are available at https://anonymous.4open.science/r/GTRICML.
Paperid:1225
Authors:Sho Takemori · Yuhei Umeda · Aditya Gopalan
Abstract:
This paper studies a pure exploration problem with linear bandit feedback on continuous arm sets, aiming to identify an $\epsilon$-optimal arm with high probability. Previous approaches for continuous arm sets have employed instance-independent methods due to technical challenges such as the infinite dimensionality of the space of probability measures and the non-smoothness of the objective function. This paper proposes a novel, tractable algorithm that addresses these challenges by leveraging a reparametrization of the sampling distribution and projected subgradient descent. However, this approach introduces new challenges related to the projection and reconstruction of the distribution from the reparametrization. We address these by focusing on the connection to the approximate Carath\'eodory problem. Compared to the original optimization problem on the infinite-dimensional space, our method is tractable, requiring only the solution of quadratic and fractional quadratic problems on the arm set. We establish an instance-dependent optimality for our method, and empirical results on synthetic data demonstrate its superiority over existing instance-independent baselines.
Paperid:1226
Authors:Lingyu Li · Yixu Wang · Haiquan Zhao · Shuqi Kong · Yan Teng · Chunbo Li · Yingchun Wang
Abstract:
With large language models (LLMs) increasingly deployed as cognitive engines for AI agents, the reliability and effectiveness of such deployments critically hinge on their intrinsic agency, which remains understudied. Agency, the ability to flexibly reason, learn, and adapt in dynamic environments, represents a cognitive-level capability independent of specific external modules or applications. We characterize the cognitive process underlying human agency, which unfolds in seven interrelated dimensions: prediction, decision-making, perception, memory, counterfactual thinking, belief updating, and meta-reflection. To rigorously evaluate LLMs' agency, we propose Reflection-Bench, a novel contamination-free benchmark consisting of seven parameterized cognitive tests. Through comprehensive evaluation of 16 models using three prompting strategies under entry-level difficulties, we identify a clear three-tier performance hierarchy and significant limitations particularly in meta-reflection capabilities. While state-of-the-art LLMs demonstrate rudimentary signs of intrinsic agency, our findings suggest several promising research directions including enhancing core cognitive functions, improving cross-functional coordination, and developing adaptive processing mechanisms.
Paperid:1227
Authors:Zixiang Chen · Greg Yang · Qingyue Zhao · Quanquan Gu
Abstract:
Despite deep neural networks' powerful representation learning capabilities, theoretical understanding of how networks can simultaneously achieve meaningful feature learning and global convergence remains elusive. Existing approaches like the neural tangent kernel (NTK) are limited because features stay close to their initialization in this parametrization, leaving open questions about feature properties during substantial evolution. In this paper, we investigate the training dynamics of infinitely wide, $L$-layer neural networks using the tensor program (TP) framework. Specifically, we show that, when trained with stochastic gradient descent (SGD) under the Maximal Update parametrization ($\mu$P) and mild conditions on the activation function, SGD enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum. Our analysis leverages both the interactions among features across layers and the properties of Gaussian random variables, providing new insights into deep representation learning. We further validate our theoretical findings through experiments on real-world datasets.
Paperid:1228
Authors:Yegon Kim · Hyunsu Kim · Gyeonghoon Ko · Juho Lee
Abstract:
Accurately solving partial differential equations (PDEs) is critical to understanding complex scientific and engineering phenomena, yet traditional numerical solvers are computationally expensive. Surrogate models offer a more efficient alternative, but their development is hindered by the cost of generating sufficient training data from numerical solvers. In this paper, we present a novel framework for active learning (AL) in PDE surrogate modeling that reduces this cost. Unlike the existing AL methods for PDEs that always acquire entire PDE trajectories, our approach strategically generates only the most important time steps with the numerical solver, while employing the surrogate model to approximate the remaining steps. This dramatically reduces the cost incurred by each trajectory and thus allows the active learning algorithm to try out a more diverse set of trajectories given the same budget. To accommodate this novel framework, we develop an acquisition function that estimates the utility of a set of time steps by approximating its resulting variance reduction. We demonstrate the effectiveness of our method on several benchmark PDEs, including the Burgers' equation, Korteweg–De Vries equation, Kuramoto–Sivashinsky equation, the incompressible Navier-Stokes equation, and the compressible Navier-Stokes equation. Experiments show that our approach improves performance by large margins, outperforming the best existing method by up to a factor of five. Our method not only reduces average error but also the 99\%, 95\%, and 50\% quantiles of error, which is rare for an AL algorithm. All in all, our approach offers a data-efficient solution to surrogate modeling for PDEs.
Paperid:1229
Authors:Juncheng Dong · Moyang Guo · Ethan Fang · Zhuoran Yang · Vahid Tarokh
Abstract:
Transformer models have achieved remarkable empirical successes, largely due to their in-context learning capabilities. Inspired by this, we explore training an autoregressive transformer for in-context reinforcement learning (ICRL). In this setting, we initially train a transformer on an offline dataset consisting of trajectories collected from various RL tasks, and then fix and use this transformer to create an action policy for new RL tasks. Notably, we consider the setting where the offline dataset contains trajectories sampled from suboptimal behavioral policies. In this case, standard autoregressive training corresponds to imitation learning and results in suboptimal performance. To address this, we propose the Decision Importance Transformer (DIT), which emulates the actor-critic algorithm in an in-context manner. In particular, we first train a transformer-based value function that estimates the advantage functions of the behavior policies that collected the suboptimal trajectories. Then we train a transformer-based policy via a weighted maximum likelihood estimation loss, where the weights are constructed based on the trained value function to steer the suboptimal policies to the optimal ones. We conduct extensive experiments to test the performance of DIT on both bandit and Markov Decision Process problems. Our results show that DIT achieves superior performance, particularly when the offline dataset contains suboptimal historical data.
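The advantage-weighted maximum-likelihood objective described in this abstract can be sketched in a few lines; the softmax policy, temperature, and toy batch are illustrative assumptions, not DIT's transformer-based setup:

```python
import numpy as np

# Hedged sketch of an advantage-weighted maximum-likelihood loss: actions
# with higher estimated advantage receive larger likelihood weights. The
# softmax policy, temperature, and toy batch are illustrative assumptions.

def advantage_weighted_nll(logits, actions, advantages, temperature=1.0):
    """Weighted negative log-likelihood that favors high-advantage actions."""
    weights = np.exp(advantages / temperature)
    weights /= weights.sum()                          # normalize the weights
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    chosen = log_probs[np.arange(len(actions)), actions]
    return -(weights * chosen).sum()

logits = np.zeros((2, 2))                 # uniform policy over 2 actions
actions = np.array([0, 1])
advantages = np.array([0.0, 5.0])         # second transition is much better
loss_uniform = advantage_weighted_nll(logits, actions, advantages)
```

Because the weights concentrate on high-advantage transitions, lowering this loss pushes the policy toward the better actions in the suboptimal dataset rather than imitating it uniformly.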
Paperid:1230
Authors:Jacopo Graldi · Alessandro Breccia · Giulia Lanzillotta · Thomas Hofmann · Lorenzo Noci
Abstract:
Despite recent efforts, neural networks still struggle to learn in non-stationary environments, and our understanding of catastrophic forgetting is far from complete. In this work, we perform a systematic study on the impact of model scale and the degree of feature learning in continual learning. We reconcile existing contradictory observations on scale in the literature by differentiating between lazy and rich training regimes through a variable parametrization of the architecture. We show that increasing model width is only beneficial when it reduces the amount of feature learning, yielding more laziness. Using the framework of dynamical mean field theory, we then study the infinite width dynamics of the model in the feature learning regime and characterize catastrophic forgetting, extending prior theoretical results limited to the lazy regime. We study the intricate relationship between feature learning, task non-stationarity, and forgetting, finding that high feature learning is only beneficial with highly similar tasks. Furthermore, we identify a transition - modulated by task similarity - where the model exits an effectively lazy regime with low forgetting to enter a rich regime with significant forgetting. We term this transition region the Edge of Laziness. Finally, our findings reveal that neural networks achieve optimal performance at a critical level of feature learning, which depends on task non-stationarity and transfers across model scales. This work provides a unified perspective on the role of scale and feature learning in continual learning.
Paperid:1231
Authors:Arya Fayyazi · Mehdi Kamal · Massoud Pedram
Abstract:
We propose FACTER, a fairness-aware framework for LLM-based recommendation systems that integrates conformal prediction with dynamic prompt engineering. By introducing an adaptive semantic variance threshold and a violation-triggered mechanism, FACTER automatically tightens fairness constraints whenever biased patterns emerge. We further develop an adversarial prompt generator that leverages historical violations to reduce repeated demographic biases without retraining the LLM. Empirical results on MovieLens and Amazon show that FACTER substantially reduces fairness violations (up to 95.5%) while maintaining strong recommendation accuracy, revealing semantic variance as a potent proxy of bias.
Paperid:1232
Authors:Harry Mead · Clarissa Costen · Bruno Lacerda · Nick Hawes
Abstract:
When optimising for conditional value at risk (CVaR) using policy gradients (PG), current methods rely on discarding a large proportion of trajectories, resulting in poor sample efficiency. We propose a reformulation of the CVaR optimisation problem by capping the total return of trajectories used in training, rather than simply discarding them, and show that this is equivalent to the original problem if the cap is set appropriately. We show, with empirical results in a number of environments, that this reformulation of the problem results in consistently improved performance compared to baselines.
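The capping idea above can be illustrated with a small numerical sketch (the helper names below are invented for illustration, not the authors' code): when the cap is set at the empirical VaR, the mean of capped returns differs from CVaR only by an additive constant, so every trajectory contributes to the objective and none needs to be discarded.

```python
# Hedged sketch: why capping returns at the alpha-quantile (VaR) is
# equivalent, up to an additive constant, to optimising CVaR directly.
def var_cvar(returns, alpha):
    """Empirical VaR (alpha-quantile) and CVaR (mean of the worst alpha-fraction)."""
    sorted_r = sorted(returns)
    k = max(1, int(alpha * len(sorted_r)))
    var = sorted_r[k - 1]
    cvar = sum(sorted_r[:k]) / k
    return var, cvar

def capped_mean(returns, cap):
    """Mean of capped returns: every trajectory contributes, none is discarded."""
    return sum(min(r, cap) for r in returns) / len(returns)

returns = [float(r) for r in range(1, 101)]  # toy trajectory returns 1..100
alpha = 0.1
var, cvar = var_cvar(returns, alpha)
# Identity: E[min(R, VaR)] = alpha * CVaR + (1 - alpha) * VaR,
# so the capped-mean objective and CVaR share the same maximiser.
assert abs(capped_mean(returns, var) - (alpha * cvar + (1 - alpha) * var)) < 1e-9
```

Because the capped mean averages over all trajectories, its gradient estimate uses every sample, which is the sample-efficiency gain the abstract describes.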
Paperid:1233
Authors:Thanh NguyenTang · Raman Arora
Abstract:
We consider the policy regret minimization problem in a dynamically evolving environment modeled as a Markov game between a learner and a strategic opponent that can adapt to the learner’s strategies. We establish a generic algorithmic framework that attains the optimal $\mathcal{O}(\sqrt{T})$ policy regret rate for a wide class of large-scale problems characterized by Eluder-type conditions, which generalize the tabular environments considered in previous work. Notably, our framework also reveals a much simpler yet effective algorithmic design for dealing with reactive adversaries, highlighting the benefit of opponent learning in reactive environments to achieve $\mathcal{O}(\sqrt{T})$ policy regret.
Paperid:1234
Authors:Peng Jin · Bo Zhu · Li Yuan · Shuicheng YAN
Abstract:
In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to reduce computational costs while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in the summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention by using only 50%$\sim$90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% by utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
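The weighted summation at the heart of MoH can be sketched as follows. The function name, the router logits, and the top-k selection rule are illustrative assumptions in the spirit of MoE routing, not the paper's exact design: each token keeps only its top-k heads and combines them with softmax weights instead of a plain sum.

```python
import math

def moh_combine(head_outputs, router_logits, k):
    """Mixture-of-Head sketch: keep the top-k heads for this token and
    combine their outputs with softmax weights (vs. plain summation)."""
    # Pick the k heads with the largest router scores.
    topk = sorted(range(len(router_logits)), key=lambda h: -router_logits[h])[:k]
    # Softmax over the selected heads only (numerically stabilised).
    m = max(router_logits[h] for h in topk)
    exps = {h: math.exp(router_logits[h] - m) for h in topk}
    z = sum(exps.values())
    # Weighted sum of the selected heads' output vectors.
    dim = len(head_outputs[0])
    out = [0.0] * dim
    for h in topk:
        w = exps[h] / z
        for i in range(dim):
            out[i] += w * head_outputs[h][i]
    return out

# Example: the low-scoring third head is dropped; the two tied heads
# are averaged with weight 0.5 each.
out = moh_combine([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]], [2.0, 2.0, -1.0], k=2)
```

Dropping heads per token is where the inference savings come from; the softmax weights are the "weighted summation" flexibility the abstract mentions.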
Paperid:1235
Authors:Yu Yuan · Shizhao Sun · Qi Liu · Jiang Bian
Abstract:
Computer Aided Design (CAD) is indispensable across various industries. \emph{Text-based CAD editing}, which automates the modification of CAD models based on textual instructions, holds great potential but remains underexplored. Existing methods primarily focus on design variation generation or text-based CAD generation, either lacking support for text-based control or neglecting existing CAD models as constraints. We introduce \emph{CAD-Editor}, the first framework for text-based CAD editing. To address the challenge of demanding triplet data with accurate correspondence for training, we propose an automated data synthesis pipeline. This pipeline utilizes design variation models to generate pairs of original and edited CAD models and employs Large Vision-Language Models (LVLMs) to summarize their differences into editing instructions. To tackle the composite nature of text-based CAD editing, we propose a locate-then-infill framework that decomposes the task into two focused sub-tasks: locating regions requiring modification and infilling these regions with appropriate edits. Large Language Models (LLMs) serve as the backbone for both sub-tasks, leveraging their capabilities in natural language understanding and CAD knowledge. Experiments show that CAD-Editor achieves superior performance both quantitatively and qualitatively.
Paperid:1236
Authors:Xingyu Zhou · Yulian Wu · Wenqian Weng · Francesco Orabona
Abstract:
In this paper, we theoretically study the offline alignment of language models with human preference feedback, under both preference label corruption and privacy protections. To this end, we propose \texttt{Square}\texttt{$\chi$PO}, a variant of \texttt{$\chi$PO} obtained by a simple one-line change: the standard log-loss is replaced by a new square loss over probability. Thanks to the nice properties inherent in this new loss, we advance the state-of-the-art of differentially private and robust alignment. Specifically, for the local model of label privacy, \texttt{Square}\texttt{$\chi$PO} is the first method to attain the optimal rate based on single-policy concentrability even with general function approximations. It also gives the first result under the central model of privacy protection over both prompts (responses) and labels. On the robustness side against Huber label corruption, \texttt{Square}\texttt{$\chi$PO} is the first alignment method with a meaningful theoretical guarantee under general function approximations. More importantly, \texttt{Square}\texttt{$\chi$PO} can address privacy protection and corruption \emph{simultaneously}, where an interesting separation is observed, implying that the order of privacy and corruption matters. Furthermore, we show that \texttt{Square}\texttt{$\chi$PO} can also be easily extended to handle the scenario of the general preference model with state-of-the-art guarantees under corruption and privacy. Last but not least, all of our theoretical guarantees enjoy a unified analysis, building upon a new result on the generalization error bounds of least-square regression under corruption and privacy constraints, which we believe is of independent interest to the community.
Paperid:1237
Authors:Stéphane Aroca-Ouellette · Natalie Mackraz · Barry-John Theobald · Katherine Metcalf
Abstract:
Accommodating human preferences is essential for creating aligned LLM agents that deliver personalized and effective interactions. Recent work has shown the potential for LLMs acting as writing agents to infer a description of user preferences. Agent alignment then comes from conditioning on the inferred preference description. However, existing methods often produce generic preference descriptions that fail to capture the unique and individualized nature of human preferences. This paper introduces PROSE, a method designed to enhance the precision of preference descriptions inferred from user writing samples. PROSE incorporates two key elements: (1) iterative refinement of inferred preferences, and (2) verification of inferred preferences across multiple user writing samples. We evaluate PROSE with several LLMs (i.e., Qwen2.5 7B and 72B Instruct, GPT-4o mini, and GPT-4o) on a summarization and an email writing task. We find that PROSE more accurately infers nuanced human preferences, improving the quality of the writing agent's generations over CIPHER (a state-of-the-art method for inferring preferences) by 33\%. Lastly, we demonstrate that ICL and PROSE are complementary methods, and combining them provides up to a 9\% improvement over ICL alone.
Paperid:1238
Authors:Jumman Hossain · Nirmalya Roy
Abstract:
Reinforcement learning (RL) in environments with asymmetric traversal costs---where actions such as uphill vs. downhill motion incur direction-dependent penalties---remains a critical challenge for real-world applications. While quasimetric RL methods have relaxed symmetry assumptions, they fail to model path-dependent costs or provide safety guarantees. We propose Quasi-Potential Reinforcement Learning (QPRL), a framework that decomposes asymmetric costs into path-independent potentials and path-dependent residuals, enabling Lyapunov-stable policy optimization. Theoretically, we prove QPRL converges to an optimal policy with $\tilde{\mathcal{O}}(\sqrt{T})$ sample complexity, improving over prior quasimetric RL's $\tilde{\mathcal{O}}(T)$ bounds. Algorithmically, QPRL introduces a Lyapunov-based recovery mechanism that reduces irreversible constraint violations by $\mathbf{4\times}$ compared to baselines. Empirically, QPRL achieves state-of-the-art performance in diverse environments. Our results demonstrate that explicitly modeling path-dependent costs via quasi-potential decomposition enables safer, more efficient RL in complex navigation tasks with asymmetric costs.
Paperid:1239
Authors:Thieu Vo · Viet Hoang Tran · Tho Tran Huu · An Nguyen The · Thanh Tran · Minh-Khoi Nguyen-Nhat · Duy-Tung Pham · Tan Nguyen
Abstract:
A neural functional network (NFN) is a specialized type of neural network designed to process and learn from entire neural networks as input data. Recent NFNs have been proposed with permutation and scaling equivariance based on either graph-based message-passing mechanisms or parameter-sharing mechanisms. Compared to graph-based models, parameter-sharing-based NFNs built upon equivariant linear layers exhibit lower memory consumption and faster running time. However, their expressivity is limited due to the large size of the symmetric group of the input neural networks. The challenge of designing a permutation and scaling equivariant NFN that maintains low memory consumption and running time while preserving expressivity remains unresolved. In this paper, we propose a novel solution with the development of MAGEP-NFN (Monomial mAtrix Group Equivariant Polynomial NFN). Our approach follows the parameter-sharing mechanism but differs from previous works by constructing a nonlinear equivariant layer represented as a polynomial in the input weights. This polynomial formulation enables us to incorporate additional relationships between weights from different input hidden layers, enhancing the model's expressivity while keeping memory consumption and running time low, thereby addressing the aforementioned challenge. We provide empirical evidence demonstrating that MAGEP-NFN achieves competitive performance and efficiency compared to existing baselines.
Paperid:1240
Authors:Kelly Ramsay · Jairo Diaz-Rodriguez
Abstract:
Despite the potential of differentially private data visualization to harmonize data analysis and privacy, research in this area remains underdeveloped. Boxplots are a widely popular visualization used for summarizing a dataset and for comparison of multiple datasets. Consequently, we introduce a differentially private boxplot. We evaluate its effectiveness for displaying location, scale, skewness and tails of a given empirical distribution. In our theoretical exposition, we show that the location and scale of the boxplot are estimated with optimal sample complexity, and the skewness and tails are estimated consistently, which is not always the case for a boxplot naively constructed from a single existing differentially private quantile algorithm. As a byproduct of this exposition, we introduce several new results concerning private quantile estimation. In simulations, we show that this boxplot performs similarly to a non-private boxplot, and it outperforms the naive boxplot. Additionally, we conduct a real data analysis of Airbnb listings, which shows that comparable analysis can be achieved through differentially private boxplot visualization.
Paperid:1241
Authors:Edward Milsom · Ben Anson · Laurence Aitchison
Abstract:
We consider layerwise function-space learning rates, which measure the magnitude of the change in a neural network's output function in response to an update to a parameter tensor. This contrasts with traditional learning rates, which describe the magnitude of changes in parameter space. We develop efficient methods to measure and set function-space learning rates in arbitrary neural networks, requiring only minimal computational overhead through a few additional backward passes that can be performed at the start of, or periodically during, training. We demonstrate two key applications: (1) analysing the dynamics of standard neural network optimisers in function space, rather than parameter space, and (2) introducing FLeRM (Function-space Learning Rate Matching), a novel approach to hyperparameter transfer across model scales. FLeRM records function-space learning rates while training a small, cheap base model, then automatically adjusts parameter-space layerwise learning rates when training larger models to maintain consistent function-space updates. FLeRM gives hyperparameter transfer across model width, depth, initialisation scale, and LoRA rank in various architectures including MLPs with residual connections and transformers with different layer normalisation schemes.
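The notion of a function-space learning rate can be illustrated on a toy model. The helper name and the RMS-over-probe-inputs definition below are illustrative assumptions, not FLeRM's actual estimator: the point is that the same parameter-space step can move the output function by very different amounts.

```python
def function_space_change(f, params, delta, xs):
    """Sketch: measure the 'function-space' size of a parameter update
    `delta` as the RMS change in the model's outputs over probe inputs
    `xs` (illustrative definition, not the paper's exact method)."""
    before = [f(params, x) for x in xs]
    after = [f([p + d for p, d in zip(params, delta)], x) for x in xs]
    n = len(xs)
    return (sum((a - b) ** 2 for a, b in zip(after, before)) / n) ** 0.5

# Toy 1-D linear "network": f(x) = w * x.
f = lambda params, x: params[0] * x

# The same parameter-space step (0.1) causes a 10x larger function-space
# change when the inputs are 10x larger:
small = function_space_change(f, [1.0], [0.1], [1.0, -1.0])
large = function_space_change(f, [1.0], [0.1], [10.0, -10.0])
```

Matching this function-space quantity across model scales, rather than the raw parameter-space step size, is the intuition behind the hyperparameter transfer described above.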
Paperid:1242
Authors:Xinyue Chen · Jinfeng Peng · Yuhao Li · Xiaorong Pu · Yang Yang · Yazhou Ren
Abstract:
Recently, federated multi-view clustering (FedMVC) has gained attention for its ability to mine complementary clustering structures from multiple clients without exposing private data. Existing methods mainly focus on addressing the feature heterogeneity problem brought by views on different clients and mitigating it using shared client information. Although these methods have achieved performance improvements, the information they choose to share, such as model parameters or intermediate outputs, inevitably raises privacy concerns. In this paper, we propose an Effective and Secure Federated Multi-view Clustering method, ESFMC, to alleviate the dilemma between privacy protection and performance improvement. This method leverages the information-theoretic perspective to split the features extracted locally by clients, retaining sensitive information locally and only sharing features that are highly relevant to the task. This can be viewed as a form of privacy-preserving information sharing, reducing privacy risks for clients while ensuring that the server can mine high-quality global clustering structures. Additionally, we extend the proposed method to incomplete scenarios by designing a collaborative alignment strategy that ensures the consistency of locally shared information across clients from the global perspective. Theoretical analysis and extensive experiments demonstrate that the proposed method more effectively mitigates the trade-off between privacy protection and performance improvement compared to state-of-the-art methods.
Paperid:1243
Authors:Bo Yue · Jian Li · Guiliang Liu
Abstract:
Optimizing objective functions subject to constraints is fundamental in many real-world applications. However, these constraints are often not readily defined and must be inferred from expert agent behaviors, a problem known as Inverse Constraint Inference. Inverse Constrained Reinforcement Learning (ICRL) is a common solver for recovering feasible constraints in complex environments, relying on training samples collected from interactive environments. However, the efficacy and efficiency of current sampling strategies remain unclear. We propose a strategic exploration framework for sampling with guaranteed efficiency to bridge this gap. By defining the feasible cost set for ICRL problems, we analyze how estimation errors in transition dynamics and the expert policy influence the feasibility of inferred constraints. Based on this analysis, we introduce two exploratory algorithms to achieve efficient constraint inference via 1) dynamically reducing the bounded aggregate error of cost estimations or 2) strategically constraining the exploration policy around plausibly optimal ones. Both algorithms are theoretically grounded with tractable sample complexity, and their performance is validated empirically across various environments.
Paperid:1244
Authors:Yuzhou Gu · Zhao Song · Junze Yin
Abstract:
Softmax distributions are widely used in machine learning, including Large Language Models (LLMs) where the attention unit uses softmax distributions. We abstract the attention unit as the softmax model, where given a vector input, the model produces an output drawn from the softmax distribution (which depends on the vector input). We consider the fundamental problem of binary hypothesis testing in the setting of softmax models. That is, given an unknown softmax model, which is known to be one of the two given softmax models, how many queries are needed to determine which one is the truth? We show that the sample complexity is asymptotically $O(\epsilon^{-2})$, where $\epsilon$ is a certain distance between the parameters of the models. Furthermore, we draw an analogy between the softmax model and the leverage score model, an important tool for algorithm design in linear algebra and graph theory. The leverage score model, on a high level, is a model which, given vector input, produces an output drawn from a distribution dependent on the input. We obtain similar results for the binary hypothesis testing problem for leverage score models.
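A minimal sketch of the testing setup (the function names and the plain log-likelihood-ratio decision rule are assumptions for illustration; the paper's contribution is characterizing how many queries such a test needs): query the unknown model, accumulate the log-likelihood ratio between the two hypotheses, and pick the sign.

```python
import math, random

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    z = sum(e)
    return [v / z for v in e]

def sample(probs, rng):
    """Draw one index from a categorical distribution."""
    u, c = rng.random(), 0.0
    for i, p in enumerate(probs):
        c += p
        if u < c:
            return i
    return len(probs) - 1

def decide(logits_a, logits_b, true_logits, n, seed=0):
    """Query the unknown softmax model n times; return the hypothesis
    with the larger accumulated log-likelihood."""
    rng = random.Random(seed)
    pa, pb = softmax(logits_a), softmax(logits_b)
    pt = softmax(true_logits)
    llr = 0.0
    for _ in range(n):
        x = sample(pt, rng)
        llr += math.log(pa[x]) - math.log(pb[x])
    return "A" if llr > 0 else "B"
```

The farther apart the two parameter vectors are (the $\epsilon$ in the abstract), the fewer queries this kind of test needs, which is the intuition behind the $O(\epsilon^{-2})$ scaling.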
Paperid:1245
Authors:Masahiro Fujisawa · Futoshi Futami
Abstract:
Nonparametric estimation with binning is widely used for evaluating calibration error and recalibrating machine learning models. However, existing theoretical analyses of the bias induced by binning are limited to binary classification, creating a significant gap with practical applications such as multiclass classification. Additionally, many recalibration algorithms lack theoretical guarantees for their generalization performance. To address these issues, we conduct a generalization analysis of calibration error using the probably approximately correct Bayes framework. This approach enables us to derive the first optimizable upper bound for generalization error in the calibration context. On the basis of our theory, we propose a generalization-aware recalibration algorithm. Numerical experiments show that our algorithm enhances the performance of Gaussian process-based recalibration across various benchmark datasets and models.
Paperid:1246
Authors:Fengbin Guan · Xin Li · Zihao Yu · Yiting Lu · Zhibo Chen
Abstract:
In this work, we take the first exploration of the recently popular foundation model, i.e., State Space Model/Mamba, in image quality assessment (IQA), aiming at observing and excavating the perception potential in vision Mamba. A series of works on Mamba has shown its significant potential in various fields, e.g., segmentation and classification. However, the perception capability of Mamba has been underexplored. Consequently, we propose QMamba by revisiting and adapting the Mamba model for three crucial IQA tasks, i.e., task-specific, universal, and transferable IQA, which reveals that the Mamba model has obvious advantages compared with existing foundational models, e.g., Swin Transformer, ViT, and CNNs, in terms of perception and computational cost for IQA. To increase the transferability of QMamba, we propose the StylePrompt tuning paradigm, where the basic lightweight mean and variance prompts are injected to assist the task-adaptive transfer learning of pre-trained QMamba for different downstream IQA tasks. Compared with existing prompt tuning strategies, our proposed StylePrompt enables better perception transfer capability with less computational cost. Extensive experiments on multiple synthetic, authentic IQA datasets, and cross IQA datasets have demonstrated the effectiveness of our proposed QMamba.
Paperid:1247
Authors:Yoon Hyeok Lee · Jaemin Park · Taejin Paik · Doyun Kim · Bosun Hwang
Abstract:
Graph Transformers (GTs) have emerged as a powerful alternative to message-passing neural networks, yet their performance heavily depends on effectively embedding structural inductive biases. In this work, we introduce novel structural encodings (SEs) grounded in a rigorous analysis of random walks (RWs), leveraging Green and Martin kernels that we have carefully redefined for AI applications while preserving their mathematical essence. These kernels capture the long-term behavior of RWs on graphs and allow for enhanced representation of complex topologies, including non-aperiodic and directed acyclic substructures. Empirical evaluations across eight benchmark datasets demonstrate that our method achieves state-of-the-art results in seven tasks, notably in molecular and circuit domains. We attribute this performance boost to the improved ability of our kernel-based SEs to encode intricate structural information, thereby strengthening the global attention and inductive bias within GTs. This work underlines the promise of theoretically grounded, AI-ready kernels in advancing the capability and scope of Transformer architectures for graph-based learning.
Paperid:1248
Authors:Chhavi Yadav · Evan Laufer · Dan Boneh · Kamalika Chaudhuri
Abstract:
In principle, explanations are intended as a way to increase trust in machine learning models and are often obligated by regulations. However, many circumstances where these are demanded are adversarial in nature, meaning the involved parties have misaligned interests and are incentivized to manipulate explanations for their purpose. As a result, explainability methods fail to be operational in such settings despite the demand \cite{bordt2022post}. In this paper, we take a step towards operationalizing explanations in adversarial scenarios with Zero-Knowledge Proofs (ZKPs), a cryptographic primitive. Specifically we explore ZKP-amenable versions of the popular explainability algorithm LIME and evaluate their performance on Neural Networks and Random Forests. Our code is publicly available at: \url{/post/accept}
Paperid:1249
Authors:Tianqing Zhang · Kairong Yu · Qi Xu · Gang Pan · Hongwei Wang
Abstract:
Spiking Neural Networks (SNNs) are increasingly recognized for their biological plausibility and energy efficiency, positioning them as strong alternatives to Artificial Neural Networks (ANNs) in neuromorphic computing applications. SNNs inherently process temporal information by leveraging the precise timing of spikes, but balancing temporal feature utilization with low energy consumption remains a challenge. In this work, we introduce the Temporal Shift module for Spiking Neural Networks (TS-SNN), which incorporates a novel Temporal Shift (TS) module to integrate past, present, and future spike features within a single timestep via a simple yet effective shift operation. A residual combination method prevents information loss by integrating shifted and original features. The TS module is lightweight, requiring only one additional learnable parameter, and can be seamlessly integrated into existing architectures with minimal additional computational cost. TS-SNN achieves state-of-the-art performance on benchmarks like CIFAR-10 (96.72\%), CIFAR-100 (80.28\%), and ImageNet (70.61\%) with fewer timesteps, while maintaining low energy consumption. This work marks a significant step forward in developing efficient and accurate SNN architectures. The code is available on GitHub.
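A rough sketch of a temporal-shift-with-residual operation in plain Python (the `fold` fraction and the fixed blend weight are illustrative assumptions, not the paper's exact TS module): a slice of channels looks one timestep back, another slice looks one timestep ahead, and the result is blended with the original features so no information is lost.

```python
def temporal_shift(x, fold=4, alpha=0.5):
    """Sketch of temporal shift with residual on x[t][c] (T timesteps,
    C channels): the first C//fold channels take the previous timestep,
    the next C//fold take the next timestep, the rest stay in place;
    the shifted tensor is then blended with the original (residual)."""
    T, C = len(x), len(x[0])
    f = C // fold
    shifted = [row[:] for row in x]
    for t in range(T):
        for c in range(f):            # shift from the past (zero-pad at t=0)
            shifted[t][c] = x[t - 1][c] if t > 0 else 0.0
        for c in range(f, 2 * f):     # shift from the future (zero-pad at t=T-1)
            shifted[t][c] = x[t + 1][c] if t < T - 1 else 0.0
    # Residual combination of shifted and original features.
    return [[alpha * shifted[t][c] + (1 - alpha) * x[t][c] for c in range(C)]
            for t in range(T)]

out = temporal_shift([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]])
```

Each timestep's features thus mix past, present, and future information without any extra matrix multiplies, which is why the module adds so little computational cost.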
Paperid:1250
Authors:Nikita Kazeev · Wei Nong · Ignat Romanov · Ruiming Zhu · Andrey Ustyuzhanin · Shuya Yamazaki · Kedar Hippalgaonkar
Abstract:
Symmetry rules that atoms obey when they bond together to form an ordered crystal play a fundamental role in determining their physical, chemical, and electronic properties such as electrical and thermal conductivity, optical and polarization behavior, and mechanical strength. Almost all known crystalline materials have internal symmetry. Consistently generating stable crystal structures is still an open challenge, specifically because such symmetry rules are not accounted for. To address this issue, we propose WyFormer, a generative model for materials conditioned on space group symmetry. We use Wyckoff positions as the basis for an elegant, compressed, and discrete structure representation. To model the distribution, we develop a permutation-invariant autoregressive model based on the Transformer without positional encoding. WyFormer has a unique and powerful synergy of attributes, proven by extensive experimentation: best-in-class symmetry-conditioned generation, physics-motivated inductive bias, competitive stability of the generated structures, competitive material property prediction quality, and unparalleled inference speed.
Paperid:1251
Authors:Tim Vieira · Tianyu Liu · Clemente Pasti · Yahya Emara · Brian DuSell · Benjamin LeBrun · Mario Giulianelli · Juan Luis Gastaldi · John Terilla · Timothy O'Donnell · Ryan Cotterell
Abstract:
Modern language models represent probability distributions over character strings as distributions over (shorter) token strings derived via a deterministic tokenizer, such as byte-pair encoding. While this approach is highly effective at scaling up language models to large corpora, its current incarnations have a concerning property: the model assigns nonzero probability mass to an exponential number of \emph{non-canonical} token encodings of each character string—these are token strings that decode to valid character strings but are impossible under the deterministic tokenizer (i.e., they will never be seen in any training corpus, no matter how large). This misallocation is both erroneous, as non-canonical strings never appear in training data, and wasteful, diverting probability mass away from plausible outputs. These are avoidable mistakes! In this work, we propose methods to enforce canonicality in token-level language models, ensuring that only canonical token strings are assigned positive probability. We present two approaches: (1) canonicality by conditioning, leveraging test-time inference strategies without additional training, and (2) canonicality by construction, a model parameterization that guarantees canonical outputs but requires training. We show that fixing canonicality mistakes improves the likelihood of held-out data for several models and corpora.
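The canonicality condition can be made concrete with a toy deterministic tokenizer (the greedy longest-match vocabulary below is invented for illustration): a token string is canonical exactly when re-encoding its decoded text reproduces it, and enforcing canonicality means assigning zero probability to everything else.

```python
def is_canonical(tokens, encode, decode):
    """A token string is canonical iff re-encoding its decoded text
    reproduces it exactly."""
    return encode(decode(tokens)) == tokens

# Toy deterministic tokenizer: greedy longest-match over a tiny vocab.
VOCAB = ["ab", "a", "b"]  # sorted longest-first so greedy matching works

def encode(s):
    out, i = [], 0
    while i < len(s):
        for tok in VOCAB:  # every char of s is covered by some vocab entry
            if s.startswith(tok, i):
                out.append(tok)
                i += len(tok)
                break
    return out

def decode(tokens):
    return "".join(tokens)

# "ab" has two decodings to the same string, but only one is canonical:
assert is_canonical(["ab"], encode, decode)
assert not is_canonical(["a", "b"], encode, decode)
```

In a real BPE setting the same round-trip check applies, which is what makes the exponential mass on non-canonical encodings described above both detectable and avoidable.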
Paperid:1252
Authors:Kosuke Sugiyama · Masato Uchida
Abstract:
This paper addresses weak features learning (WFL), focusing on learning scenarios characterized by low-quality input features (weak features; WFs) that arise due to missingness, measurement errors, or ambiguous observations. We present a theoretical formalization and error analysis of WFL for continuous WFs (continuous WFL), which has been insufficiently explored in existing literature. A previous study established formalization and error analysis for WFL with discrete WFs (discrete WFL); however, this analysis does not extend to continuous WFs due to the inherent constraints of discreteness. To address this, we propose a theoretical framework specifically designed for continuous WFL, systematically capturing the interactions between feature estimation models for WFs and label prediction models for downstream tasks. Furthermore, we derive the theoretical conditions necessary for both sequential and iterative learning methods to achieve consistency. By integrating the findings of this study on continuous WFL with the existing theory of discrete WFL, we demonstrate that the WFL framework is universally applicable, providing a robust theoretical foundation for learning with low-quality features across diverse application domains.
Paperid:1253
Authors:Jianwei Li · Jung-Eun Kim
Abstract:
Recent studies on the safety alignment of large language models (LLMs) have revealed that existing approaches often operate superficially, leaving models vulnerable to various adversarial attacks. Despite their significance, these studies generally fail to offer actionable solutions beyond data augmentation for achieving more robust safety mechanisms. This paper identifies a fundamental cause of this superficiality: existing alignment approaches often presume that models can implicitly learn a safety-related reasoning task during the alignment process, enabling them to refuse harmful requests. However, the learned safety signals are often diluted by other competing objectives, leading models to struggle with drawing a firm safety-conscious decision boundary when confronted with adversarial attacks. Based on this observation, by explicitly introducing a safety-related binary classification task and integrating its signals with our attention and decoding strategies, we eliminate this ambiguity and allow models to respond more responsibly to malicious queries. We emphasize that, with less than 0.2x overhead cost, our approach enables LLMs to assess the safety of both the query and the previously generated tokens at each necessary generating step. Extensive experiments demonstrate that our method significantly improves the resilience of LLMs against various adversarial attacks, offering a promising pathway toward more robust generative AI systems.
Paperid:1254
Authors:Kangyu Zhu · Peng Xia · Yun Li · Hongtu Zhu · Sheng Wang · Huaxiu Yao
Abstract:
The advancement of Large Vision-Language Models (LVLMs) has propelled their application in the medical field. However, Medical LVLMs (Med-LVLMs) encounter factuality challenges due to modality misalignment, where the models prioritize textual knowledge over visual input, leading to hallucinations that contradict information in medical images. Previous attempts to enhance modality alignment in Med-LVLMs through preference optimization have inadequately addressed clinical relevance in preference data, making these samples easily distinguishable and reducing alignment effectiveness. In response, we propose MMedPO, a novel multimodal medical preference optimization approach that considers the clinical relevance of preference samples to enhance Med-LVLM alignment. MMedPO curates multimodal preference data by introducing two types of dispreference: (1) plausible hallucinations injected through target Med-LVLMs or GPT-4o to produce medically inaccurate responses, and (2) lesion region neglect achieved through local lesion-noising, disrupting visual understanding of critical areas. We then calculate clinical relevance for each sample based on scores from multiple Med-LLMs and visual tools, enabling effective alignment. Our experiments demonstrate that MMedPO significantly enhances factual accuracy in Med-LVLMs, achieving substantial improvements over existing preference optimization methods by 14.2% and 51.7% on the Med-VQA and report generation tasks, respectively.
Paperid:1255
Authors:Jiachen Hu · Rui Ai · Han Zhong · Xiaoyu Chen · Liwei Wang · Zhaoran Wang · Zhuoran Yang
Abstract:
Information asymmetry is a pervasive feature of multi-agent systems, especially evident in economics and social sciences. In these settings, agents tailor their actions based on private information to maximize their rewards. These strategic behaviors often introduce complexities due to confounding variables. Simultaneously, knowledge transportability poses another significant challenge, arising from the difficulties of conducting experiments in target environments. It requires transferring knowledge from environments where empirical data is more readily available. Against these backdrops, this paper explores a fundamental question in online learning: Can we employ non-i.i.d. actions to learn about confounders even when requiring knowledge transfer? We present a sample-efficient algorithm designed to accurately identify system dynamics under information asymmetry and to navigate the challenges of knowledge transfer effectively in reinforcement learning, framed within an online strategic interaction model. Our method provably achieves learning of an $\epsilon$-optimal policy with a tight sample complexity of $\tilde{O}(1/\epsilon^2)$.
Paperid:1256
Authors:Ashkan Soleymani · Behrooz Tahmasebi · Stefanie Jegelka · Patrick Jaillet
Abstract:
We study the statistical-computational trade-offs for learning with exact invariances (or symmetries) using kernel regression. Traditional methods, such as data augmentation, group averaging, canonicalization, and frame-averaging, either fail to provide a polynomial-time solution or are not applicable in the kernel setting. However, with oracle access to the geometric properties of the input space, we propose a polynomial-time algorithm that learns a classifier with \emph{exact} invariances. Moreover, our approach achieves the same excess population risk (or generalization error) as the original kernel regression problem. To the best of our knowledge, this is the first polynomial-time algorithm to achieve exact (not approximate) invariances in this context. Our proof leverages tools from differential geometry, spectral theory, and optimization. A key result in our development is a new reformulation of the problem of learning under invariances as optimizing an infinite number of linearly constrained convex quadratic programs, which may be of independent interest.
Paperid:1257
Authors:Omar Bennouna · Jiawei Zhang · Saurabh Amin · Asuman Ozdaglar
Abstract:
Contextual optimization problems are prevalent in decision-making applications where historical data and contextual features are used to learn predictive models that inform optimal actions. However, practical applications often suffer from model misspecification due to incomplete knowledge of the underlying data-generating process, leading to suboptimal decisions. Existing approaches primarily address the well-specified case, leaving a critical gap in handling misspecified models. In this paper, we propose a novel Integrated Learning and Optimization (ILO) framework that explicitly accounts for model misspecification by introducing a tractable surrogate loss function with strong theoretical guarantees on generalizability, tractability, and optimality. Our surrogate loss aligns with the true decision performance objective, ensuring robustness to misspecification without imposing restrictive assumptions. The proposed approach effectively mitigates the challenges of non-convexity and non-smoothness in the target loss function, leading to efficient optimization procedures. We provide rigorous theoretical analysis and experimental validation, demonstrating superior performance compared to state-of-the-art methods. Our work offers a principled solution to the practically relevant challenge of model misspecification in contextual optimization.
Paperid:1258
Authors:Shengsheng Lin · Haojun Chen · Haijie Wu · Chunyun Qiu · Weiwei Lin
Abstract:
Sufficiently modeling the correlations among variables (aka channels) is crucial for achieving accurate multivariate time series forecasting (MTSF). In this paper, we propose a novel technique called Temporal Query (TQ) to more effectively capture multivariate correlations, thereby improving model performance in MTSF tasks. Technically, the TQ technique consists of periodically shifted learnable vectors that adaptively represent the relationships among variables. Building upon the TQ technique, we introduce a simple yet efficient model named Temporal Query Network (TQNet), which employs only a single-layer attention mechanism and a lightweight multi-layer perceptron (MLP) module. Extensive experiments demonstrate that TQNet learns more robust multivariate correlations, achieving state-of-the-art forecasting accuracy across 12 challenging real-world datasets. Moreover, TQNet achieves high efficiency comparable to linear-based methods even in high-dimensional datasets, balancing performance and computational cost. The anonymous code is available at: https://anonymous.4open.science/r/TQNet-8a4o.
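The TQ mechanism described in this abstract — learnable vectors, periodically shifted per time step and used as attention queries over the series — can be sketched in a few lines. This is a hypothetical reading of the abstract, not the authors' released code; the array shapes, the period `P`, and the scaled dot-product attention form are all assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_query_attention(x, queries, t0=0):
    """Single-layer attention where the query for time step t is the
    learnable vector at index (t0 + t) mod P, i.e. a periodic shift.

    x:       (T, C) window of a multivariate series with C channels.
    queries: (P, C) learnable periodic query vectors (P = period).
    Returns: (T, C) attention-mixed values, one query per time step.
    """
    T, C = x.shape
    P = queries.shape[0]
    q = queries[(t0 + np.arange(T)) % P]   # periodically shifted queries: (T, C)
    scores = q @ x.T / np.sqrt(C)          # query-key scores: (T, T)
    return softmax(scores, axis=-1) @ x    # weighted sum of the series values

rng = np.random.default_rng(0)
x = rng.normal(size=(24, 7))               # 24 time steps, 7 channels
queries = rng.normal(size=(12, 7))         # hypothetical period of 12
out = temporal_query_attention(x, queries, t0=3)
print(out.shape)                           # (24, 7)
```

In TQNet the `queries` array would be a trained parameter; here it is random only to make the sketch self-contained.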
Paperid:1259
Authors:Chengrui Gao · Haopu Shang · Ke Xue · Chao Qian
Abstract:
Machine learning has increasingly been employed to solve NP-hard combinatorial optimization problems, resulting in the emergence of neural solvers that demonstrate remarkable performance, even with minimal domain-specific knowledge. To date, the community has created numerous open-source neural solvers with distinct motivations and inductive biases. While considerable efforts are devoted to designing powerful single solvers, our findings reveal that existing solvers typically demonstrate complementary performance across different problem instances. This suggests that significant improvements could be achieved through effective coordination of neural solvers at the instance level. In this work, we propose the first general framework to coordinate the neural solvers, which involves feature extraction, selection model, and selection strategy, aiming to allocate each instance to the most suitable solvers. To instantiate, we collect several typical neural solvers with state-of-the-art performance as alternatives, and explore various methods for each component of the framework. We evaluate our framework on two typical problems, Traveling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP). Experimental results show that our framework can effectively distribute instances and the resulting composite solver can achieve significantly better performance (e.g., reduce the optimality gap by 0.88% on TSPLIB and 0.71% on CVRPLIB) than the best individual neural solver with little extra time cost.
Paperid:1260
Authors:Rohan Deb · Kiran Thekumparampil · Kousha Kalantari · Gaurush Hiranandani · Shoham Sabach · Branislav Kveton
Abstract:
Supervised fine-tuning (SFT) is a common method for adapting large language models (LLMs) to new domains. In this work, we try to improve the statistical efficiency of SFT by selecting an informative subset of training examples. Specifically, for a fixed budget of training examples, which determines the computational cost of fine-tuning, we want to train on the most informative ones. The key idea in our method is to select examples that maximize the Hessian of the log-likelihood of the LLM. We approximate it efficiently by making a connection to uncertainty modeling in multinomial logistic regression models. Our approach is computationally efficient, analyzable, and performs well empirically. We demonstrate this on several problems, and back our claims with both quantitative results and an LLM evaluation.
Paperid:1261
Authors:Hao Chen · Yujin Han · Fangyi Chen · Xiang Li · Yidong Wang · Jindong Wang · Ze Wang · Zicheng Liu · Difan Zou · Bhiksha Raj
Abstract:
Recent advances in latent diffusion models have demonstrated their effectiveness for high-resolution image synthesis. However, the properties of the tokenizer's latent space that enable better learning and generation in diffusion models remain under-explored. Theoretically and empirically, we find that improved generation quality is closely tied to latent distributions with better structure, such as those with fewer Gaussian mixture modes and more discriminative features. Motivated by these insights, we propose MAETok, an autoencoder (AE) leveraging mask modeling to learn a semantically rich latent space while maintaining reconstruction fidelity. Extensive experiments validate our analysis, demonstrating that the variational form of autoencoders is not necessary, and that a discriminative latent space from an AE alone enables state-of-the-art performance on ImageNet generation using only 128 tokens. MAETok achieves significant practical improvements, enabling a gFID of 1.69 with 76× faster training and 31× higher inference throughput for 512×512 generation. Our findings show that the structure of the latent space, rather than variational constraints, is crucial for effective diffusion models. Code and trained models will be released.
Paperid:1262
Authors:Jing Qiao · Yu Liu · YUAN YUAN · Xiao Zhang · Zhipeng Cai · Dongxiao Yu
Abstract:
This paper examines the theoretical performance of distributed diffusion models in environments where computational resources and data availability vary significantly among workers. Traditional models centered on single-worker scenarios fall short in such distributed settings, particularly when some workers are resource-constrained. This discrepancy in resources and data diversity challenges the assumption of accurate score function estimation foundational to single-worker models. We establish the inaugural generation error bound for distributed diffusion models in resource-limited settings, showing a linear relationship with the data dimension $d$ and consistency with established single-worker results. Our analysis highlights the critical role of hyperparameter selection in influencing the training dynamics, which are key to the performance of model generation. This study provides a streamlined theoretical approach to optimizing distributed diffusion models, paving the way for future research in this area.
Paperid:1263
Authors:Lan Li · Da-Wei Zhou · Han-Jia Ye · De-Chuan Zhan
Abstract:
Domain-Incremental Learning (DIL) focuses on continual learning in non-stationary environments, requiring models to adjust to evolving domains while preserving historical knowledge. DIL faces two critical challenges in the context of imbalanced data: intra-domain class imbalance and cross-domain class distribution shifts. These challenges significantly hinder model performance, as intra-domain imbalance leads to underfitting of few-shot classes, while cross-domain shifts require maintaining well-learned many-shot classes and transferring knowledge to improve few-shot class performance in new domains. To overcome these challenges, we introduce the Dual-Balance Collaborative Experts (DCE) framework. DCE employs a frequency-aware expert group, trained with specialized loss functions to decouple feature learning and balance the representation of many-shot and few-shot classes thus alleviating intra-domain class imbalance. Subsequently, a dynamic expert selector is learned by synthesizing pseudo-features through balanced Gaussian sampling from historical class statistics. This mechanism balances the retention of many-shot classes and promotes cross-domain knowledge transfer, enhancing few-shot class performance in new domains. Extensive experimental results on four benchmark datasets demonstrate DCE’s state-of-the-art performance.
Paperid:1264
Authors:Zihan Song · Xin Wang · Zi Qian · Hong Chen · Longtao Huang · Hui Xue' · Wenwu Zhu
Abstract:
Multimodal Large Language Models (Multimodal LLMs) have shown their strength in Video Question Answering (VideoQA). However, due to the black-box nature of end-to-end training strategies, existing approaches based on Multimodal LLMs suffer from a lack of interpretability for VideoQA: they can neither present reasoning paths nor indicate where the answers are derived from in the video. To address this issue, we propose MSR-ViR (Modularized Self-Reflected Video Reasoner), which for the first time integrates modular networks into Multimodal LLMs, capable of providing VideoQA with explicit reasoning paths for more interpretability. Specifically, a MoST-Grounding (Modularized Spatial-Temporal Grounding) network is proposed to decompose complex questions via tree-structured policies, localizing relevant temporal and spatial segments within videos through step-by-step reasoning. The proposed MoST-Grounding network provides explicit visually grounded information for Multimodal LLMs with clear reasoning paths, thus enhancing interpretability for the predicted answers. To further improve the reasoning quality, we design an Alternate Self-Reflection Training Strategy to jointly optimize policy generation and Multimodal LLMs. Experiments on real-world datasets demonstrate the superiority of our proposed MSR-ViR framework in video understanding, reasoning transparency, and providing explicit localization evidence for answers.
Paperid:1265
Authors:Jan Disselhoff · Daniel Franzen · David Hartmann
Abstract:
The Abstraction and Reasoning Corpus (ARC-AGI) poses a significant challenge for large language models (LLMs), exposing limitations in their abstract reasoning abilities. In this work, we leverage task-specific data augmentations throughout the training, generation, and scoring phases, and employ a depth-first search algorithm to generate diverse, high-probability candidate solutions. Furthermore, we utilize the LLM not only as a generator but also as a scorer, using its output probabilities to select the most promising solutions. Our method achieves a score of 71.6\% (286.5/400 solved tasks) on the public ARC-AGI evaluation set, demonstrating state-of-the-art performance among publicly available approaches. While concurrent closed-source work has reported higher scores, our method distinguishes itself through its transparency, reproducibility, and remarkably low inference cost, averaging only around 2 cents per task on readily available hardware.
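The depth-first candidate generation this abstract describes could look roughly like the following sketch, with a toy stand-in for the LLM's next-token distribution. The function names, the `<eos>` convention, and the log-probability pruning threshold are illustrative assumptions, not the authors' implementation.

```python
import math

def dfs_candidates(next_probs, vocab, max_len, min_logp):
    """Depth-first enumeration of complete sequences whose cumulative
    log-probability stays above min_logp; a branch is pruned as soon as
    it falls below the threshold (sketch of the candidate-generation
    idea; next_probs stands in for an LLM's next-token distribution)."""
    results = []

    def dfs(prefix, logp):
        if logp < min_logp:
            return                                  # prune this whole subtree
        if prefix and prefix[-1] == "<eos>":
            results.append((prefix[:-1], logp))     # complete candidate
            return
        if len(prefix) >= max_len:
            return
        for tok in vocab:
            p = next_probs(prefix).get(tok, 0.0)
            if p > 0.0:
                dfs(prefix + [tok], logp + math.log(p))

    dfs([], 0.0)
    return sorted(results, key=lambda r: -r[1])     # most probable first

# Toy stand-in distribution: always 50/50 between "a" and stopping.
toy = lambda prefix: {"a": 0.5, "<eos>": 0.5}
results = dfs_candidates(toy, vocab=["a", "<eos>"], max_len=4,
                         min_logp=math.log(0.05))
```

With this toy distribution the search returns four candidates, led by the empty sequence at log-probability ln(0.5); scoring the candidates with the LLM itself would then be a separate re-ranking pass.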
Paperid:1266
Authors:Mingyang Wu · Li Lin · Wenbin Zhang · Xin Wang · Zhenhuan Yang · Shu Hu
Abstract:
The Area Under the ROC Curve (AUC) is a key metric for classification, especially under class imbalance, with growing research focus on optimizing AUC over accuracy in applications like medical image analysis and deepfake detection. As a result, fairness in AUC optimization becomes crucial, since biases can impact protected groups. While various fairness mitigation techniques exist, fairness considerations in AUC optimization remain in their early stages, with most research focusing on improving AUC fairness under the assumption of clean protected groups. However, these studies often overlook the impact of noisy protected groups, leading to fairness violations in practice. To address this, we propose the first robust AUC fairness approach under noisy protected groups with theoretical fairness guarantees, using distributionally robust optimization. Extensive experiments on tabular and image datasets show that our method outperforms state-of-the-art approaches in preserving AUC fairness. The code is in the supplementary material.
Paperid:1267
Authors:Dongsu Lee · Minhae Kwon
Abstract:
The goal of offline reinforcement learning (RL) is to extract the best possible policy from a previously collected dataset while accounting for the out-of-distribution (OOD) sample issue. Offline model-based RL (MBRL) is a captivating solution capable of alleviating such issues through \textit{state-action transition augmentation} with a learned dynamics model. Unfortunately, offline MBRL methods have long been observed to fail in sparsely rewarded and long-horizon environments. In this work, we propose a novel MBRL method, dubbed Temporal Distance-Aware Transition Augmentation (TempDATA), that generates additional transitions in a geometrically structured representation space instead of state space. To comprehend long-horizon behaviors efficiently, our main idea is to learn a state abstraction that captures a temporal distance from both trajectory and transition levels of state space. Our experiments empirically confirm that TempDATA outperforms previous offline MBRL methods and matches or surpasses the performance of diffusion-based trajectory augmentation and goal-conditioned RL on D4RL AntMaze, FrankaKitchen, CALVIN, and pixel-based FrankaKitchen.
Paperid:1268
Authors:Anselm Paulus · Arman Zharmagambetov · Chuan Guo · Brandon Amos · Yuandong Tian
Abstract:
Large Language Models (LLMs) are vulnerable to jailbreaking attacks that lead to generation of inappropriate or harmful content. Manual red-teaming requires a time-consuming search for adversarial prompts, whereas automatic adversarial prompt generation often leads to semantically meaningless attacks that do not scale well. In this paper, we present a novel method that uses another LLM, called AdvPrompter, to generate human-readable adversarial prompts in seconds. AdvPrompter, which is trained using an alternating optimization algorithm, generates suffixes that veil the input instruction without changing its meaning, such that the TargetLLM is lured to give a harmful response. Experimental results on popular open-source TargetLLMs show highly competitive results on the AdvBench and HarmBench datasets, which also transfer to closed-source black-box LLMs. We also show that training on adversarial suffixes generated by AdvPrompter is a promising strategy for improving the robustness of LLMs to jailbreaking attacks.
Paperid:1269
Authors:Jie Peng · Hongwei Yang · Jing Zhao · Hengji Dong · Hui He · Weizhe Zhang · Haoyu He
Abstract:
Deep neural networks are vulnerable to backdoor attacks, where malicious behaviors are implanted during training. While existing defenses can effectively purify compromised models, they typically require labeled data or specific training procedures, making them difficult to apply beyond supervised learning settings. Recent studies have shown successful backdoor attacks across various learning paradigms, highlighting a critical security concern. To address this gap, we propose Two-stage Symmetry Connectivity (TSC), a novel backdoor purification defense that operates independently of data format and requires only a small fraction of clean samples. Through theoretical analysis, we prove that by leveraging permutation invariance in neural networks and quadratic mode connectivity, TSC amplifies the loss on poisoned samples while maintaining bounded clean accuracy. Experiments demonstrate that TSC achieves robust performance comparable to state-of-the-art methods in supervised learning scenarios. Furthermore, TSC generalizes to self-supervised learning frameworks, such as SimCLR and CLIP, maintaining its strong defense capabilities.
Paperid:1270
Authors:Yuanhuiyi Lyu · Xu Zheng · Lutao Jiang · Yibo Yan · Xin Zou · Huiyu Zhou · Linfeng Zhang · Xuming Hu
Abstract:
Recent text-to-image generative models, e.g., Stable Diffusion V3 and Flux, have achieved notable progress. However, these models are strongly restricted by their limited knowledge, a.k.a., their own fixed parameters, trained on closed datasets. This leads to significant hallucinations or distortions when facing fine-grained and unseen novel real-world objects, e.g., the appearance of the Tesla Cybertruck. To this end, we present the first real-object-based retrieval-augmented generation framework (RealRAG), which augments fine-grained and unseen novel object generation by learning and retrieving real-world images to overcome the knowledge gaps of generative models. Specifically, to integrate missing memory for unseen novel object generation, we train a reflective retriever by self-reflective contrastive learning, which injects the generator's knowledge into the self-reflective negatives, ensuring that the retrieved augmented images compensate for the model's missing knowledge. Furthermore, the real-object-based framework integrates fine-grained visual knowledge into the generative models, tackling the distortion problem and improving realism for fine-grained object generation. RealRAG is superior in its modular applicability to all types of state-of-the-art text-to-image generative models and also delivers remarkable performance boosts with all of them, such as a gain of 16.18\% in FID score with the auto-regressive model on the Stanford Car benchmark.
Paperid:1271
Authors:Yinhan He · Wendy Zheng · Yushun Dong · Yaochen Zhu · Chen Chen · Jundong Li
Abstract:
Mechanistic interpretability (MI) research aims to understand large language models (LLMs) by identifying computational circuits, subgraphs of model components with associated functional interpretations, that explain specific behaviors. Current MI approaches focus on discovering task-specific circuits, which has two key limitations: (1) poor generalizability across different language tasks, and (2) high costs associated with requiring human or advanced-LLM interpretation of each computational node. To address these challenges, we propose developing a ``modular circuit (MC) vocabulary'' consisting of task-agnostic functional units. Each unit consists of a small computational subgraph with its interpretation. This approach enables global interpretability by allowing different language tasks to share common MCs, while reducing costs by reusing established interpretations for new tasks. We establish five criteria for characterizing the MC vocabulary and present ModCirc, a novel global-level mechanistic interpretability framework for discovering MC vocabularies in LLMs. We demonstrate ModCirc's effectiveness on Med-LLaMA (8B), where it successfully identifies modular circuits that perform well on our proposed quality metrics. Code is available at https://anonymous.4open.science/r/ModCirc-4887/.
Paperid:1272
Authors:Qiangqiang Zhang · Ting Li · Xinwei Feng · Xiaodong Yan · Jinhan Xie
Abstract:
Traditional conformal prediction faces significant challenges with the rise of data streaming and increasing concerns over privacy. In this paper, we introduce a novel online differentially private conformal prediction framework designed to construct dynamic, model-free private prediction sets. Unlike existing approaches that either disregard privacy or require full access to the entire dataset, our proposed method ensures individual privacy with a one-pass algorithm, ideal for real-time, privacy-preserving decision-making. Theoretically, we establish guarantees for long-run coverage at the nominal confidence level. Moreover, we extend our method to conformal quantile regression, which is fully adaptive to heteroscedasticity. We validate the effectiveness and applicability of the proposed method through comprehensive simulations and real-world studies on the ELEC2 and PAMAP2 datasets.
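As a point of reference for the one-pass construction with long-run coverage, a non-private online conformal threshold update (in the style of adaptive conformal inference) can be written in a few lines. This is a simplified sketch under stated assumptions, not the paper's method: the paper's framework additionally enforces differential privacy, which is omitted here, and the uniform scores below are synthetic stand-ins for conformity scores.

```python
import numpy as np

def online_conformal_threshold(scores, alpha=0.1, lr=0.05, q0=0.5):
    """One-pass update of a conformal threshold q_t so that the long-run
    miscoverage rate tracks alpha. Each round: cover if score <= q_t,
    then raise q_t after a miss and lower it slightly after a cover
    (ACI-style sketch; a private variant would add calibrated noise)."""
    q, covered = q0, []
    for s in scores:
        miss = float(s > q)
        covered.append(1.0 - miss)
        q += lr * (miss - alpha)    # gradient step on the pinball loss
    return q, float(np.mean(covered))

rng = np.random.default_rng(1)
scores = rng.uniform(size=5000)     # synthetic stand-in conformity scores
q_final, coverage = online_conformal_threshold(scores, alpha=0.1)
```

On uniform scores the threshold settles near the 0.9 quantile and the empirical coverage hovers around the nominal 90% level, which is the long-run guarantee the abstract refers to.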
Paperid:1273
Authors:Piyush Anand · Piotr Indyk · Ravishankar Krishnaswamy · Sepideh Mahabadi · Vikas Raykar · Kirankumar Shiragur · Haike Xu
Abstract:
Nearest neighbor search is a fundamental data structure problem with many applications in machine learning, computer vision, recommendation systems and other fields. Although the main objective of the data structure is to quickly report data points that are closest to a given query, it has long been noted~\citep{carbonell1998use} that without additional constraints the reported answers can be redundant and/or duplicative. This issue is typically addressed in two stages: in the first stage, the algorithm retrieves a (large) number $r$ of points closest to the query, while in the second stage, the $r$ points are post-processed and a small subset is selected to maximize the desired diversity objective. Although popular, this method suffers from a fundamental efficiency bottleneck, as the set of points retrieved in the first stage often needs to be much larger than the final output. In this paper we present provably efficient algorithms for approximate nearest neighbor search with diversity constraints that bypass this two-stage process. Our algorithms are based on popular graph-based methods, which allows us to ``piggy-back'' on existing efficient implementations. These are the first graph-based algorithms for nearest neighbor search with diversity constraints. For data sets with low intrinsic dimension, our data structures report a diverse set of $k$ points approximately closest to the query, in time that only depends on $k$ and $\log \Delta$, where $\Delta$ is the ratio of the diameter to the closest pair distance in the data set. This bound is qualitatively similar to the best known bounds for standard (non-diverse) graph-based algorithms. Our experiments show that the search time of our algorithms is substantially lower than that of the standard two-stage approach.
Paperid:1274
Authors:Makoto Takamoto · Viktor Zaverkin · Mathias Niepert
Abstract:
Machine learning plays an increasingly important role in computational chemistry and materials science, complementing computationally intensive ab initio and first-principles methods. Despite their utility, machine-learning models often lack generalization capability and robustness during atomistic simulations, yielding unphysical energy and force predictions that hinder their real-world applications. We address this challenge by introducing a physics-informed, weakly supervised approach for training machine-learned interatomic potentials (MLIPs). We introduce two novel loss functions, extrapolating the potential energy via a Taylor expansion and using the concept of conservative forces. Our approach improves the accuracy of MLIPs applied to training tasks with sparse training data sets and reduces the need for pre-training computationally demanding models with large data sets. In particular, we perform extensive experiments demonstrating reduced energy and force errors---often lower by a factor of two---for various baseline models and benchmark data sets. Moreover, we demonstrate improved robustness during MD simulations of MLIP models trained with the proposed weakly supervised loss. Finally, we show that our method can be applied in real-world settings where foundation models are fine-tuned on sparse but highly accurate data. An implementation of our method and scripts for executing experiments are available at \url{https://anonymous.4open.science/r/PICPS-ML4Sci-1E8F}.
Paperid:1275
Authors:Jonathan Richens · Tom Everitt · David Abel
Abstract:
Are world models a necessary ingredient for flexible, goal-directed behaviour, or is model-free learning sufficient? We provide a formal answer to this question, showing that any agent capable of zero-shot generalization must have learned a generative model of its environment. We show that this model can be inferred directly from the agent's policy, and that minimizing regret, particularly for long time-horizon tasks, requires learning increasingly accurate world models. Our findings provide new insights into the foundations of intelligence and agency, establish fundamental bounds on agent capabilities from the learnability of environment dynamics, and highlight the crucial role of world models in achieving safe and general AI.
Paperid:1276
Authors:Ya Jiang · Chuxiong Wu · Massieh Kordi Boroujeny · Brian Mark · Kai Zeng
Abstract:
Watermarking for large language models (LLMs) offers a promising approach to identifying AI-generated text. Existing approaches, however, either compromise the distribution of the original text generated by LLMs or are limited to embedding zero-bit information that only allows for watermark detection but not identification. We present StealthInk, a stealthy multi-bit watermarking scheme that preserves the original text distribution while enabling the embedding of provenance data, such as userID, TimeStamp, and modelID, within LLM-generated text. This enhances fast traceability without requiring access to the language model's API or prompts. We derive a lower bound on the number of tokens necessary for watermark detection at a fixed equal error rate, which provides insights on how to enhance the capacity. Comprehensive empirical evaluations across diverse tasks highlight the stealthiness, detectability, and resilience of StealthInk, establishing it as an effective solution for LLM watermarking applications.
Paperid:1277
Authors:Hao Qiu · Emmanuel Esposito · Mengxiao Zhang
Abstract:
In this work, we study the online convex optimization problem with curved losses and delayed feedback. When losses are strongly convex, existing approaches obtain regret bounds of order $d_{\max} \ln T$, where $d_{\max}$ is the maximum delay and $T$ is the time horizon. However, in many cases, this guarantee can be much worse than $\sqrt{d_{\mathrm{tot}}}$ as obtained by a delayed version of online gradient descent, where $d_{\mathrm{tot}}$ is the total delay. We bridge this gap by proposing a variant of follow-the-regularized-leader that obtains regret of order $\min\{\sigma_{\max}\ln T, \sqrt{d_{\mathrm{tot}}}\}$, where $\sigma_{\max}$ is the maximum number of missing observations. We additionally consider exp-concave losses and extend the Online Newton Step algorithm to handle delays with an adaptive learning rate tuning, achieving regret $\min\{d_{\max} n\ln T, \sqrt{d_{\mathrm{tot}}}\}$, where $n$ is the dimension. To our knowledge, this is the first algorithm to achieve such a regret bound for exp-concave losses. We also design a variant of the Vovk-Azoury-Warmuth forecaster with a clipping trick for online linear regression under delayed feedback. Finally, we provide an experimental validation of our algorithms, showing improved performance over existing methods.
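For concreteness, the delayed online gradient descent baseline that the abstract compares against (each round's gradient only becomes usable once its feedback arrives) can be sketched as follows. The fixed delay, step size, and quadratic loss are illustrative assumptions, not the paper's setting, which allows arbitrary per-round delays.

```python
def delayed_ogd(loss_grad, T, delay, eta, x0=0.0):
    """Online gradient descent under delayed feedback: the gradient
    computed at round t only becomes available, and is applied, at
    round t + delay (fixed-delay sketch of the standard baseline)."""
    x, pending, xs = x0, [], []
    for t in range(T):
        xs.append(x)
        pending.append((t + delay, loss_grad(x)))   # feedback in flight
        while pending and pending[0][0] <= t:       # apply arrived gradients
            _, g = pending.pop(0)
            x -= eta * g
    return xs

# Stale gradients on a fixed quadratic loss f(x) = (x - 1)^2:
xs = delayed_ogd(lambda x: 2.0 * (x - 1.0), T=200, delay=3, eta=0.05)
```

Even with three-round-old gradients the iterates converge to the minimizer at 1, provided the step size is small relative to the delay; the paper's algorithms instead adapt the regularization to the observed delays.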
Paperid:1278
Authors:Luan Yang · Jingdong Zhang · Qunxi Zhu · Wei Lin
Abstract:
Learning-enabled controllers with stability certificate functions have demonstrated impressive empirical performance in addressing control problems in recent years. Nevertheless, directly deploying the neural controllers onto actual digital platforms requires impractically excessive communication resources due to the continuously updating demand of the closed-loop feedback controller. We introduce a framework aimed at learning the event-triggered controller (ETC) with optimal scheduling, i.e., minimal triggering times, to address this challenge in resource-constrained scenarios. Our proposed framework, denoted Neural ETC, includes two practical algorithms: the path integral algorithm based on directly simulating the event-triggered dynamics, and the Monte Carlo algorithm derived from new theoretical results on the lower bound of the inter-event time. Furthermore, we propose a projection operation with an analytical expression that ensures theoretical stability and schedule optimality for Neural ETC. Compared to conventional neural controllers, our empirical results show that Neural ETC significantly reduces the required communication resources while enhancing control performance in constrained-communication scenarios.
Paperid:1279
Authors:Kexin Huang · Ziqian Chen · xue wang · Chongming Gao · Jinyang Gao · Bolin Ding · Xiang Wang
Abstract:
Auctions play a crucial role in many modern trading environments, including online advertising and public resource allocation. As the number of competing bidders increases, learning Bayesian Nash Equilibrium (BNE) in auctions faces significant scalability challenges. Existing methods often experience slow convergence in large-scale auctions. For example, in a classic symmetric auction setting, the convergence rate depends quadratically on the number of bidders. To address this issue, we propose the Approximate Best Response Gradient method, a new approach for learning BNE efficiently in auction games. We leverage an analytic solution for gradient estimation to enable efficient gradient computation during optimization. Moreover, we introduce the Best Response Distance objective, which serves as an upper bound on the approximation quality to BNE. By optimizing the new objective, our method is proven to achieve a local convergence rate independent of the number of bidders, circumventing the traditional quadratic complexity in the classic symmetric setting. Extensive experiments across various auction formats demonstrate that our approach accelerates convergence and enhances learning efficiency in complex auction settings.
Paperid:1280
Authors:Avinash Kori · Francesca Toni · Ben Glocker
Abstract:
Modular object-centric representations are essential for human-like reasoning but are challenging to obtain under spatial ambiguities, e.g., due to occlusions and view ambiguities. Addressing these challenges presents both theoretical and practical difficulties. We introduce a novel multi-view probabilistic approach that aggregates view-specific slots to capture invariant content information while simultaneously learning disentangled global viewpoint-level information. Unlike prior single-view methods, our approach resolves spatial ambiguities, provides theoretical guarantees for identifiability, and requires no viewpoint annotations. Extensive experiments on standard benchmarks and novel complex datasets validate our method's robustness and scalability.
Paperid:1281
Authors:Yi Zhang · Linjun Huang · Yun Yang · Xiaofeng Shao
Abstract:
Conditional mean independence (CMI) testing is crucial for statistical tasks including model determination and variable importance evaluation. In this work, we introduce a novel population CMI measure and a bootstrap-based testing procedure that utilizes deep generative neural networks to estimate the conditional mean functions involved in the population measure. The test statistic is thoughtfully constructed to ensure that even slowly decaying nonparametric estimation errors do not affect the asymptotic accuracy of the test. Our approach demonstrates strong empirical performance in scenarios with high-dimensional covariates and response variables, can handle multivariate responses, and maintains nontrivial power against local alternatives outside an $n^{-1/2}$ neighborhood of the null hypothesis. We also use numerical simulations and real-world imaging data applications to highlight the efficacy and versatility of our testing procedure.
Paperid:1282
Authors:Pablo Samuel Castro · Nenad Tomasev · Ankit Anand · Navodita Sharma · Rishika Mohanta · Aparna Dev · Kuba Perlin · Siddhant Jain · Kyle Levin · Noemi Elteto · Will Dabney · Alexander Novikov · Glenn Turner · Maria Eckstein · Nathaniel Daw · Kevin Miller · Kimberly Stachenfeld
Abstract:
Symbolic models play a key role in cognitive science, expressing computationally precise hypotheses about how the brain implements a cognitive process. Identifying an appropriate model typically requires a great deal of effort and ingenuity on the part of a human scientist. Here, we adapt FunSearch (Romera-Paredes et al., 2024), a recently developed tool that uses Large Language Models (LLMs) in an evolutionary algorithm, to automatically discover symbolic cognitive models that accurately capture human and animal behavior. We consider datasets from three species performing a classic reward-learning task that has been the focus of substantial modeling effort, and find that the discovered programs outperform state-of-the-art cognitive models for each. The discovered programs can readily be interpreted as hypotheses about human and animal cognition, instantiating interpretable symbolic learning and decision-making algorithms. Broadly, these results demonstrate the viability of using LLM-powered program synthesis to propose novel scientific hypotheses regarding mechanisms of human and animal cognition.
Paperid:1283
Authors:Haoyu Wang · Shikun Liu · Rongzhe Wei · Pan Li
Abstract:
Large language models (LLMs) have recently been introduced to graph learning, aiming to extend their zeroshot generalization success to tasks where labeled graph data is scarce. Among these applications, inference over text-attributed graphs (TAGs) presents unique challenges: existing methods struggle with LLMs' limited context length for processing large node neighborhoods and the misalignment between node embeddings and the LLM token space. To address these issues, we establish two key principles for ensuring generalization and derive the framework LLM-BP accordingly: (1) Unifying the attribute space with task-adaptive embeddings, where we leverage LLM-based encoders and task-aware prompting to enhance generalization of the text attribute embeddings; (2) Developing a generalizable graph information aggregation mechanism, for which we adopt belief propagation with LLM-estimated parameters that adapt across graphs. Evaluations on 11 real-world TAG benchmarks demonstrate that LLM-BP significantly outperforms existing approaches, achieving 9.85% improvement with task-conditional embeddings and an additional 1.93% gain from adaptive aggregation.
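To make the aggregation principle concrete, here is a generic belief-propagation-style smoothing step on a small graph. The form (simple neighbor averaging with a single edge-compatibility weight) is an assumption for illustration; in the paper that parameter would be estimated per graph by the LLM, and this is not the LLM-BP implementation.

```python
import numpy as np

def propagate(beliefs, adj, compat=0.8, iters=10):
    """Refine each node's class belief by mixing in its neighbors' beliefs.

    compat plays the role of an edge-compatibility parameter (assumed fixed
    here); beliefs is an (n_nodes, n_classes) array of prior distributions.
    """
    P = beliefs.copy()
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    for _ in range(iters):
        neighbor_msg = adj @ P / deg             # average neighbor belief
        P = (1 - compat) * beliefs + compat * neighbor_msg
        P /= P.sum(axis=1, keepdims=True)        # renormalize to distributions
    return P

# Tiny 4-node path graph: node 0 confidently class A, node 3 confidently B,
# nodes 1 and 2 initially uncertain.
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
prior = np.array([[0.9, 0.1], [0.5, 0.5], [0.5, 0.5], [0.1, 0.9]])
post = propagate(prior, adj)
```

After propagation, node 1 leans toward class A (it sits next to node 0) and node 2 toward class B, which is the homophily-driven behavior the aggregation mechanism exploits.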
Paperid:1284
Authors:Mehmet Yiğit Balık · Maksim Sinelnikov · Priscilla Ong · Harri Lähdesmäki
Abstract:
High-dimensional time-series datasets are common in domains such as healthcare and economics. Variational autoencoder (VAE) models where latent variables are modeled with Gaussian process (GP) prior have become a prominent model class to analyze such correlated datasets. However, their applications are challenged by the inherent cubic time complexity that requires specific GP approximation techniques, as well as the general challenge of modeling both shared and individual-specific correlations across time. In this work, we propose a scalable basis function approximation technique for GP prior VAEs that mitigates these challenges and results in linear time complexity. The proposed model has a global parametrization that eliminates the need for amortized variational inference and the associated amortization gap, making it particularly well-suited for conditional generation tasks where accuracy and efficiency are crucial. Empirical evaluations on synthetic and real-world benchmark datasets demonstrate that our approach not only improves computational efficiency but also drastically enhances predictive performance.
Paperid:1285
Authors:Zhen Sun · Lei Tan · Yunhang Shen · Chengmao Cai · Xing Sun · Pingyang Dai · Liujuan Cao · Rongrong Ji
Abstract:
Multimodal person re-identification (Re-ID) aims to match pedestrian images of the same identity across different modalities. Existing works mainly focus on cross-modal data matching with single-modal input while ignoring arbitrary query and retrieval using multimodal data (including RGB, infrared, sketches, and text information). This oversight significantly constrains the practical application of person Re-ID in real-world scenarios. In this paper, we propose FlexiReID, a flexible modality person Re-ID framework that supports seven different retrieval methods, including text, sketches, RGB, and infrared images, as well as any combination thereof. Specifically, we first propose an adaptive mixture-of-experts (MoE) mechanism to integrate information from diverse modalities effectively. This mechanism dynamically selects and combines outputs from various expert networks, thereby leveraging the strengths of each expert to improve multimodal retrieval performance. Then, we devise a cross-modal query fusion module to optimize the extraction of fused features across different modalities. To the best of our knowledge, this is the first framework to propose flexible multimodal retrieval for person Re-ID. Furthermore, we have expanded the modalities of four multimodal Re-ID datasets (CUHK-PEDES, ICFG-PEDES, RSTPReID, and SYSU-MM01) to construct a unified dataset named CIRS-PEDES, which encompasses four modalities: text, sketches, RGB, and infrared. Extensive experiments on the CIRS-PEDES dataset show that the proposed FlexiReID
Paperid:1286
Authors:Kangjie Chen · Muyang Li · Guanlin Li · Shudong Zhang · Shangwei Guo · Tianwei Zhang
Abstract:
Vision-Language Models (VLMs) have become a cornerstone in multi-modal artificial intelligence, enabling seamless integration of visual and textual information for tasks such as image captioning, visual question answering, and cross-modal retrieval. Despite their impressive capabilities, these models often exhibit inherent vulnerabilities that can lead to safety failures in critical applications. Red-teaming is an important approach to identify and test a system's vulnerabilities, but how to conduct red-teaming for contemporary VLMs remains an unexplored area. In this paper, we propose a novel multi-modal red-teaming approach, TRUST-VLM, to enhance both the attack success rate and the diversity of successful test cases for VLMs. Specifically, TRUST-VLM builds upon in-context learning to adversarially test a VLM on both image and text inputs. Furthermore, we incorporate feedback from the target VLM to improve the efficiency of test case generation. Through extensive experimentation, we demonstrate that TRUST-VLM not only outperforms traditional red-teaming techniques in generating diverse and effective adversarial cases but also provides actionable insights for model improvement. These findings highlight the importance of advanced red-teaming strategies in ensuring the safety, reliability, and fairness of next-generation vision-language systems.
Paperid:1287
Authors:Alessandro Licciardi · Davide Leo · Eros Fanì · Barbara Caputo · Marco Ciccone
Abstract:
Federated Learning (FL) emerged as a decentralized paradigm to train models while preserving privacy. However, conventional FL struggles with data heterogeneity and class imbalance, which degrade model performance. Clustered FL balances personalization and decentralized training by grouping clients with analogous data distributions, enabling improved accuracy while adhering to privacy constraints; this approach effectively mitigates the adverse impact of heterogeneity in FL. In this work, we propose a novel clustering method for FL, FedGWC (Federated Gaussian Weighting Clustering), which groups clients based on their data distribution, allowing training of a more robust and personalized model on the identified clusters. FedGWC identifies homogeneous clusters by transforming individual empirical losses to model client interactions with a Gaussian reward mechanism. Additionally, we introduce the Wasserstein Adjusted Score, a new clustering metric for FL to evaluate cluster cohesion with respect to the individual class distribution. Our experiments on benchmark datasets show that FedGWC outperforms existing FL algorithms in cluster quality and classification accuracy, validating the efficacy of our approach.
Paperid:1288
Authors:Tianzhe Chu · Yuexiang Zhai · Jihan Yang · Shengbang Tong · Saining Xie · Dale Schuurmans · Quoc Le · Sergey Levine · Yi Ma
Abstract:
Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize to out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.
Paperid:1289
Authors:Christopher Wewer · Bartlomiej Pogodzinski · Bernt Schiele · Jan Eric Lenssen
Abstract:
We introduce Spatial Reasoning Models (SRMs), a framework to perform reasoning over sets of continuous variables via denoising generative models. SRMs infer continuous representations on a set of unobserved variables, given observations on observed variables. Current generative models on spatial domains, such as diffusion and flow matching models, often collapse to hallucination in case of complex distributions. To measure this, we introduce a set of benchmark tasks that test the quality of complex reasoning in generative models and can quantify hallucination. The SRM framework allows us to report key findings about the importance of sequentialization in generation, the associated order, as well as the sampling strategies during training. It demonstrates, for the first time, that the order of generation can successfully be predicted by the denoising network itself. Using these findings, we can increase the accuracy of specific reasoning tasks from <1% to >50%.
Paperid:1290
Authors:Bowen Zheng · Tianming Yang
Abstract:
Diffusion models (DMs) generate high-quality samples but are computationally expensive during inference. To address this issue, distillation methods have been developed to train a simpler student model to mimic a complex teacher model. However, existing distillation methods are often computationally demanding, and the student model produces degraded outputs. The problem may be alleviated by adding a GAN objective during distillation, yet the underlying reason remains unclear. Here, we identify a fundamental limitation of diffusion distillation: the teacher and student models, due to different step sizes and parameter counts, converge to distinct local minima, making direct imitation suboptimal. We further show that using a standalone GAN objective without a distillation objective circumvents this problem and is sufficient to convert diffusion models into efficient one-step generators. The efficiency of this method suggests that diffusion training of the original model may be considered generative pre-training, equipping the model with capabilities that can be utilized for one-step generation through fine-tuning with the GAN objective. This view is further supported by a freezing experiment: with 85\% of the parameters frozen during fine-tuning, the original diffusion model condenses into a one-step generator. The resulting model exhibits strong performance with only 0.2M training images (FID = 4.12, CIFAR-10). Scaling up to 5M images yields near-SOTA results across multiple datasets, including CIFAR-10 (FID = 1.54), AFHQv2 64x64 (FID = 1.24), FFHQ 64x64 (FID = 0.85), and ImageNet 64x64 (FID = 1.16). Our findings provide new insights into diffusion training, showing that it functions as a powerful generative pre-training mechanism and challenging the necessity of multi-step distillation for efficient generative modeling.
Paperid:1291
Authors:Fivos Kalogiannis · Emmanouil-Vasileios Vlatakis-Gkaragkounis · Ian Gemp · Georgios Piliouras
Abstract:
We provide the first provable guarantees of global convergence to Nash equilibria (NE) in two-player zero-sum convex Markov games (cMGs) using independent policy gradient methods. Convex Markov games, recently described by Gemp et al. (2024), extend Markov decision processes to multi-agent settings with convex preferences over occupancy measures, offering a broad framework for modeling generic strategic interactions. However, even the fundamental min-max case of cMGs presents significant challenges, including inherent nonconvexity even under direct policy parametrization, the absence of Bellman equations, and the complexities of infinite horizons. Our results follow a two-step approach. First, leveraging the properties of hidden-convex–hidden-concave functions, we show that a carefully crafted regularization transforms the underlying min-max optimization problem into a nonconvex–proximal Polyak-Lojasiewicz (NC-pPL) structure. Crucially, this regularization can stabilize the iterates of independent policy gradient methods and ultimately lead them to converge to equilibria. Second, building on this reduction, we address general constrained min-max problems under NC-pPL and two-sided pPL conditions, providing the first global convergence guarantees for stochastic nested and alternating gradient descent-ascent methods, which we believe may be of independent interest.
Paperid:1292
Authors:Siqi Kou · Jiachun Jin · Zhihong Liu · Chang Liu · Ye Ma · jian jia · Quan Chen · Peng Jiang · Zhijie Deng
Abstract:
We introduce Orthus, a unified multimodal model that excels in generating interleaved images and text from mixed-modality inputs by simultaneously handling discrete text tokens and continuous image features under the \textbf{AR} modeling principle. The continuous treatment of visual signals minimizes the information loss while the fully AR formulation renders the characterization of the correlation between modalities straightforward. Orthus leverages these advantages through its modality-specific heads---one regular language modeling (LM) head predicts discrete text tokens and one diffusion head generates continuous image features. We devise an efficient strategy for building Orthus---by substituting the Vector Quantization (VQ) operation in the existing unified AR model with a soft alternative, introducing a diffusion head, and tuning the added modules to reconstruct images, we can create an Orthus-base model effortlessly (e.g., within 72 A100 GPU hours). Orthus-base can further embrace post-training to craft lengthy interleaved image-text, reflecting the potential for handling intricate real-world tasks. For visual understanding and generation, Orthus achieves a GenEval score of 0.58 and an MME-P score of 1265.8 using 7B parameters, outperforming competing baselines including Show-o and Chameleon.
Paperid:1293
Authors:Xinyu Zhao · Fangcong Yin · Greg Durrett
Abstract:
Long-context LLMs are increasingly in demand for applications such as retrieval-augmented generation. To defray the cost of pretraining LLMs over long contexts, recent work takes an approach of synthetic context extension: fine-tuning LLMs with synthetically-generated long-context data. However, it remains unclear how and why this synthetic context extension imparts abilities for downstream long-context tasks. In this paper, we investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning. We vary the realism of "needle'' concepts to be retrieved and diversity of the surrounding "haystack'' context, from using LLMs to construct synthetic documents to using templated relations and creating symbolic datasets. Although models trained on synthetic data underperform models trained on the real data, the impacts of both training settings can be understood via a shared feature of the attention computation, retrieval heads (Wu et al., 2024). The retrieval heads learned from synthetic data have high overlap with retrieval heads learned on real data. Furthermore, there is a strong correlation between the recall of heads learned and the downstream performance of a model, allowing us to interpret and predict the performance of models trained in different settings. Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world LLM capabilities over long contexts.
Paperid:1294
Authors:Robert Geirhos · Priyank Jaini · Austin Stone · Sourabh Medapati · Xi Yi · George Toderici · Abhijit Ogale · Jonathon Shlens
Abstract:
Training a neural network is a monolithic endeavor, akin to carving knowledge into stone: once the process is completed, editing the knowledge in a network is nearly impossible, since all information is distributed across the network's weights. We here explore a simple, compelling alternative by marrying the representational power of deep neural networks with the flexibility of a database. Decomposing the task of image classification into image similarity (from a pretrained embedding) and search (via fast nearest neighbor retrieval from a knowledge database), we build a simple and flexible visual memory that has the following key capabilities: (1.) The ability to flexibly add data across scales: from individual samples all the way to entire classes and billion-scale data; (2.) The ability to remove data through unlearning and memory pruning; (3.) An interpretable decision-mechanism on which we can intervene to control its behavior. Taken together, these capabilities comprehensively demonstrate the benefits of an explicit visual memory. We hope that it might contribute to a conversation on how knowledge should be represented in deep vision models---beyond carving it in "stone" weights.
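The database-backed classifier described here can be captured in a few lines: classification becomes nearest-neighbor voting over an editable store of embeddings. The sketch below is an assumed minimal interface for illustration (synthetic 8-dimensional "embeddings", cosine similarity, majority vote), not the paper's system.

```python
import numpy as np

class VisualMemory:
    """Toy visual memory: kNN over unit-normalized embeddings."""

    def __init__(self):
        self.embs, self.labels = [], []

    def add(self, emb, label):
        """Flexibly add data, one sample at a time."""
        self.embs.append(emb / np.linalg.norm(emb))
        self.labels.append(label)

    def remove(self, label):
        """Unlearning: drop every stored entry of a class."""
        keep = [i for i, l in enumerate(self.labels) if l != label]
        self.embs = [self.embs[i] for i in keep]
        self.labels = [self.labels[i] for i in keep]

    def classify(self, emb, k=3):
        """Interpretable decision: majority vote of the k nearest memories."""
        emb = emb / np.linalg.norm(emb)
        sims = np.array([e @ emb for e in self.embs])
        votes = [self.labels[i] for i in sims.argsort()[-k:]]
        return max(set(votes), key=votes.count)

mem = VisualMemory()
rng = np.random.default_rng(0)
for _ in range(20):  # two synthetic clusters standing in for class embeddings
    mem.add(rng.normal(size=8) + np.array([3, 0, 0, 0, 0, 0, 0, 0]), "cat")
    mem.add(rng.normal(size=8) + np.array([0, 3, 0, 0, 0, 0, 0, 0]), "dog")
pred = mem.classify(np.array([3.0, 0, 0, 0, 0, 0, 0, 0]))
```

Because the decision is a vote over retrieved neighbors, one can inspect exactly which stored samples produced it, and calling `remove` makes a class genuinely unlearnable without touching any weights.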
Paperid:1295
Authors:Qinwen Luo · Ming-Kun Xie · Ye-Wen Wang · Sheng-Jun Huang
Abstract:
Offline reinforcement learning (RL) aims to learn an effective policy from a static dataset. To alleviate extrapolation errors, existing studies often uniformly regularize the value function or policy updates across all states. However, due to substantial variations in data quality, the fixed regularization strength often leads to a dilemma: Weak regularization strength fails to address extrapolation errors and value overestimation, while strong regularization strength shifts policy learning toward behavior cloning, impeding potential performance enabled by Bellman updates. To address this issue, we propose the selective state-adaptive regularization method for offline RL. On the one hand, we propose to trust state-level Bellman-driven results by obtaining state-adaptive regularization coefficients in a learning manner; on the other hand, we selectively perform regularization on a subset of data consisting of high-quality actions, aiming to avoid performance degradation caused by tight constraints on low-quality actions. By establishing a connection between the representative value regularization method, CQL, and explicit policy constraint methods, we effectively extend selective state-adaptive regularization to these two mainstream offline RL approaches. Extensive experiments demonstrate that the proposed method significantly outperforms the state-of-the-art approaches in both offline and offline-to-online settings on the D4RL benchmark.
Paperid:1296
Authors:Amir Najafi · Samin Sani · Farzan Farnia
Abstract:
Consider a network of clients with private, non-IID local datasets governed by an unknown meta-distribution. A central server aims to evaluate the average performance of a given ML model, not only on this network (standard beta or A/B testing) but also on all possible unseen networks that are meta-distributionally similar to it, as measured by either $f$-divergence or Wasserstein distance. To this end, we propose a novel robust optimization framework that can be implemented in a private and federated manner with at most polynomial time and query complexity. Specifically, we introduce a private server-side aggregation technique for local adversarial risks, which provably enables a global robust evaluation of the model. We also establish asymptotically minimax-optimal bounds for the risk average and CDF, with vanishing generalization gaps as the source network size $K$ grows and the minimum local dataset size exceeds $\mathcal{O}\left(\log K\right)$. Empirical results further validate the effectiveness of our bounds in real-world tasks.
Paperid:1297
Authors:Fanqing Meng · Jiaqi Liao · Xinyu Tan · WENQI SHAO · Quanfeng Lu · Kaipeng Zhang · Yu Cheng · Dianqi Li · Ping Luo
Abstract:
Text-to-video (T2V) models like Sora have made significant strides in visualizing complex prompts, which is increasingly viewed as a promising path towards constructing the universal world simulator. Cognitive psychologists believe that the foundation for achieving this goal is the ability to understand intuitive physics. However, the capacity of these models to accurately represent intuitive physics remains largely unexplored. To bridge this gap, we introduce PhyGenBench, a comprehensive \textbf{Phy}sics \textbf{Gen}eration \textbf{Ben}chmark designed to evaluate physical commonsense correctness in T2V generation. PhyGenBench comprises 160 carefully crafted prompts across 27 distinct physical laws, spanning four fundamental domains, comprehensively assessing models' understanding of physical commonsense. Alongside PhyGenBench, we propose a novel evaluation framework called PhyGenEval. This framework employs a hierarchical evaluation structure utilizing appropriate advanced vision-language models and large language models to assess physical commonsense. Through PhyGenBench and PhyGenEval, we can conduct large-scale automated assessments of T2V models' understanding of physical commonsense, which align closely with human feedback. Our evaluation results and in-depth analysis demonstrate that current models struggle to generate videos that comply with physical commonsense. Moreover, simply scaling up models or employing prompt engineering techniques is insufficient to fully address the challenges presented by PhyGenBench (e.g., dynamic scenarios). We hope this study will inspire the community to prioritize the learning of physical commonsense in these models beyond entertainment applications. We will release the data and codes at https://github.com/PhyGenBench/PhyGenBench
Paperid:1298
Authors:Penghao Wu · Lewei Lu · Ziwei Liu
Abstract:
Large multimodal models excel in multimodal tasks but face significant computational challenges due to excessive visual tokens. Unlike token reduction methods that focus on token-level redundancy, we identify and study the computation-level redundancy on vision tokens to ensure no information loss. Our key insight is that vision tokens from the pretrained vision encoder do not necessarily require all the heavy operations (e.g., self-attention, FFNs) in decoder-only LMMs and could be processed more lightly with proper designs. We designed a series of experiments to discover and progressively squeeze out the vision-related computation redundancy. Based on our findings, we propose ProxyV, a novel approach that utilizes proxy vision tokens to alleviate the computational burden on original vision tokens. ProxyV enhances efficiency without compromising performance and can even yield notable performance gains in scenarios with more moderate efficiency improvements. Furthermore, the flexibility of ProxyV is demonstrated through its combination with token reduction methods to boost efficiency further.
Paperid:1299
Authors:Bianca Marin Moreno · Khaled Eldowa · Pierre Gaillard · Margaux Brégère · Nadia Oudjane
Abstract:
We study online learning in episodic finite-horizon Markov decision processes (MDPs) with convex objective functions, known as the concave utility reinforcement learning (CURL) problem. This setting generalizes RL from linear to convex losses on the state-action distribution induced by the agent’s policy. The non-linearity of CURL invalidates classical Bellman equations and requires new algorithmic approaches. We introduce the first algorithm achieving near-optimal regret bounds for online CURL without any prior knowledge on the transition function. To achieve this, we use a novel online mirror descent algorithm with variable constraint sets and a carefully designed exploration bonus. We then address for the first time a bandit version of CURL, where the only feedback is the value of the objective function on the state-action distribution induced by the agent's policy. We achieve a sub-linear regret bound for this more challenging problem by adapting techniques from bandit convex optimization to the MDP setting.
Paperid:1300
Authors:Roman Klypa · Alberto Bietti · Sergei Grudinin
Abstract:
Designing RNA molecules that interact with specific proteins is a critical challenge in experimental and computational biology. Existing computational approaches require a substantial amount of experimentally determined RNA sequences for each specific protein or a detailed knowledge of RNA structure, restricting their utility in practice. To address this limitation, we develop RNABAnG, a deep learning-based model designed to generate RNA sequences for protein interactions without these requirements. Central to our approach is a novel generative method, Bidirectional Anchored Generation (BAnG), which leverages the observation that protein-binding RNA sequences often contain functional binding motifs embedded within broader sequence contexts. We first validate our method on generic synthetic tasks involving similar localized motifs to those appearing in RNAs, demonstrating its benefits over existing generative approaches. We then evaluate our model on biological sequences, showing its effectiveness for conditional RNA sequence design given a binding protein.
Paperid:1301
Authors:Rohit Sonker · Alexandre Capone · Andrew Rothstein · Hiro Kaga · Egemen Kolemen · Jeff Schneider
Abstract:
Machine learning algorithms often struggle to control complex real-world systems. In the case of nuclear fusion, these challenges are exacerbated, as the dynamics are notoriously complex, data is poor, hardware is subject to failures, and experiments often affect dynamics beyond the experiment's duration. Existing tools like reinforcement learning, supervised learning, and Bayesian optimization address some of these challenges but fail to provide a comprehensive solution. To overcome these limitations, we present a multi-scale Bayesian optimization approach that integrates a high-frequency data-driven dynamics model with a low-frequency Gaussian process. By updating the Gaussian process between experiments, the method rapidly adapts to new data, refining the predictions of the less reliable dynamical model. We validate our approach by controlling tearing instabilities in the DIII-D nuclear fusion plant. Offline testing on historical data shows that our method significantly outperforms several baselines. Live experiments on the DIII-D tokamak, conducted under high-performance plasma scenarios prone to instabilities, show a 50\% success rate, an improvement of 117\% compared to historical outcomes.
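The core multi-scale idea, correcting a fast but biased dynamics model with a GP fit to residuals observed between experiments, can be sketched on a one-dimensional toy problem. Everything below (the RBF kernel, the synthetic `f_true`/`f_model` pair, the constant bias) is an illustrative assumption, not the DIII-D setup.

```python
import numpy as np

def rbf(X1, X2, ell=1.0):
    """Squared-exponential kernel between two point sets."""
    d = X1[:, None, :] - X2[None, :, :]
    return np.exp(-0.5 * np.sum(d**2, axis=-1) / ell**2)

def gp_residual_mean(X_train, r_train, X_query, noise=1e-4):
    """GP posterior mean of the residual at the query points."""
    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    return rbf(X_query, X_train) @ np.linalg.solve(K, r_train)

f_true = lambda x: np.sin(3 * x[:, 0])           # unknown true response
f_model = lambda x: np.sin(3 * x[:, 0]) - 0.3    # fast model with constant bias

X = np.linspace(0, 1, 8)[:, None]                # past experiment settings
residuals = f_true(X) - f_model(X)               # observed model error

# Between experiments: correct the fast model's prediction with the GP.
Xq = np.array([[0.5]])
corrected = f_model(Xq) + gp_residual_mean(X, residuals, Xq)
```

The corrected prediction lands close to the true response even though the fast model alone is systematically off, which is the mechanism letting the low-frequency GP "refine the predictions of the less reliable dynamical model".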
Paperid:1302
Authors:Anhao Zhao · Fanghua Ye · Yingqi Fan · Junlong Tong · Jing Xiong · Zhiwei Fei · Hui Su · Xiaoyu Shen
Abstract:
Large language models (LLMs) excel in various tasks but are hindered by high computational costs due to their deep and layer-wise structure. Traditional layer pruning methods are static and fail to adapt to the varying computational needs of different tokens. Recent dynamic pruning methods attempt to address this but still encounter three main challenges: (1) Horizontal Dynamics – rigid token-wise or layer-wise compute allocation lacks global flexibility; (2) Vertical Dynamics – pruning entire layers without decoupling MLP and self-attention components ignores their unique roles; (3) Training Paradigm – jointly training routers and model parameters overlooks the differences between pruning and pretraining. To address these challenges, we introduce SkipGPT, a novel dynamic pruning framework that adapts to token complexity. SkipGPT redefines sparsity by globally optimizing compute allocation, decouples MLP and self-attention for selective pruning of components based on their contribution to inference, and adopts a two-stage training paradigm, first optimizing the routing mechanism independently and then applying LoRA fine-tuning to recover lost capacity. Experiments demonstrate that SkipGPT can fully restore or even surpass the original model's performance despite a 25\% reduction in parameters. By bridging the gap between efficiency and expressivity, SkipGPT sets a new direction for scalable LLM inference.
Paperid:1303
Authors:Jinbin Zhang · Nasib Ullah · Erik Schultheis · Rohit Babbar
Abstract:
Large output spaces, also referred to as extreme multi-label classification (XMC), arise, e.g., in large-scale tagging and product-to-product recommendation, and are characterized by label counts ranging from hundreds of thousands to millions. This means that the linear classification head, usually only a tiny fraction of the overall model, turns into the main driver for compute and memory demand. Current state-of-the-art XMC methods predominantly rely on FP16-FP32 mixed-precision training (MPT), which we show can be inefficient in terms of memory usage, computational overhead, and stability. Meanwhile, existing low-precision methods typically retain higher precision for the classification layer. In this work, we propose a pure low-precision training framework for XMC models using BFLOAT16 and FLOAT8 data types. By leveraging Kahan summation and stochastic rounding, we demonstrate that XMC models can be effectively trained entirely in FLOAT8, without relying on single-precision master weights or tensor scaling. Low-precision training, combined with our proposed memory optimizations (gradient fusion and chunking), enables significant reductions in GPU memory usage. For example, we train a 3-million-label XMC model with only 6.6 GiB of GPU memory, compared to the 39.7 GiB required by the optimized SOTA method Renee. Notably, these substantial efficiency gains do not come at the expense of performance: our method outperforms and competes with existing XMC baselines on most public datasets, achieving the non-trivial goal of improving performance under purely low-precision constraints.
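Why Kahan summation matters for low-precision weights can be shown in a few lines: many small optimizer updates are individually below the resolution of a low-precision weight and vanish under naive accumulation, while a higher-precision compensation buffer recovers them. The float16 demo below is an illustrative analogue (the paper targets BFLOAT16/FLOAT8 kernels), not the authors' implementation.

```python
import numpy as np

w_naive = np.float16(1.0)      # weight updated naively in fp16
w_kahan = np.float16(1.0)      # weight updated with Kahan compensation
comp = np.float32(0.0)         # fp32 buffer carrying the lost low bits
step = np.float32(1e-4)        # update smaller than the fp16 ulp near 1.0

for _ in range(10000):         # true total increment: 10000 * 1e-4 = 1.0
    # Naive: 1.0 + 1e-4 rounds back to 1.0 in fp16, so every update is lost.
    w_naive = np.float16(w_naive + step)

    # Kahan: accumulate the rounding error in `comp` and feed it back in,
    # so the small steps eventually cross the fp16 quantization boundary.
    y = step - comp
    t = np.float16(np.float32(w_kahan) + y)
    comp = (np.float32(t) - np.float32(w_kahan)) - y
    w_kahan = t
```

After the loop, the naive weight is still exactly 1.0 while the Kahan-compensated weight sits at roughly 2.0 (up to fp16 resolution), which is the effect that makes master-weight-free low-precision training viable.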
Paperid:1304
Authors:Yuchen Zeng · Tuan Dinh · Wonjun Kang · Andreas Mueller
Abstract:
Leveraging the in-context learning (ICL) capability of Large Language Models (LLMs) for tabular classification has gained significant attention for its training-free adaptability across diverse datasets. Recent advancements, like TabPFN, excel on small-scale tabular datasets but struggle to scale to large and complex datasets. Our work enhances the efficiency and scalability of TabPFN for larger datasets by incorporating linear attention mechanisms as a scalable alternative to quadratic-complexity self-attention. Our model, TabFlex, efficiently handles tabular datasets with thousands of features and hundreds of classes, scaling seamlessly to millions of samples. For instance, TabFlex processes the poker-hand dataset with over a million samples in just 5 seconds. Our extensive evaluations demonstrate that TabFlex can achieve over a 2× speedup compared to TabPFN and a 1.5× speedup over XGBoost, outperforming 25 tested baselines in terms of efficiency across a diverse range of datasets. Furthermore, TabFlex remains highly effective on large-scale datasets, delivering strong performance with significantly reduced computational costs, especially when combined with data-efficient techniques such as dimensionality reduction and data sampling.
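The quadratic-to-linear trick referenced here relies on a kernel feature map so that attention can be computed by associativity in O(n) instead of forming the n×n score matrix. A generic linear-attention sketch (not TabFlex's actual architecture), using the ELU+1 feature map common in the linear-attention literature:

```python
import numpy as np

def phi(x):
    # ELU(x) + 1: a positive-valued feature map commonly used in linear attention
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """O(n) attention: associativity lets us form the small (d, d_v) summary
    phi(K)^T V once, instead of the (n, n) score matrix."""
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V             # (d, d_v) summary, independent of sequence length
    z = Qf @ Kf.sum(axis=0)   # (n,) per-query normalizer
    return (Qf @ kv) / z[:, None]

def quadratic_reference(Q, K, V):
    """The same kernel attention computed the O(n^2) way, for checking."""
    W = phi(Q) @ phi(K).T
    W = W / W.sum(axis=1, keepdims=True)
    return W @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 8)) for _ in range(3))
```

Both routes give identical outputs up to floating-point error; the saving is that `kv` and `z` cost O(n·d·d_v) rather than O(n²) in the sequence length.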
Paperid:1305
Authors:Tao Jin · Yue Wu · Quanquan Gu · Farzad Farnoud
Abstract:
We study the problem of efficiently aggregating the preferences of items from multiple information sources (oracles) and inferring the ranking under both the weak stochastic transitivity (WST) and the strong stochastic transitivity (SST) conditions. When the underlying preference model satisfies the WST condition, we propose an algorithm named RMO-WST, which has a bi-level design: at the higher level, it actively allocates comparison budgets to all undetermined pairs until the full ranking is recovered; at the lower level, it attempts to compare the pair of items and selects the more accurate oracles simultaneously. We prove that the sample complexity of RMO-WST is $\tilde O( N\sum_{i=2}^{N}H_{\sigma^{-1}(i),\sigma^{-1}(i-1)} )$, where $N$ is the number of items to rank, $H$ is a problem-dependent hardness factor, and $\sigma^{-1}(i)$ represents the $i$-th best item. We also provide a tight lower bound that matches the upper bound of approximate ranking under the WST condition, answering a previously open problem. In addition, when the SST condition is satisfied, we propose an algorithm named RMO-SST, which can achieve an $\tilde{O}(\sum_{i=1}^{N} H_i \log(N))$ sample complexity. This outperforms the best-known sample complexity by a factor of $\log(N)$. The theoretical advantages of our algorithms are verified by empirical experiments in a simulated environment.
Paperid:1306
Authors:Alon Beck · Noam Levi · Yohai Bar-Sinai
Abstract:
We investigate the phenomenon of grokking (delayed generalization accompanied by non-monotonic test loss behavior) in a simple binary logistic classification task, for which "memorizing" and "generalizing" solutions can be strictly defined. Surprisingly, we find that grokking arises naturally even in this minimal model when the parameters of the problem are close to a critical point, and provide both empirical and analytical insights into its mechanism. Concretely, by appealing to the implicit bias of gradient descent, we show that when the training dataset is highly unbalanced, as well as on the verge of being linearly separable from the origin, logistic regression can exhibit grokking. The underlying reason is that near the critical point, "flat" directions in the loss landscape with nearly zero gradient cause training dynamics to linger for arbitrarily long times near quasi-stable solutions before eventually reaching the global minimum. Finally, we highlight similarities between our findings and the recent literature, strengthening the conjecture that grokking generally occurs in proximity to the interpolation threshold, reminiscent of critical phenomena often observed in physical systems.
Paperid:1307
Authors:Juan L. Gamella · Simon Bing · Jakob Runge
Abstract:
We evaluate methods for causal representation learning (CRL) on a simple, real-world system where these methods are expected to work. The system consists of a controlled optical experiment specifically built for this purpose, which satisfies the core assumptions of CRL and where the underlying causal factors---the inputs to the experiment---are known, providing a ground truth. We select methods representative of different approaches to CRL and find that they all fail to recover the underlying causal factors. To understand the failure modes of the evaluated algorithms, we perform an ablation on the data by substituting the real data-generating process with a simpler synthetic equivalent. The results reveal a reproducibility problem, as most methods already fail on this synthetic ablation despite its simple data-generating process. Additionally, we observe that common assumptions on the mixing function are crucial for the performance of some of the methods but do not hold in the real data. Our efforts highlight the contrast between the theoretical promise of the state of the art and the challenges in its application. We hope the benchmark serves as a simple, real-world sanity check to further develop and validate methodology, bridging the gap towards CRL methods that work in practice. We make all code and datasets publicly available.
Paperid:1308
Authors:Xinyu Liu · Zixuan Xie · Shangtong Zhang
Abstract:
$Q$-learning is one of the most fundamental reinforcement learning algorithms. Previously, it was widely believed that $Q$-learning with linear function approximation (i.e., linear $Q$-learning) suffers from possible divergence. This paper instead establishes the first $L^2$ convergence rate of linear $Q$-learning to a bounded set. Notably, we do not make any modification to the original linear $Q$-learning algorithm, do not make any Bellman completeness assumption, and do not make any near-optimality assumption on the behavior policy. All we need is an $\epsilon$-softmax behavior policy with an adaptive temperature. The key to our analysis is a general result on stochastic approximation under Markovian noise with fast-changing transition functions. As a side product, we also use this general result to establish the $L^2$ convergence rate of tabular $Q$-learning with an $\epsilon$-softmax behavior policy, for which we rely on a novel pseudo-contraction property of the weighted Bellman optimality operator.
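The $\epsilon$-softmax behavior policy the analysis relies on mixes uniform exploration with a softmax over the current Q-estimates. A minimal sketch (with a fixed temperature rather than the paper's adaptive schedule):

```python
import numpy as np

def eps_softmax(q_values, eps=0.1, temperature=1.0):
    """pi = eps * uniform + (1 - eps) * softmax(q / temperature).
    Every action keeps probability at least eps / |A|, which is what
    guarantees sufficient exploration under the behavior policy."""
    q = np.asarray(q_values, dtype=float) / temperature
    q = q - q.max()  # shift logits for numerical stability
    soft = np.exp(q) / np.exp(q).sum()
    n = len(q)
    return eps / n + (1.0 - eps) * soft

probs = eps_softmax([1.0, 2.0, 0.5], eps=0.2)
```

As the temperature shrinks toward zero the softmax concentrates on the greedy action, so the policy approaches ordinary epsilon-greedy; adapting the temperature lets the analysis interpolate between the two regimes.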
Paperid:1309
Authors:Nikita Morozov · Ian Maksimov · Daniil Tiapkin · Sergey Samsonov
Abstract:
Generative Flow Networks (GFlowNets) are a family of generative models that learn to sample objects from a given probability distribution, potentially known up to a normalizing constant. Instead of working in the object space, GFlowNets proceed by sampling trajectories in an appropriately constructed directed acyclic graph environment, greatly relying on the acyclicity of the graph. In our paper, we revisit the theory that relaxes the acyclicity assumption and present a simpler theoretical framework for non-acyclic GFlowNets in discrete environments. Moreover, we provide various novel theoretical insights related to training with fixed backward policies, the nature of flow functions, and connections between entropy-regularized RL and non-acyclic GFlowNets, which naturally generalize the respective concepts and theoretical results from the acyclic setting. In addition, we experimentally re-examine the concept of loss stability in non-acyclic GFlowNet training, as well as validate our own theoretical findings.
Paperid:1310
Authors:Hongtu Zhou · Ruiling Yang · Yakun Zhu · Haoqi Zhao · Hai Zhang · Di Zhang · Junqiao Zhao · Chen Ye · Changjun Jiang
Abstract:
Existing context-based offline meta-reinforcement learning (COMRL) methods primarily focus on task representation learning and given-context adaptation performance. They often assume that the adaptation context is collected using task-specific behavior policies or through multiple rounds of collection. However, in real applications, the context should be collected by a policy in a one-shot manner to ensure efficiency and safety. We find that intrinsic context ambiguity across multiple tasks and out-of-distribution (OOD) issues due to distribution shift significantly affect the performance of one-shot adaptation, which has been largely overlooked in most COMRL research. To address this problem, we propose using heteroscedastic uncertainty in representation learning to identify ambiguous and OOD contexts, and train an uncertainty-aware context-collecting policy for effective one-shot online adaptation. The proposed method can be integrated into various COMRL frameworks, including classifier-based, reconstruction-based and contrastive learning-based approaches. Empirical evaluations on benchmark tasks show that our method can improve one-shot adaptation performance by up to 36% and zero-shot adaptation performance by up to 34% compared to existing baseline COMRL methods.
Paperid:1311
Authors:Arnold Filtser · Shaofeng Jiang · Yi Li · Anurag Murty Naredla · Ioannis Psarros · Qiaoyuan Yang · Qin Zhang
Abstract:
We study efficient algorithms for the Euclidean $k$-Center problem, focusing on the regime of large $k$. We take the approach of data reduction by considering an $\alpha$-coreset, which is a small subset $S$ of the dataset $P$ such that any $\beta$-approximation on $S$ is an $(\alpha + \beta)$-approximation on $P$. We give efficient algorithms to construct coresets whose size is $k \cdot o(n)$, which immediately speeds up existing approximation algorithms. Notably, we obtain a near-linear time $O(1)$-approximation when $k = n^c$ for any $0 < c < 1$. We validate the performance of our coresets on real-world datasets with large $k$, and we observe that the coreset speeds up the well-known Gonzalez algorithm by up to $4$ times, while still achieving similar clustering cost. Technically, one of our coreset results is based on a new efficient construction of consistent hashing with competitive parameters. This general tool may be of independent interest for algorithm design in high dimensional Euclidean spaces.
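The Gonzalez algorithm that the coreset accelerates is the classical farthest-first traversal, a 2-approximation for k-center. A compact NumPy version, shown on synthetic clustered data for illustration:

```python
import numpy as np

def gonzalez(points, k):
    """Farthest-first traversal: start from an arbitrary point, then
    repeatedly add the point farthest from the current center set.
    A classical 2-approximation for the k-center objective
    (minimize the maximum point-to-nearest-center distance)."""
    centers = [0]
    dists = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))  # farthest point from current centers
        centers.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(points - points[nxt], axis=1))
    return centers, float(dists.max())  # chosen indices and clustering radius

rng = np.random.default_rng(0)
# three well-separated synthetic clusters
pts = np.concatenate([c + 0.3 * rng.standard_normal((50, 2))
                      for c in (np.zeros(2), np.array([10.0, 0.0]),
                                np.array([0.0, 10.0]))])
centers, radius = gonzalez(pts, k=3)
```

A coreset-based pipeline would run the same traversal on the small subset $S$ rather than all of $P$, which is where the reported speedups come from.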
Paperid:1312
Authors:David Salinas · Omar Swelam · Frank Hutter
Abstract:
Evaluating Large Language Models (LLMs) often requires costly human annotations. To address this, LLM-based judges have been proposed, which compare the outputs of two LLMs, enabling the ranking of models without human intervention. While several approaches have been proposed, many confounding factors are present across different papers: for instance, the model, the prompt, and other hyperparameters are typically changed at the same time, making apples-to-apples comparisons challenging. In this paper, we propose to systematically analyze and tune the hyperparameters of LLM judges. To alleviate the high cost of evaluating a judge, we propose to leverage multi-objective, multi-fidelity optimization, which finds judges that trade accuracy for cost and also significantly reduces the cost of the search. Our method identifies judges that not only outperform existing benchmarks in accuracy and cost-efficiency but also utilize open-weight models, ensuring greater accessibility and reproducibility.
Paperid:1313
Authors:Kiran Purohit · Venktesh V · Sourangshu Bhattacharya · Avishek Anand
Abstract:
The in-context learning paradigm with LLMs has been instrumental in advancing a wide range of natural language processing tasks. The selection of few-shot examples (exemplars / demonstration samples) is essential for constructing effective prompts under context-length budget constraints. In this paper, we formulate the exemplar selection task as a top-m best arms identification problem. A key challenge in this setup is the exponentially large number of arms that need to be evaluated to identify the m-best arms. We propose CASE (Challenger Arm Sampling for Exemplar selection), a novel sample-efficient selective exploration strategy that maintains a shortlist of "challenger" arms, which are current candidates for the top-m arms. In each iteration, only one of the arms from this shortlist or the current top-m set is pulled, thereby reducing sample complexity and, consequently, the number of LLM evaluations. Furthermore, we model the scores of exemplar subsets (arms) using a parameterized linear scoring function, leading to a stochastic linear bandit setting. Our sample-efficient method, CASE, achieves remarkable efficiency gains of up to a 7× speedup in runtime while requiring 7× fewer LLM calls (an 87% reduction) without sacrificing performance compared to state-of-the-art exemplar selection methods. We release our code and data (https://anonymous.4open.science/r/CASEexemplarbandits-7403).
Paperid:1314
Authors:Feifei Li · Mi Zhang · Zhaoxiang Wang · Min Yang
Abstract:
Point clouds (PCs), composed of thousands of 3D points that represent object surfaces, serve as inputs for deep learning-based PC models. Interpretability of PC models becomes imperative given their deployment in safety-critical scenarios such as autonomous vehicles. Our work primarily focuses on attributing PC model outputs to a group of interpretable critical concepts (i.e., subsets of the PC). To enable human-understandable diagnostics of model failures, an ideal critical subset should be faithful (preserving points that causally influence predictions) and conceptually coherent (forming semantically meaningful structures that align with human perception). Existing methods infer points' influence using values of gradients or activations, yet neglect conceptual cohesion. We propose InfoCons, an explanation framework that applies information-theoretic principles to decompose the point cloud into 3D concepts, enabling the examination of their causal effect on model predictions with learnable priors. We evaluate InfoCons on synthetic datasets for classification, comparing it qualitatively and quantitatively with four baselines. Additionally, we demonstrate the scalability of InfoCons across eight PC models, two real-world datasets, and two applications that utilize critical-guided methods. The results show that InfoCons excels at faithfully attributing incorrect predictions and demonstrates scalability and flexibility in its applications.
Paperid:1315
Authors:Ruiqi Zhang · Jingfeng Wu · Peter Bartlett
Abstract:
We analyze the convergence of gradient descent (GD) with large, adaptive stepsizes for logistic regression on linearly separable data. The stepsize adapts to the current risk, scaled by a fixed base stepsize $\eta$. We prove that once the number of iterates $t$ surpasses a margin-dependent threshold, the averaged GD iterate achieves a risk upper bound of $\exp(-\Theta(\eta t))$, where $\eta$ can be chosen arbitrarily large. This implies that GD attains arbitrarily fast convergence rates via large stepsizes, although the risk evolution might not be monotonic. In contrast, prior adaptive-stepsize GD analyses require a monotonic risk decrease, limiting their rates to $\exp(-\Theta(t))$. We further establish a margin-dependent lower bound on the iteration complexity for any first-order method to attain a small risk, justifying the necessity of the burn-in phase in our analysis. Our results generalize to a broad class of loss functions and two-layer networks under additional assumptions.
Paperid:1316
Authors:Amit Attia · Ofir Gaash · Tomer Koren
Abstract:
We consider the problem of asynchronous stochastic optimization, where an optimization algorithm makes updates based on stale stochastic gradients of the objective that are subject to an arbitrary (possibly adversarial) sequence of delays. We present a procedure which, for any given $q \in (0,1]$, transforms any standard stochastic first-order method to an asynchronous method with convergence guarantee depending on the $q$-quantile delay of the sequence. This approach leads to convergence rates of the form $O(\tau_q/qT+\sigma/\sqrt{qT})$ for non-convex and $O(\tau_q^2/(q T)^2+\sigma/\sqrt{qT})$ for convex smooth problems, where $\tau_q$ is the $q$-quantile delay, generalizing and improving on existing results that depend on the average delay. We further show a method that automatically adapts to all quantiles simultaneously, without any prior knowledge of the delays, achieving convergence rates of the form $O(\inf_{q} \tau_q/qT+\sigma/\sqrt{qT})$ for non-convex and $O(\inf_{q} \tau_q^2/(q T)^2+\sigma/\sqrt{qT})$ for convex smooth problems. Our technique is based on asynchronous mini-batching with a careful batch-size selection and filtering of stale gradients.
Paperid:1317
Authors:Letong Hong · Zhangyang “Atlas” Wang
Abstract:
Maximal Update Parameterization ($\mu$P) has shown significant promise in allowing zero-shot hyperparameter transfer across neural network scales, reducing the prohibitive cost of hyperparameter tuning for large models. However, the theoretical foundation behind the observed approximate transferability of hyperparameters remains underexplored. We establish the first fundamental separation of scales in $\mu$P between macro-variables (e.g., loss landscapes) and micro-variables (e.g., individual weights). Our formulation explains why hyperparameter tuning can be effectively performed in early training stages, i.e., early statistics effectively approximate global hyperparameter optima, implying the potential to significantly reduce the total number of training steps required for identifying optimal hyperparameters and minimizing search cost. We further apply our main theory to explain two empirical deep learning phenomena discovered independently by prior work, through one unified lens.
Paperid:1318
Authors:Enric Borrell · Lorenz Richter · Christof Schuette
Abstract:
We extend the standard reinforcement learning framework to random time horizons. While the classical setting typically assumes finite and deterministic or infinite runtimes of trajectories, we argue that multiple real-world applications naturally exhibit random (potentially trajectory-dependent) stopping times. Since those stopping times typically depend on the policy, their randomness has an effect on policy gradient formulas, which we (mostly for the first time) derive rigorously in this work, both for stochastic and deterministic policies. We present two complementary perspectives, trajectory-based and state-space-based, and establish connections to optimal control theory. Our numerical experiments demonstrate that using the proposed formulas can significantly improve optimization convergence compared to traditional approaches.
Paperid:1319
Authors:Vint Lee · Minh Nguyen · Leena Elzeiny · Chun Deng · Pieter Abbeel · Wawrzynek
Abstract:
Macro placement is a vital step in digital circuit design that defines the physical location of large collections of components, known as macros, on a 2-dimensional chip. The physical layout obtained during placement determines key performance metrics of the chip, such as power consumption, area, and performance. Existing learning-based methods typically fall short because of their reliance on reinforcement learning, which is slow and limits the flexibility of the agent by casting placement as a sequential process. Instead, we use a diffusion model to place all components simultaneously. To enable such models to train at scale, we propose a novel architecture for the denoising model, as well as an algorithm to generate large synthetic datasets for pre-training. We empirically show that our model, when trained on synthetic data, can tackle the placement task, and achieve competitive performance on placement benchmarks compared to state-of-the-art methods.
Paperid:1320
Authors:Walter Mayor · Johan Obando Ceron · Aaron Courville · Pablo Samuel Castro
Abstract:
The use of parallel actors for data collection has been an effective technique used in reinforcement learning (RL) algorithms. The manner in which data is collected in these algorithms, controlled via the number of parallel environments and the rollout length, induces a form of bias-variance trade-off; the number of training passes over the collected data, on the other hand, must strike a balance between sample efficiency and overfitting. We conduct an empirical analysis of these trade-offs on PPO, one of the most popular RL algorithms that uses parallel actors, and establish connections to network plasticity and, more generally, optimization stability. We examine its impact on network architectures, as well as the hyper-parameter sensitivity when scaling data. Our analyses indicate that larger dataset sizes can increase final performance across a variety of settings, and that scaling parallel environments is more effective than increasing rollout lengths. These findings highlight the critical role of data collection strategies in improving agent performance.
Paperid:1321
Authors:Pan Du · Zhao · Xinai Lu · Nian Liu · Zhikai Li · Chaoyu Gong · Suyun Zhao · Hong Chen · Cuiping Li · Kai Wang · Yang You
Abstract:
Class distribution mismatch (CDM) refers to the discrepancy between class distributions in training data and target tasks. Previous methods address this by designing classifiers to categorize classes known during training, while grouping unknown or new classes into an "other" category. However, they focus on semi-supervised scenarios and heavily rely on labeled data, limiting their applicability and performance. To address this, we propose Unsupervised Learning for Class Distribution Mismatch (UCDM), which constructs positive-negative pairs from unlabeled data for classifier training. Our approach randomly samples images and uses a diffusion model to add or erase semantic classes, synthesizing diverse training pairs. Additionally, we introduce a confidence-based labeling mechanism that iteratively assigns pseudo-labels to valuable real-world data and incorporates them into the training process. Extensive experiments on three datasets demonstrate UCDM's superiority over previous semi-supervised methods. Specifically, with a 60% mismatch proportion on the Tiny-ImageNet dataset, our approach, without relying on labeled data, surpasses OpenMatch (with 40 labels per class) by 35.1%, 63.7%, and 72.5% in classifying known, unknown, and new classes, respectively.
Paperid:1322
Authors:Xu Zhang · Kaidi Xu · Ziqing Hu · Ren Wang
Abstract:
Mixture of Experts (MoE) models have shown remarkable success in leveraging specialized expert networks for complex machine learning tasks. However, their susceptibility to adversarial attacks presents a critical challenge for deployment in robust applications. This paper addresses the critical question of how to incorporate robustness into MoEs while maintaining high natural accuracy. We begin by analyzing the vulnerability of MoE components, finding that expert networks are notably more susceptible to adversarial attacks than the router. Based on this insight, we propose a targeted robust training technique that integrates a novel loss function to enhance the adversarial robustness of MoE, requiring only the robustification of one additional expert without compromising training or inference efficiency. Building on this, we introduce a dual-model strategy that linearly combines a standard MoE model with our robustified MoE model using a smoothing parameter. This approach allows for flexible control over the robustness-accuracy trade-off. We further provide theoretical foundations by deriving certified robustness bounds for both the single MoE and the dual-model. To push the boundaries of robustness and accuracy, we propose a novel joint training strategy, JTDMoE, for the dual-model. This joint training enhances both robustness and accuracy beyond what is achievable with separate models. Experimental results on the CIFAR-10 and TinyImageNet datasets using ResNet18 and Vision Transformer (ViT) architectures demonstrate the effectiveness of our proposed methods.
Paperid:1323
Authors:Nabeel Seedat · M van der Schaar
Abstract:
Schema matching -- the task of finding matches between attributes across disparate data sources with different tables and hierarchies -- is critical for creating interoperable machine learning (ML)-ready data. Addressing this fundamental data-centric problem has wide implications, especially in domains like healthcare, finance and e-commerce --- but also has the potential to benefit ML models more generally, by increasing the data available for ML model training. However, schema matching is a challenging ML task due to structural/hierarchical and semantic heterogeneity between different schemas. Previous ML approaches to automate schema matching have either required significant labeled data for model training, which is often unrealistic, or suffered from poor zero-shot performance. To this end, we propose Matchmaker - a compositional language model program for schema matching, comprised of candidate generation, refinement and confidence scoring. Matchmaker also self-improves in a zero-shot manner without the need for labeled demonstrations via a novel optimization approach, which constructs synthetic in-context demonstrations to guide the language model's reasoning process. Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches, highlighting its potential to accelerate data integration and interoperability of ML-ready data.
Paperid:1324
Authors:Yi Xu · Laura Ruis · Tim Rocktäschel · Robert Kirk
Abstract:
Automatic evaluation methods based on large language models (LLMs) are emerging as the standard tool for assessing the instruction-following abilities of LLM-based agents. The most common method in this paradigm, pairwise comparisons with a baseline model, critically depends on the assumption of transitive preferences. However, the validity of this assumption remains largely unexplored. In this study, we investigate the presence of non-transitivity within the AlpacaEval framework and analyze its effects on model rankings. We find that LLM judges exhibit non-transitive preferences, leading to rankings that are sensitive to the choice of the baseline model. To mitigate this issue, we show that round-robin tournaments combined with Bradley-Terry models of preference can produce more reliable rankings. Notably, our method increases both the Spearman correlation and the Kendall correlation with Chatbot Arena (95.0\% $\rightarrow$ 96.4\% and 82.1\% $\rightarrow$ 86.3\% respectively). To address the computational cost of round-robin tournaments, we propose Swiss-Wise Iterative Matchmaking (Swim) tournaments, using a dynamic matching strategy to capture the benefits of round-robin tournaments while maintaining computational efficiency.
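The Bradley-Terry model assigns each model $i$ a strength $w_i$ with $P(i \text{ beats } j) = w_i/(w_i + w_j)$; fitting it to a round-robin of pairwise wins yields a single consistent ranking even when the raw pairwise preferences cycle. A minimal fit via the classical Zermelo/MM updates (an illustrative sketch, not the authors' code):

```python
import numpy as np

def bradley_terry(wins, n_iter=500):
    """Fit strengths w from a pairwise win-count matrix (wins[i, j] = number
    of times i beat j) using the classical MM updates:
    w_i <- (total wins of i) / sum_{j != i} n_ij / (w_i + w_j)."""
    wins = np.asarray(wins, dtype=float)
    n_games = wins + wins.T  # total comparisons per pair
    w = np.ones(len(wins))
    for _ in range(n_iter):
        for i in range(len(w)):
            denom = sum(n_games[i, j] / (w[i] + w[j])
                        for j in range(len(w)) if n_games[i, j] > 0)
            if denom > 0:
                w[i] = wins[i].sum() / denom
        w = w / w.sum()  # strengths are only defined up to scale
    return w

# A toy head-to-head win table for three models
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]])
strengths = bradley_terry(wins)
```

Sorting by `strengths` gives the tournament ranking; the non-transitivity problem with a single fixed baseline disappears because every pair contributes to the fit.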
Paperid:1325
Authors:Junyi Chai · Taeuk Jang · Jing Gao · Xiaoqian Wang
Abstract:
While numerous methods have been proposed to address fairness in machine learning, existing methods do not guarantee fair predictions under imperceptible feature perturbation, and a seemingly fair model can suffer from large group-wise disparities under such perturbation. Moreover, while adversarial training has been shown to be reliable in improving a model's robustness to defend against adversarial feature perturbations that deteriorate accuracy, it has not been properly studied in the context of adversarial perturbations against fairness. To tackle these challenges, in this paper we study the problem of adversarial attack and adversarial robustness w.r.t. two terms: fairness and accuracy. From the adversarial attack perspective, we propose a unified structure for adversarial attacks against fairness which brings together common notions in group fairness, and we theoretically prove the equivalence of adversarial attacks against different fairness notions. Further, we derive the connections between adversarial attacks against fairness and those against accuracy. From the adversarial robustness perspective, we theoretically align robustness to adversarial attacks against fairness and accuracy, where robustness w.r.t. one term enhances robustness w.r.t. the other term. Our study suggests a novel way to unify adversarial training w.r.t. fairness and accuracy, and experiments show our proposed method achieves better robustness w.r.t. both terms.
Paperid:1326
Authors:Jun-En Ding · Dongsheng Luo · Chenwei Wu · Feng Liu
Abstract:
Analyzing functional brain networks using functional magnetic resonance imaging (fMRI) is crucial for understanding psychiatric disorders and addictive behaviors. Existing fMRI-based graph convolutional networks (GCNs) show significant potential for feature learning in brain networks. However, they are limited in their ability to accurately characterize inter-regional relationships when modeling complex brain networks and in accounting for the influence of interpretable variables in psychiatric disorders. To address these limitations, we propose NeuroTree, which integrates k-hop GCN with neural ordinary differential equations (ODEs). This framework optimizes functional connectivity (FC) strength to enhance dynamic FC feature learning for brain disease classification. Furthermore, it captures high-order brain regional pathway features in a tree topology, enabling the identification of hierarchical neural behavioral patterns from fMRI data. Our empirical evaluations demonstrate that NeuroTree achieves state-of-the-art performance across two distinct mental disorder datasets and provides valuable insights into age-related deterioration patterns. These findings underscore the model's effectiveness in predicting psychiatric disorders and elucidating their underlying neural mechanisms.
Paperid:1327
Authors:Valentyn Boreiko · Alexander Panfilov · Václav Voráček · Matthias Hein · Jonas Geiping
Abstract:
A plethora of jailbreaking attacks have been proposed to obtain harmful responses from safety-tuned LLMs. These methods largely succeed in coercing the target output in their original settings, but their attacks vary substantially in fluency and computational effort. In this work, we propose a unified threat model for the principled comparison of these methods. Our threat model checks if a given jailbreak is likely to occur in the distribution of text. For this, we build an N-gram language model on 1T tokens, which, unlike model-based perplexity, allows for an LLM-agnostic, non-parametric, and inherently interpretable evaluation. We adapt popular attacks to this threat model, and, for the first time, benchmark these attacks on equal footing with it. After a rigorous comparison, we find attack success rates against safety-tuned modern models to be lower than previously presented and that attacks based on discrete optimization significantly outperform recent LLM-based attacks. Being inherently interpretable, our threat model allows for a comprehensive analysis and comparison of jailbreak attacks. We find that effective attacks exploit and abuse infrequent words, either selecting phrases absent from real-world text or rare ones, e.g., specific to Reddit or code datasets.
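A toy version of the idea, scoring text by its average smoothed bigram log-probability under a reference corpus, illustrates how an N-gram model flags text that is unlikely under the natural distribution. The paper's model is trained on 1T tokens; the corpus and tokens here are purely illustrative:

```python
import math
from collections import Counter

def train_bigram_scorer(corpus_tokens):
    """Return a function giving the average add-one-smoothed bigram
    log-probability per transition; lower scores mean less natural text."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab = len(unigrams)

    def score(tokens):
        lps = [math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
               for a, b in zip(tokens, tokens[1:])]
        return sum(lps) / len(lps)

    return score

corpus = "the cat sat on the mat the cat sat on the rug".split()
score = train_bigram_scorer(corpus)
natural = score("the cat sat on the mat".split())
unnatural = score("mat rug the sat cat".split())
```

A threat model of this kind then thresholds the score: candidate jailbreak strings whose per-token probability falls below what natural text attains are ruled out, independently of any particular LLM.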
Paperid:1328
Authors:Allen Nie · Yi Su · Bo Chang · Jonathan Lee · Ed Chi · Quoc Le · Minmin Chen
Abstract:
Despite their success in many domains, large language models (LLMs) remain understudied in scenarios requiring optimal decision-making under uncertainty. This is crucial as many real-world applications, ranging from personalized recommendations to healthcare interventions, demand that LLMs not only predict but also actively learn to make optimal decisions through exploration. In this work, we measure LLMs' (in)ability to make optimal decisions in bandits, a stateless reinforcement learning setting relevant to many applications. We develop a comprehensive suite of environments, including both context-free and contextual bandits with varying task difficulties, to benchmark LLMs' performance. Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs: by providing explicit algorithm-guided support during inference; and through algorithm distillation via in-context demonstrations and fine-tuning, using synthetic data generated from these algorithms. Impressively, these techniques allow us to achieve superior exploration performance with smaller models, surpassing larger models on various tasks. We conducted an extensive ablation study to shed light on various factors, such as task difficulty and data representation, that influence the efficiency of LLM exploration. Additionally, we conduct a rigorous analysis of the LLM's exploration efficiency using the concept of regret, linking its ability to explore to the model size and underlying algorithm.
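One canonical "optimal exploration algorithm" of the kind that can be distilled into an LLM is UCB. A minimal context-free version (a generic sketch, not the paper's exact algorithmic guide):

```python
import numpy as np

def ucb1(pull, n_arms, horizon, c=2.0):
    """UCB1: play each arm once, then repeatedly pick the arm maximizing
    empirical mean + sqrt(c * ln(t) / n_i), balancing exploration and
    exploitation via the confidence bonus."""
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialization: one pull per arm
        else:
            bonus = np.sqrt(c * np.log(t) / counts)
            arm = int(np.argmax(sums / counts + bonus))
        sums[arm] += pull(arm)
        counts[arm] += 1
    return counts, sums

rng = np.random.default_rng(0)
means = [0.2, 0.8]  # two Bernoulli arms; arm 1 is better
counts, sums = ucb1(lambda a: float(rng.random() < means[a]), 2, 2000)
```

After the horizon, the better arm dominates the pull counts; the bonus term shrinks as an arm is sampled, so suboptimal arms are pulled only O(log T) times, which is the regret behavior the paper measures LLM agents against.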
Paperid:1329
Authors:Tong Yang · Bo Dai · Lin Xiao · Yuejie Chi
Abstract:
Multi-agent reinforcement learning (MARL) lies at the heart of a plethora of applications involving the interaction of a group of agents in a shared unknown environment. A prominent framework for studying MARL is Markov games, with the goal of finding various notions of equilibria in a sample-efficient manner, such as the Nash equilibrium (NE) and the coarse correlated equilibrium (CCE). However, existing sample-efficient approaches either require tailored uncertainty estimation under function approximation, or careful coordination of the players. In this paper, we propose a novel model-based algorithm, called VMG, that incentivizes exploration via biasing the empirical estimate of the model parameters towards those with higher collective best-response values of all the players when fixing the other players' policies, thus encouraging the policy to deviate from its current equilibrium for more exploration. VMG is oblivious to different forms of function approximation, and permits simultaneous and uncoupled policy updates of all players. Theoretically, we also establish that VMG achieves a near-optimal regret for finding both the NEs of two-player zero-sum Markov games and CCEs of multi-player general-sum Markov games under linear function approximation in an online environment, which nearly match their counterparts with sophisticated uncertainty quantification.
Paperid:1330
Authors:Jeonghoon Kim · Byeongchan Lee · Cheonbok Park · Yeontaek Oh · Beomjun Kim · Taehwan Yoo · Seongjin Shin · Dongyoon Han · Jinwoo Shin · Kang Min Yoo
Abstract:
Designing Transformer architectures with the optimal layer normalization (LN) strategy that ensures large-scale training stability and expedites convergence has remained elusive, even in this era of large language models (LLMs). To this end, we present a comprehensive analytical foundation for understanding how different LN strategies influence training dynamics in large-scale Transformer training. Until recently, Pre-LN and Post-LN have long dominated standard practices despite their limitations in large-scale training. However, several open-source large-scale models have recently begun silently adopting a third strategy without much explanation. This strategy places layer normalization (LN) peripherally around sublayers, a design we term Peri-LN. While Peri-LN has demonstrated promising empirical performance, its precise mechanisms and benefits remain almost unexplored. Our in-depth analysis shows that Peri-LN strikes an ideal balance in variance growth---unlike Pre-LN and Post-LN, which are prone to vanishing gradients and ``massive activations.'' To validate our theoretical insight, we conduct large-scale experiments on Transformers up to $3.2$B parameters, showing that Peri-LN consistently achieves more balanced variance growth, steadier gradient flow, and convergence stability. Our results suggest that Peri-LN warrants broader consideration for large-scale Transformer architectures, providing renewed insights into the optimal placement and application of LN.
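The three placements can be compared in a toy residual block. The Peri-LN branch below follows the abstract's description of LN applied peripherally (before and after) each sublayer; it is an assumption-laden illustration, not the paper's architecture.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def block(x, sublayer, strategy):
    """One residual block under three LN placement strategies."""
    if strategy == "pre":    # LN before the sublayer only
        return x + sublayer(layer_norm(x))
    if strategy == "post":   # LN after the residual addition
        return layer_norm(x + sublayer(x))
    if strategy == "peri":   # LN both before and after the sublayer
        return x + layer_norm(sublayer(layer_norm(x)))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
sub = lambda h: h @ W                 # stand-in for attention/MLP
x = rng.normal(size=(4, 8))
out = {s: block(x, sub, s) for s in ("pre", "post", "peri")}
# Peri-LN normalizes each sublayer's output, bounding its contribution
# to the residual stream; Pre-LN leaves that contribution unnormalized.
assert np.std(out["peri"] - x) < np.std(out["pre"] - x)
```

The final assertion shows, in miniature, why Peri-LN limits variance growth along the residual stream.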
Paperid:1331
Authors:Wenjing Yan · Xiangyu Zhong · Xiaolu Wang · Angela Yingjun Zhang
Abstract:
Asynchronous federated learning (AFL) has emerged as a promising solution to address system heterogeneity and improve the training efficiency of federated learning. However, existing AFL methods face two critical limitations: 1) they rely on strong assumptions about bounded data heterogeneity across clients, and 2) they require meticulous tuning of learning rates based on unknown system parameters. In this paper, we tackle these challenges by leveraging momentum-based optimization and adaptive learning strategies. We first propose MasFL, a novel momentum-driven AFL framework that successfully eliminates the need for data heterogeneity bounds by effectively utilizing historical descent directions across clients and iterations. By mitigating the staleness accumulation caused by asynchronous updates, we prove that MasFL achieves state-of-the-art convergence rates with linear speedup in both the number of participating clients and local updates. Building on this foundation, we further introduce AdaMasFL, an adaptive variant that incorporates gradient normalization into local updates. Remarkably, this integration removes all dependencies on problem-specific parameters, yielding a fully tuning-free AFL approach while retaining theoretical guarantees. Extensive experiments demonstrate that AdaMasFL consistently outperforms state-of-the-art AFL methods in runtime efficiency and exhibits exceptional robustness across diverse learning rate configurations and system conditions.
Paperid:1332
Authors:Ting-Ji Huang · Jia-Qi Yang · Chunxu Shen · Kai-Qi Liu · De-Chuan Zhan · Han-Jia Ye
Abstract:
Characterizing users and items through vector representations is crucial for various tasks in recommender systems. Recent approaches attempt to apply Large Language Models (LLMs) in recommendation through a question-and-answer format, where real items (e.g., Item No. 2024) are represented with compound words formed from in-vocabulary tokens (e.g., "item", "20", "24"). However, these tokens are not suitable for representing items, as their meanings are shaped by pre-training on natural language tasks, limiting the model's ability to capture user-item relationships effectively. In this paper, we explore how to effectively characterize users and items in LLM-based recommender systems from the token construction view. We demonstrate the necessity of using out-of-vocabulary (OOV) tokens for the characterization of items and users, and propose a principled way of constructing these OOV tokens. By clustering the learned representations from historical user-item interactions, we make the representations of user/item combinations share the same OOV tokens if they have similar properties. This construction allows us to capture user/item relationships well (memorization) and preserve the diversity of descriptions of users and items (diversity). Furthermore, integrating these OOV tokens into the LLM’s vocabulary allows for better distinction between users and items and enhanced capture of user-item relationships during fine-tuning on downstream tasks. Our proposed framework outperforms existing state-of-the-art methods across various downstream recommendation tasks.
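The clustering step, similar items sharing OOV tokens, can be sketched with a toy k-means over interaction-derived embeddings. The function name, farthest-point initialization, and data here are illustrative inventions, not the paper's procedure.

```python
import numpy as np

def assign_oov_tokens(emb, n_tokens, iters=20):
    """Toy clustering: map each item embedding to a shared OOV token id."""
    # farthest-point initialization, then plain k-means
    centers = emb[[0]]
    for _ in range(n_tokens - 1):
        d = np.linalg.norm(emb[:, None] - centers[None], axis=-1).min(-1)
        centers = np.vstack([centers, emb[d.argmax()]])
    for _ in range(iters):
        labels = np.linalg.norm(emb[:, None] - centers[None], axis=-1).argmin(-1)
        centers = np.vstack([emb[labels == k].mean(0) for k in range(n_tokens)])
    return labels  # items in the same cluster share an OOV token

rng = np.random.default_rng(1)
# two groups of items with similar interaction histories
items = np.vstack([rng.normal(0, 0.1, (5, 4)), rng.normal(3, 0.1, (5, 4))])
tokens = assign_oov_tokens(items, n_tokens=2)
assert len(set(tokens[:5])) == 1 and tokens[0] != tokens[5]
```

Items with similar histories end up with the same token id, which is the memorization/diversity trade-off the abstract describes.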
Paperid:1333
Authors:Joshua Southern · Yam Eitan · Guy Bar Shalom · Michael Bronstein · Haggai Maron · Fabrizio Frasca
Abstract:
Subgraph GNNs have emerged as promising architectures that overcome the expressiveness limitations of Graph Neural Networks (GNNs) by processing bags of subgraphs. Despite their compelling empirical performance, these methods are afflicted by a high computational complexity: they process bags whose size grows linearly in the number of nodes, hindering their applicability to larger graphs. In this work, we propose an effective and easy-to-implement approach to dramatically alleviate the computational cost of Subgraph GNNs and unleash broader applications thereof. Our method, dubbed HyMN, leverages walk-based centrality measures to sample a small number of relevant subgraphs and drastically reduce the bag size. By drawing a connection to perturbation analysis, we highlight the strength of the proposed centrality-based subgraph sampling, and further prove that these walk-based centralities can be additionally used as Structural Encodings for improved discriminative power. A comprehensive set of experimental results demonstrates that HyMN provides an effective synthesis of expressiveness, efficiency, and downstream performance, unlocking the application of Subgraph GNNs to dramatically larger graphs. Not only does our method outperform more sophisticated subgraph sampling approaches, it is also competitive, and sometimes better, than other state-of-the-art approaches for a fraction of their runtime.
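A minimal version of centrality-based selection: count short walks from each node and keep only the top-k nodes as subgraph roots, shrinking the bag from n subgraphs to k. This is a sketch; the paper's exact centrality measure and sampling scheme may differ.

```python
import numpy as np

def walk_centrality(A, length=3):
    """Count walks of up to `length` steps starting at each node."""
    c = np.zeros(len(A))
    P = np.eye(len(A))
    for _ in range(length):
        P = P @ A
        c += P.sum(1)
    return c

def sample_subgraph_roots(A, k, length=3):
    # keep only the k most central nodes as subgraph roots
    return np.argsort(-walk_centrality(A, length))[:k]

# star graph: node 0 is the hub, nodes 1-4 are leaves
A = np.zeros((5, 5))
A[0, 1:] = A[1:, 0] = 1
roots = sample_subgraph_roots(A, k=1)
assert roots[0] == 0  # the hub has the highest walk centrality
```

The same centrality vector could also be concatenated to node features as the Structural Encoding the abstract mentions.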
Paperid:1334
Authors:Xuanming Cui · Chionh Peng · Adriel Kuek · Ser-Nam Lim
Abstract:
Neural Theorem Provers (NTPs) present a promising framework for neuro-symbolic reasoning, combining end-to-end differentiability with the interpretability of symbolic logic programming. However, optimizing NTPs remains a significant challenge due to their complex objective landscape and gradient sparsity. On the other hand, Knowledge Graph Embedding (KGE) methods offer smooth optimization with well-defined learning objectives but often lack interpretability. In this work, we propose several strategies to integrate the strengths of NTPs and KGEs, and demonstrate substantial improvements in both accuracy and computational efficiency. Specifically, we show that by leveraging the strength of structural learning in KGEs, we can greatly improve NTPs' poorly structured embedding space, while by substituting NTPs with efficient KGE operations, we can significantly reduce evaluation time by over 1000$\times$ on large-scale datasets such as WN18RR with a mild accuracy trade-off.
Paperid:1335
Authors:Zhuoling Li · LiangLiang Ren · Jinrong Yang · Yong Zhao · Xiaoyang Wu · Zhenhua Xu · Xiang Bai · Hengshuang Zhao
Abstract:
The effectiveness of scaling up training data in robotic manipulation is still limited. A primary challenge in manipulation is that the tasks are diverse, and the trained policy would be confused if the task targets are not specified clearly. Existing works primarily rely on text instructions to describe targets. However, we reveal that current manipulation data cannot train policies to understand text instructions effectively, and vision is much more comprehensible. Therefore, we propose using vision instructions to specify targets. A straightforward implementation is training a policy to predict the intermediate actions linking the current observation with the task target described by a future image. Nevertheless, a single future image does not describe the target in sufficient detail. To handle this problem, we propose to use sparse point flows to provide more detailed information. Extensive tasks are designed based on real and simulated environments to evaluate the effectiveness of our vision-instructed pretraining (VIP) method. The results indicate VIP improves the performance on diverse tasks significantly, and the derived policy completes challenging tasks like ``opening the lid of a tightly sealed bottle''.
Paperid:1336
Authors:Seyed Mohammad Sadegh Mahdavi · Muchen Li · Kaiwen Liu · Christos Thrampoulidis · Leonid Sigal · Renjie Liao
Abstract:
Advances in Large Language Models (LLMs) have sparked interest in their ability to solve Olympiad-level math problems. However, the training and evaluation of these models are constrained by the limited size and quality of available datasets, as creating large-scale data for such advanced problems requires extensive effort from human experts. In addition, current benchmarks are prone to contamination, leading to unreliable evaluations. In this paper, we present an automated pipeline that leverages the rich resources of the Art of Problem Solving (AoPS) forum, which predominantly features Olympiad-level problems and community-driven solutions. Using open-source LLMs, we develop a method to extract question-answer pairs from the forum, resulting in AoPS-Instruct, a dataset of more than 600,000 high-quality QA pairs. Our experiments demonstrate that fine-tuning LLMs on AoPS-Instruct improves their reasoning abilities across various benchmarks. Moreover, we build an automatic pipeline that introduces LiveAoPSBench, an evolving evaluation set with timestamps, derived from the latest forum data, providing a contamination-resistant benchmark for assessing LLM performance. Notably, we observe a significant decline in LLM performance over time, suggesting their success on older examples may stem from pre-training exposure rather than true reasoning ability. Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning, offering valuable insights into the capabilities and limitations of LLMs in this domain.
Paperid:1337
Authors:Daiki Chijiwa · Taku Hasegawa · Kyosuke Nishida · Kuniko Saito · Susumu Takeuchi
Abstract:
While foundation models have been exploited for various expert tasks with their fine-tuned parameters, any foundation model will eventually become outdated due to its old knowledge or limited capability, and thus should be replaced by a new foundation model. Subsequently, to benefit from its latest knowledge or improved capability, the new foundation model should be fine-tuned on each task again, which incurs not only the additional training cost but also the maintenance cost of the task-specific data. Existing work addresses this problem by inference-time tuning, i.e., modifying the output probability from the new foundation model by the outputs from the old foundation model and its fine-tuned model, which involves an additional inference cost by the latter two models. In this paper, we explore a new fine-tuning principle (which we call portable reward tuning; PRT) that reduces the inference cost by its nature, based on the reformulation of fine-tuning as reward maximization with Kullback-Leibler regularization. Specifically, instead of fine-tuning parameters of the foundation models, PRT trains the reward model explicitly through the same loss as in fine-tuning. During inference, the reward model can be used with any foundation model (with the same set of vocabularies or labels) through the formulation of reward maximization. Experimental results, including both vision and language models, show that the PRT-trained model can achieve comparable accuracy with less inference cost, in comparison to the existing work of inference-time tuning.
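The underlying identity is the closed-form solution of KL-regularized reward maximization, p*(y) proportional to p0(y) exp(r(y)/beta): a separately trained reward model can re-weight any foundation model's output distribution at inference time. A toy numerical sketch with invented names, not the paper's implementation:

```python
import numpy as np

def reward_adjusted(probs, rewards, beta=1.0):
    """KL-regularized reward maximization in closed form:
    p*(y) is proportional to p0(y) * exp(r(y) / beta)."""
    logits = np.log(probs) + np.array(rewards) / beta
    z = np.exp(logits - logits.max())  # subtract max for stability
    return z / z.sum()

p0 = np.array([0.5, 0.3, 0.2])  # any foundation model's label/token probs
r = [0.0, 1.0, 0.0]             # a separately trained reward model's scores
p = reward_adjusted(p0, r)
assert p[1] > p0[1]             # the rewarded outcome gains probability
```

Because only `p0` changes when the foundation model is swapped, the same reward model remains portable across models sharing a vocabulary.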
Paperid:1338
Authors:Yujia Zheng · Shaoan Xie · Kun Zhang
Abstract:
We are born with the ability to learn concepts by comparing diverse observations. This helps us to understand the new world in a compositional manner and facilitates extrapolation, as objects naturally consist of multiple concepts. In this work, we argue that the cognitive mechanism of comparison, fundamental to human learning, is also vital for machines to recover true concepts underlying the data. This offers correctness guarantees for the field of concept learning, which, despite its impressive empirical successes, still lacks general theoretical support. Specifically, we aim to develop a theoretical framework for the identifiability of concepts with multiple classes of observations. We show that with sufficient diversity across classes, hidden concepts can be identified without assuming specific concept types, functional relations, or parametric generative models. Interestingly, even when conditions are not globally satisfied, we can still provide alternative guarantees for as many concepts as possible based on local comparisons, thereby extending the applicability of our theory to more flexible scenarios. Moreover, the hidden structure between classes and concepts can also be identified nonparametrically. We validate our theoretical results in both synthetic and real-world settings.
Paperid:1339
Authors:Kiran Thekumparampil · Gaurush Hiranandani · Kousha Kalantari · Shoham Sabach · Branislav Kveton
Abstract:
We study learning human preferences from limited comparison feedback, a core machine learning task with significant impact on reinforcement learning from human feedback (RLHF). We formulate this problem as learning a Plackett-Luce (PL) model from a small number of $K$-way comparisons over a universe of $N$ choices, where typically $K \ll N$. Our objective is to select these $K$-subsets to elicit the $K$-way ranking feedback in such a way that the true ranking of the $N$ items can be estimated from the feedback with minimal mistakes. Our approach selects the subsets using the framework of D-optimal design which aims to minimize the worst-case ranking mistake of the estimated PL model. However, all known algorithms for this problem are computationally infeasible in our setting due to exponentially-many ${N \choose K}$ subsets to consider. To address this issue, we propose a Frank-Wolfe algorithm that exploits randomization, memoization, and sparse updates to achieve $O(N^2 + K^2)$ per-iteration complexity. We analyze it, and showcase its superior empirical performance on synthetic and open-source NLP datasets.
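For intuition, here is plain Frank-Wolfe on the classical relaxed D-optimal design over N single items (maximizing log det of the information matrix over the simplex). The paper's algorithm works over K-subsets with randomization and memoization, which this sketch omits entirely.

```python
import numpy as np

def d_optimal_fw(X, iters=100):
    """Frank-Wolfe for relaxed D-optimal design: maximize
    log det(sum_i w_i x_i x_i^T) over the probability simplex."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    for t in range(iters):
        M = X.T @ (w[:, None] * X)
        Minv = np.linalg.inv(M)
        # gradient of log det w.r.t. w_i is x_i^T M^{-1} x_i
        g = np.einsum("ij,jk,ik->i", X, Minv, X)
        i = g.argmax()            # linear maximization oracle: a simplex vertex
        step = 1.0 / (t + 2)
        w = (1 - step) * w
        w[i] += step              # sparse update: only one coordinate moves
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
w = d_optimal_fw(X)
assert abs(w.sum() - 1.0) < 1e-8 and (w >= 0).all()
```

Each iteration touches a single vertex, which is the sparse-update structure the abstract's complexity bound relies on.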
Paperid:1340
Authors:Oliver Broadrick · Sanyam Agarwal · Markus Bläser · Guy Van den Broeck
Abstract:
Marginalization -- summing a function over all assignments to a subset of its inputs -- is a fundamental computational problem with applications from probabilistic inference to formal verification. Despite its computational hardness in general, there exist many classes of functions (e.g., probabilistic models) for which marginalization remains tractable, and they can all be commonly expressed by arithmetic circuits computing multilinear polynomials. This raises the question, can *all* functions with polynomial time marginalization algorithms be succinctly expressed by such circuits? We give a negative answer, exhibiting simple functions with tractable marginalization yet no efficient representation by known models, assuming $\mathsf{FP} \neq \#\mathsf{P}$ (an assumption implied by $\mathsf{P} \neq \mathsf{NP}$). To this end, we identify a hierarchy of complexity classes corresponding to stronger forms of marginalization, all of which are efficiently computable on the known circuit models. We conclude with a completeness result, showing that whenever there is an efficient real RAM performing virtual evidence marginalization for a function, then there are small arithmetic circuits for that function's multilinear representation.
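The tractability the abstract refers to can be seen in miniature: for a multilinear network polynomial with indicator leaves for a variable, setting both of that variable's indicators to 1 sums the variable out in a single evaluation. The polynomial below is a hypothetical two-leaf example, not one from the paper.

```python
# Hypothetical network polynomial over x1 and a binary variable X2 with
# indicator leaves l0, l1: f = x1 * (0.4*l0 + 0.6*l1).
poly = lambda x1, l0, l1: x1 * (0.4 * l0 + 0.6 * l1)

# Brute-force marginalization of X2: one evaluation per assignment...
brute = poly(1.0, 1, 0) + poly(1.0, 0, 1)
# ...but multilinearity lets a single pass with l0 = l1 = 1 do the same.
single = poly(1.0, 1, 1)
assert single == brute
```

The paper's negative result says some tractably-marginalizable functions admit no small circuit of this multilinear form.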
Paperid:1341
Authors:Yuya Hikima · Hiroshi Sawada · Akinori Fujino
Abstract:
In this study, we tackle an optimization problem with a known function and an unknown decision-dependent distribution, which arises in a variety of applications. To solve the problem, several zeroth-order methods have been developed because the gradient of the objective function cannot be obtained explicitly due to the unknown distribution. Although these methods have theoretical convergence, they cannot utilize the information on the known function, which limits their efficiency in reducing the objective value. To overcome this issue, we propose new zeroth-order methods that generate effective update directions by utilizing information on the known function. As theoretical results, we show the convergence of our methods to stationary points and provide the worst-case sample complexity analysis. Our simulation experiments on multiple applications show that our methods output solutions with lower objective values than the existing zeroth-order methods do.
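A generic two-point zeroth-order estimator, the kind of baseline the paper improves on by additionally exploiting the known part of the objective, looks like this. The objective is a toy stand-in for the expectation over the unknown decision-dependent distribution, and all names are invented.

```python
import numpy as np

def zo_step(x, obj, mu=1e-3, lr=0.05, rng=None):
    """Two-point zeroth-order update: estimate a directional derivative
    by finite differences along a random direction, then step against it."""
    u = rng.standard_normal(x.shape)
    g = (obj(x + mu * u) - obj(x - mu * u)) / (2 * mu) * u
    return x - lr * g

rng = np.random.default_rng(0)
obj = lambda x: float(np.sum((x - 1.0) ** 2))  # stands in for E_{z~D(x)} f(x, z)
x = np.zeros(3)
for _ in range(500):
    x = zo_step(x, obj, rng=rng)
assert obj(x) < 1e-3  # converges to the minimizer despite no explicit gradient
```

The paper's contribution is, roughly, to replace the fully black-box direction `u` with directions informed by the known function.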
Paperid:1342
Authors:Guy Ohayon · Hila Manor · Tomer Michaeli · Michael Elad
Abstract:
We present a novel generative approach based on Denoising Diffusion Models (DDMs), which produces high-quality image samples along with their losslessly compressed bit-stream representations. This is obtained by replacing the standard Gaussian noise sampling in the reverse diffusion with a selection of noise samples from pre-defined codebooks of fixed iid Gaussian vectors. Surprisingly, we find that our method, termed Denoising Diffusion Codebook Model (DDCM), retains sample quality and diversity of standard DDMs, even for extremely small codebooks. We leverage DDCM and pick the noises from the codebooks that best match a given image, converting our generative model into a highly effective lossy image codec achieving state-of-the-art perceptual image compression results. More generally, by setting other noise selection rules, we extend our compression method to any conditional image generation task (e.g., image restoration), where the generated images are produced jointly with their condensed bit-stream representations. Our work is accompanied by a mathematical interpretation of the proposed compressed conditional generation schemes, establishing a connection with score-based approximations of posterior samplers for the tasks considered.
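The codebook mechanism can be sketched in one dimension-agnostic toy step: instead of drawing fresh Gaussian noise, pick a codeword, and record its index as the bit-stream. The denoiser, step size, and selection rule below are crude stand-ins, not the paper's model; what the sketch does show is why replaying the recorded indices reconstructs the output losslessly.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, K = 16, 8
codebook = rng.standard_normal((K, dim))  # fixed i.i.d. Gaussian codebook
denoise = lambda x: 0.9 * x               # stand-in for a trained denoiser

def ddcm_step(x, target):
    """One toy reverse step: choose the codeword steering x toward the
    target image; the chosen index becomes part of the bit-stream."""
    x0 = denoise(x)
    cand = x0 + 0.1 * codebook
    idx = int(np.linalg.norm(cand - target, axis=-1).argmin())
    return x0 + 0.1 * codebook[idx], idx

target = rng.standard_normal(dim)
x_init = rng.standard_normal(dim)
x, bits = x_init.copy(), []
for _ in range(20):
    x, idx = ddcm_step(x, target)
    bits.append(idx)                      # log2(K) = 3 bits per step

# decoding: replay the recorded indices to reproduce x exactly
y = x_init.copy()
for idx in bits:
    y = denoise(y) + 0.1 * codebook[idx]
assert np.allclose(y, x)
```

The replay loop is the "losslessly compressed bit-stream" claim in miniature: the indices alone, plus the shared codebook and initialization, determine the output.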
Paperid:1343
Authors:Anas Jnini · Lorenzo Breschi · Flavio Vella
Abstract:
Divergence-free symmetric tensors (DFSTs) are fundamental in continuum mechanics, encoding conservation laws such as mass and momentum conservation. We introduce Riemann Tensor Neural Networks (RTNNs), a novel neural architecture that inherently satisfies the DFST condition to machine precision, providing a strong inductive bias for enforcing these conservation laws. We prove that RTNNs can approximate any sufficiently smooth DFST with arbitrary precision and demonstrate their effectiveness as surrogates for conservative PDEs, achieving improved accuracy across benchmarks. This work is the first to use DFSTs as an inductive bias in neural PDE surrogates and to explicitly enforce the conservation of both mass and momentum within a physics-constrained neural architecture.
Paperid:1344
Authors:Jingyi Cui · Hongwei Wen · Yisen Wang
Abstract:
Self-supervised contrastive learning has emerged as a powerful tool in machine learning and computer vision to learn meaningful representations from unlabeled data. Meanwhile, its empirical success has encouraged many theoretical studies to reveal the learning mechanisms. However, in the existing theoretical research, the role of data augmentation is still under-exploited, especially the effects of specific augmentation types. To fill this gap, we for the first time propose an augmentation-aware error bound for self-supervised contrastive learning, showing that the supervised risk is bounded not only by the unsupervised risk, but also explicitly by a trade-off induced by data augmentation. Then, under a novel semantic label assumption, we discuss how certain augmentation methods affect the error bound. Lastly, we conduct both pixel- and representation-level experiments to verify our proposed theoretical results.
Paperid:1345
Authors:Haoqi Wu · Wei Dai · Wang Li · Qiang Yan
Abstract:
Large Language Models (LLMs) have gained significant popularity due to their remarkable capabilities in text understanding and generation. However, despite their widespread deployment in inference services such as ChatGPT, concerns about the potential leakage of sensitive user data have arisen. Existing solutions primarily rely on privacy-enhancing technologies to mitigate such risks, facing the trade-off among efficiency, privacy, and utility. To narrow this gap, we propose Cape, a context-aware prompt perturbation mechanism based on differential privacy, to enable efficient inference with an improved privacy-utility trade-off. Concretely, we introduce a hybrid utility function that better captures the token similarity. Additionally, we propose a bucketized sampling mechanism to handle the large sampling space, which might otherwise lead to long-tail phenomena. Extensive experiments across multiple datasets, along with ablation studies, demonstrate that Cape achieves a better privacy-utility trade-off compared to prior state-of-the-art works.
Paperid:1346
Authors:Zeyu Wang · Yao-Hui Li · Xin Li · Hongyu Zang · Romain Laroche · Riashat Islam
Abstract:
Multi-View Reinforcement Learning (MVRL) seeks to provide agents with multi-view observations, enabling them to perceive the environment with greater effectiveness and precision. Recent advancements in MVRL focus on extracting latent representations from multi-view observations and leveraging them in control tasks. However, it is not straightforward to learn compact and task-relevant representations, particularly in the presence of redundancy, distracting information, or missing views. In this paper, we propose Multi-view Fusion State for Control (MFSC), the first to incorporate bisimulation metric learning into MVRL to learn task-relevant representations. Furthermore, we propose a multi-view-based mask and latent reconstruction auxiliary task that exploits shared information across views and improves MFSC’s robustness in missing views by introducing a mask token. Extensive experimental results demonstrate that our method outperforms existing approaches in MVRL tasks. Even in more realistic scenarios with interference or missing views, MFSC consistently maintains high performance. The project code is available in the supplementary material.
Paperid:1347
Authors:Junchao Zhou · Yao Lu · Jie Wen · Guangming Lu
Abstract:
Image steganography hides multiple images for multiple recipients into a single cover image. All secret images are usually revealed without authentication, which reduces security among multiple recipients. An elegant solution is to design an authentication mechanism for isolated reception. We explore such a mechanism through sufficient experiments, and uncover that additional authentication information will affect the distribution of hidden information and occupy more hiding space of the cover image. This severely decreases effectiveness and efficiency in large-capacity hiding. To overcome such a challenge, we first prove the authentication feasibility within image steganography. Then, this paper proposes an image steganography network collaborating with separate authentication and an efficient scheme. Specifically, multiple pairs of lock-key are generated during hiding and revealing. Unlike traditional methods, our method has two stages to make appropriate distribution adaptation between locks and secret images, simultaneously extracting more reasonable primary information from secret images, which can release hiding space of the cover image to some extent. Furthermore, due to separate authentication, fused information can be hidden in parallel with a single network rather than traditional serial hiding with multiple networks, which can largely decrease the model size. Extensive experiments demonstrate that the proposed method achieves more secure, effective, and efficient image steganography.
Paperid:1348
Authors:Kumar Ashutosh · Yossi Gandelsman · Xinlei Chen · Ishan Misra · Rohit Girdhar
Abstract:
We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach to imbue multimodal capabilities into your favorite LLM. Leveraging their innate ability to perform multi-step reasoning, MILS prompts the LLM to generate candidate outputs, each of which is scored and fed back iteratively, eventually generating a solution to the task. This enables various applications that typically require training specialized models on task-specific data. In particular, we establish a new state-of-the-art on emergent zero-shot image, video and audio captioning. MILS seamlessly applies to media generation as well, discovering prompt rewrites to improve text-to-image generation, and even edit prompts for style transfer! Finally, being a gradient-free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross-modal arithmetic.
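The generate-score-feedback loop is the whole method. A toy stand-in, with a random string mutator playing the LLM "generator" and character match playing the "scorer", shows the gradient-free structure; none of this is the paper's code.

```python
import random

def mils_loop(propose, score, init, steps=200):
    """Toy MILS-style loop: propose candidates, score them, feed the
    best back as the next context. No gradients, no training."""
    best = init
    for _ in range(steps):
        cands = propose(best)
        cands.append(best)          # keep the incumbent
        best = max(cands, key=score)
    return best

random.seed(0)
target = "zero shot"
alphabet = "abcdefghijklmnopqrstuvwxyz "
def propose(s):                     # stand-in "generator": mutate one char
    out = []
    for _ in range(8):
        i = random.randrange(len(s))
        out.append(s[:i] + random.choice(alphabet) + s[i + 1:])
    return out
score = lambda s: sum(a == b for a, b in zip(s, target))  # stand-in "scorer"

init = " " * len(target)
best = mils_loop(propose, score, init)
assert score(best) > score(init)    # the loop improves without any gradient
```

In the actual system, the proposer is an LLM and the scorer is, e.g., a CLIP-style similarity to an image; the loop structure is unchanged.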
Paperid:1349
Authors:Lin Zhu · Xiantao Ma · Xiao Wang · Lizhi Wang · Hua Huang
Abstract:
Event cameras are innovative sensors that capture brightness changes as asynchronous events rather than traditional intensity frames. These cameras offer substantial advantages over conventional cameras, including high temporal resolution, high dynamic range, and the elimination of motion blur. However, defocus blur, a common image quality degradation resulting from out-of-focus lenses, complicates the challenge of event-based imaging. Due to the unique imaging mechanism of event cameras, existing focusing algorithms struggle to operate efficiently on sparse event data. In this work, we propose EvFocus, a novel architecture designed to reconstruct sharp images from defocus event streams for the first time. Our work includes the development of an event-based out-of-focus camera model and a simulator to generate realistic defocus event streams for robust training and testing. EvFocus integrates a temporal information encoder, a blur-aware two-branch decoder, and a reconstruction and re-defocus module to effectively learn and correct defocus blur. Extensive experiments on both simulated and real-world datasets demonstrate that EvFocus outperforms existing methods across varying lighting conditions and blur sizes, proving its robustness and practical applicability in event-based defocus imaging.
Paperid:1350
Authors:Kihyuk Hong · Ambuj Tewari
Abstract:
We study reinforcement learning in infinite-horizon average-reward settings with linear MDPs. Previous work addresses this problem by approximating the average-reward setting by the discounted setting and employing a value iteration-based algorithm that uses clipping to constrain the span of the value function for improved statistical efficiency. However, the clipping procedure requires computing the minimum of the value function over the entire state space, which is prohibitive since the state space in the linear MDP setting can be large or even infinite. In this paper, we introduce a value iteration method with an efficient clipping operation that only requires computing the minimum of value functions over the set of states visited by the algorithm. Our algorithm enjoys the same regret bound as the previous work while being computationally efficient, with computational complexity that is independent of the size of the state space.
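The computational point can be seen in a toy sketch: clip the value function at (minimum over visited states) + span, so the cost is O(|visited|) rather than O(|S|). The numbers and function name are illustrative only.

```python
def clipped_values(v, visited, span):
    """Clip a value function's span using only the states the
    algorithm has actually visited, not the whole state space."""
    floor = min(v[s] for s in visited)   # O(|visited|), not O(|S|)
    return {s: min(v[s], floor + span) for s in v}

v = {s: float(s) for s in range(1000)}  # a large (toy) state space
visited = [3, 17, 42]
cv = clipped_values(v, visited, span=10.0)
assert max(cv.values()) == 13.0         # clipped at min(visited) + span
```

The statistical argument in the paper is that clipping against the visited-state minimum suffices for the same regret bound.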
Paperid:1351
Authors:Mohamad Chehade · Soumya Suvra Ghosal · Souradip Chakraborty · Avinash Reddy · Dinesh Manocha · Amrit Singh Bedi · Hao Zhu
Abstract:
Aligning large language models (LLMs) with human preferences is essential for their safe and effective deployment. Existing methods typically maximize multiple rewards reflecting human preferences, often framed as a multi-objective optimization problem. However, research on bounded rationality suggests that human decision-making follows satisficing strategies—maximizing key objectives while ensuring others meet acceptable thresholds. This aspect is largely overlooked in alignment research. To address this, we introduce UAMD, a user-specified multi-criteria alignment framework, allowing users to set individualized thresholds. Since this personalization complicates training-time alignment, we propose an inference-time alignment method that enforces user-specified thresholds without fine-tuning. We provide a theoretical analysis of our proposed approach and derive suboptimality upper bounds. We empirically validate the performance of our proposed method through experimentation on multiple benchmarks. For instance, on the PKU-SafeRLHF dataset with the primary objective of maximizing helpfulness while ensuring a threshold on harmlessness, UAMD outperforms the state-of-the-art multi-objective decoding strategy by a margin of $22.3$% in terms of GPT-4 win-tie rate for helpfulness reward while adhering to the user-specified threshold on harmlessness.
Paperid:1352
Authors:Chengxing Jia · Ziniu Li · Yi-Chen Li · Pengyuan Wang · Zhenyu Hou · Yuxiao Dong · Yang Yu
Abstract:
Adapting Large Language Models (LLMs) to downstream tasks using Reinforcement Learning (RL) has proven to be an effective approach. However, LLMs do not inherently define the structure of an agent for RL training, particularly in terms of defining the action space. Existing frameworks often rely on token-level actions that may be overly large and inefficient. To address this limitation, we propose learning a compact and latent action space to improve controllability and exploration in RL. Inspired by the literature on "RL from observations only", we propose **L**atent **A**ction governed world **M**odel from **P**re-trained LLM (LAMP), which augments a latent action space with a pre-trained LLM to form a latent action language world model. This latent action model is extensively trained from the pre-training dataset and tuned in the post-training stage. Experiments with Llama-3.1-8B as the base model demonstrate that using this latent action model for RL training achieves better controllability on multiple tasks, including a $64.0\%$ average win-rate across multiple preference alignment tasks and an $11.0\%$ improvement in the math reasoning task, as well as improved and more flexible search compared to the token-level action framework. In addition to these performance gains, we also find that controlling LLMs with latent actions mitigates forgetting and maintains general abilities well.
Paperid:1353
Authors:Luise Ge · Michael Lanier · Anindya Sarkar · Bengisu Guresti · Yevgeniy Vorobeychik · Chongjie Zhang
Abstract:
Many dynamic decision problems, such as robotic control, involve a series of tasks, many of which are unknown at training time. Typical approaches for these problems, such as multi-task and meta-reinforcement learning, do not generalize well when the tasks are diverse. On the other hand, approaches that aim to tackle task diversity, such as using task embeddings as policy context and task clustering, typically lack performance guarantees and require a large number of training tasks. To address these challenges, we propose a novel approach for learning a policy committee that includes at least one near-optimal policy with high probability for tasks encountered during execution. While we show that this problem is in general inapproximable, we present two practical algorithmic solutions. The first yields provable approximation and task sample complexity guarantees when tasks are low-dimensional (the best we can do due to inapproximability), whereas the second is a general and practical gradient-based approach. In addition, we provide a provable sample complexity bound for few-shot learning. Our experiments on MuJoCo and Meta-World show that the proposed approach outperforms state-of-the-art multi-task, meta-learning, and task-clustering baselines in training, generalization, and few-shot learning, often by a large margin.
Paperid:1354
Authors:Hyunseok Lee · Seunghyuk Oh · Jihoon Tack · Jaehyung Kim · Jinwoo Shin
Abstract:
Self-awareness, i.e., the ability to assess and correct one's own generation, is a fundamental aspect of human intelligence, making its replication in large language models (LLMs) an important yet challenging task. Previous works tackle this by employing extensive reinforcement learning or by relying on large external verifiers. In this work, we propose Refine via Intrinsic Self-Verification (ReVISE), an efficient and effective framework that enables LLMs to self-correct their outputs through self-verification. The core idea of ReVISE is to enable LLMs to verify their reasoning processes and continually rethink reasoning trajectories based on this verification. We introduce a structured curriculum based upon online preference learning to implement this efficiently. Specifically, as ReVISE involves two challenging tasks (i.e., self-verification and reasoning correction), we tackle each task sequentially using curriculum learning, collecting both failed and successful reasoning paths to construct preference pairs for efficient training. During inference, our approach enjoys natural test-time scaling by integrating self-verification and correction capabilities, further enhanced by our proposed confidence-aware decoding mechanism. Our experiments on various reasoning tasks demonstrate that ReVISE achieves efficient self-correction and significantly improves reasoning performance.
Paperid:1355
Authors:David Heurtel-Depeiges · Anian Ruoss · Joel Veness · Tim Genewein
Abstract:
Foundation models are strong data compressors, but when accounting for their parameter size, their compression ratios are inferior to standard compression algorithms. Naively reducing the parameter count does not necessarily help, as it deteriorates predictions and, accordingly, compression. We conduct a large-scale empirical study to find a sweet spot where pre-trained vanilla transformers can achieve competitive compression ratios. To this end, we train models on 165GB of raw byte sequences of either text, image, or audio data (and all possible combinations of the three) and then compress 1GB of out-of-distribution (OOD) data from each modality. We find that relatively small models (millions of parameters) can outperform standard general-purpose compression algorithms (gzip, LZMA2) and even domain-specific compressors (PNG, JPEG 2000, FLAC) — even when accounting for parameter size. We achieve, e.g., the lowest compression ratio of 0.49 on OOD audio data (vs. 0.54 for FLAC). We conduct extensive ablations and hyperparameter sweeps to study the impact of model and dataset scale, and we investigate the effect of unimodal versus multimodal training. We find that even small models can be trained to perform well on multiple modalities, but unlike large-scale foundation models, transfer to unseen modalities is generally weak.
Paperid:1356
Authors:Onno Eberhard · Michael Muehlebach · Claire Vernade
Abstract:
Partially observable environments present a considerable computational challenge in reinforcement learning due to the need to consider long histories. Learning with windows of $m$ observations quickly becomes intractable as $m$ grows. In this work, we introduce _memory traces_. Inspired by eligibility traces, these are compact representations of the history of observations in the form of exponential moving averages. We prove sample complexity bounds for the problem of offline on-policy evaluation that quantify the value errors achieved with memory traces for the general class of Lipschitz continuous value estimates. We find that there is a close connection to the window approach, but also demonstrate that, in certain environments, learning with memory traces is significantly more sample efficient. Finally, we empirically establish the effectiveness of memory traces in online reinforcement learning for both value prediction and control.
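The memory-trace idea above can be sketched as a simple exponential moving average over observation vectors; the decay parameter `lam`, the function name, and the plain-list representation are illustrative assumptions, not the paper's implementation.

```python
def update_trace(trace, obs, lam=0.9):
    """Exponential moving average of observations (a 'memory trace').

    Older observations decay geometrically with factor lam, so the trace
    is a fixed-size summary of the entire observation history.
    """
    return [lam * t + (1.0 - lam) * o for t, o in zip(trace, obs)]

# A length-3 observation vector, updated over a short history.
trace = [0.0, 0.0, 0.0]
for obs in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]):
    trace = update_trace(trace, obs, lam=0.5)
print(trace)  # [0.25, 0.5, 0.0]
```

Unlike an $m$-step window, the trace's size is independent of how much history it summarizes, which is the source of the efficiency contrast drawn in the abstract.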
Paperid:1357
Authors:Yiding Lu · Mouxing Yang · Dezhong Peng · Peng Hu · Yijie Lin · Xi Peng
Abstract:
Traditional text-based person ReID assumes that person descriptions from witnesses are complete and provided at once. However, in real-world scenarios, such descriptions are often partial or vague. To address this limitation, we introduce a new task called interactive person re-identification (I-ReID). I-ReID is a dialogue-based retrieval task that iteratively refines initial descriptions through ongoing interactions with the witnesses. To facilitate the study of this new task, we construct a dialogue dataset that incorporates multiple types of questions by decomposing fine-grained attributes of individuals. We further propose LLaVA-ReID, a question model that generates targeted questions based on visual and textual contexts to elicit additional details about the target person. Leveraging a looking-forward strategy, we prioritize the most informative questions as supervision during training. Experimental results on both I-ReID and text-based ReID benchmarks demonstrate that LLaVA-ReID significantly outperforms baselines. The datasets and code will be publicly available on GitHub upon acceptance.
Paperid:1358
Authors:Seongho Son · William Bankes · Sayak Ray Chowdhury · Brooks Paige · Ilija Bogunovic
Abstract:
Current Large Language Model (LLM) preference optimization algorithms do not account for temporal preference drift, which can lead to severe misalignment. To address this limitation, we propose Non-Stationary Direct Preference Optimisation (NS-DPO), which models time-dependent reward functions with a Dynamic Bradley-Terry model. NS-DPO offers a computationally efficient solution by introducing only a single discount parameter in the loss function, used for exponential weighting that proportionally focuses learning on more time-relevant datapoints. We theoretically analyse the convergence of NS-DPO in a general setting where the exact timing of preference changes is unknown, providing upper bounds on the estimation error and the regret caused by non-stationary preferences. Finally, we demonstrate the effectiveness of NS-DPO for fine-tuning LLMs under drifting preferences. Using scenarios where various levels of preference drift are introduced with popular LLM reward models and datasets, we show that NS-DPO fine-tuned LLMs remain robust under non-stationarity, significantly outperforming baseline algorithms that ignore temporal preference changes, without sacrificing performance in stationary cases.
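The single-discount-parameter weighting described above can be illustrated with a minimal sketch: a datapoint collected at time `t` receives weight `gamma ** (T - t)`, so recent preferences dominate the aggregate loss. The function name and per-example aggregation below are illustrative assumptions, not the NS-DPO loss itself.

```python
def discounted_loss(per_example_losses, timestamps, gamma=0.9):
    """Exponentially down-weight older datapoints with one discount parameter.

    The most recent timestamp T gets weight 1; a point from k steps
    earlier gets weight gamma**k.
    """
    T = max(timestamps)
    weights = [gamma ** (T - t) for t in timestamps]
    total = sum(w * l for w, l in zip(weights, per_example_losses))
    return total / sum(weights)

# The recent example (t=2) dominates the older one (t=0).
loss = discounted_loss([1.0, 0.0], timestamps=[0, 2], gamma=0.5)
print(loss)  # 0.2
```

Setting `gamma=1.0` recovers an ordinary unweighted average, i.e., the stationary case.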
Paperid:1359
Authors:Attila Szász · Balázs Bánhelyi · Mark Jelasity
Abstract:
The ultimate goal of verification is to guarantee the safety of deployed neural networks. Here, we claim that all the state-of-the-art verifiers we are aware of fail to reach this goal. Our key insight is that theoretical soundness (bounding the full-precision output while computing with floating point) does not imply practical soundness (bounding the floating point output in a potentially stochastic environment). We prove this observation for the approaches that are currently used to achieve provable theoretical soundness, such as interval analysis and its variants. We also argue that achieving practical soundness is significantly harder computationally. We support our claims empirically as well by evaluating several well-known verification methods. To mislead the verifiers, we create adversarial networks that detect and exploit features of the deployment environment, such as the order and precision of floating point operations. We demonstrate that all the tested verifiers are vulnerable to our new deployment-specific attacks, which proves that they are not practically sound.
Paperid:1360
Authors:Yucheng Hu · Yanjiang Guo · Pengchao Wang · Xiaoyu Chen · Yen-Jen Wang · Jianke Zhang · Koushil Sreenath · Chaochao Lu · Jianyu Chen
Abstract:
Visual representations play a crucial role in developing generalist robotic policies. Previous vision encoders, typically pretrained with single-image reconstruction or two-image contrastive learning, tend to capture static information, often neglecting the dynamic aspects vital for embodied tasks. Recently, video diffusion models (VDMs) have demonstrated the ability to predict future frames and showcase a strong understanding of the physical world. We hypothesize that VDMs inherently produce visual representations that encompass both current static information and predicted future dynamics, thereby providing valuable guidance for robot action learning. Based on this hypothesis, we propose the Video Prediction Policy (VPP), which learns an implicit inverse dynamics model conditioned on predicted future representations inside VDMs. To predict a more precise future, we fine-tune a pre-trained video foundation model on robot datasets along with internet human manipulation data. In experiments, VPP achieves a 41.5\% relative improvement on the Calvin ABC-D generalization benchmark compared to the previous state-of-the-art, and demonstrates a 31.6\% increase in success rates for complex real-world dexterous manipulation tasks. For your convenience, videos can be found at https://video-prediction-policy.github.io/
Paperid:1361
Authors:Jingyu Liu · Beidi Chen · Ce Zhang
Abstract:
Improving time-to-first-token (TTFT) is an essential objective in modern large language model (LLM) inference engines, because optimizing TTFT directly results in higher maximal QPS and meets the requirements of many critical applications. However, improving TTFT is notoriously challenging, since it is purely compute-bound and the performance bottleneck shifts from the self-attention that many prior works focus on to the MLP part. In this work, we present SpecPrefill, a training-free framework that accelerates inference TTFT for both long and medium context queries based on the following insight: LLMs are generalized enough to still preserve quality given only a carefully chosen subset of prompt tokens. At its core, SpecPrefill leverages a lightweight model to speculate on locally important tokens based on the context. These tokens, along with the necessary positional information, are then sent to the main model for processing. We evaluate SpecPrefill on a diverse set of tasks, followed by comprehensive benchmarking of the performance improvement both in a real end-to-end setting and in ablation studies. SpecPrefill manages to serve Llama-3.1-405B-Instruct-FP8 with up to 7× higher maximal end-to-end QPS on real downstream tasks and a 7.66× TTFT improvement during benchmarking.
Paperid:1362
Authors:Emile Pierret · Bruno Galerne
Abstract:
Diffusion or score-based models recently showed high performance in image generation. They rely on a forward and a backward stochastic differential equation (SDE). The sampling of a data distribution is achieved by numerically solving the backward SDE or its associated flow ODE. Studying the convergence of these models requires controlling four different types of error: the initialization error, the truncation error, the discretization error, and the score approximation error. In this paper, we theoretically study the behavior of diffusion models and their numerical implementation when the data distribution is Gaussian. Our first contribution is to derive the analytical solutions of the backward SDE and the probability flow ODE and to prove that these solutions and their discretizations are all Gaussian processes. Our second contribution is to compute the exact Wasserstein errors between the target and the numerically sampled distributions for any numerical scheme. This allows us to monitor convergence directly in the data space, while experimental works limit their empirical analysis to Inception features.
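For intuition about why Wasserstein errors are exactly computable in the Gaussian case, recall the classical closed form for one-dimensional Gaussians: $W_2\big(\mathcal{N}(m_1, s_1^2), \mathcal{N}(m_2, s_2^2)\big)^2 = (m_1 - m_2)^2 + (s_1 - s_2)^2$. The snippet below is a generic illustration of this textbook identity, not the paper's multivariate derivation.

```python
import math

def w2_gaussian_1d(m1, s1, m2, s2):
    """Exact 2-Wasserstein distance between two 1-D Gaussians
    N(m1, s1^2) and N(m2, s2^2)."""
    return math.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)

# Identical distributions are at distance zero; a pure mean shift
# contributes exactly |m1 - m2|.
print(w2_gaussian_1d(0.0, 1.0, 0.0, 1.0))  # 0.0
print(w2_gaussian_1d(0.0, 1.0, 3.0, 1.0))  # 3.0
```

Because sampled and target distributions both stay Gaussian in this setting, such closed forms let convergence be tracked directly in data space rather than through feature-space proxies.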
Paperid:1363
Authors:Hui Lan · Chang · Eleanor W Dillon · Vasilis Syrgkanis
Abstract:
We address the problem of estimating heterogeneous treatment effects in panel data, adopting the popular Difference-in-Differences (DiD) framework under the conditional parallel trends assumption. We propose a novel doubly robust meta-learner for the Conditional Average Treatment Effect on the Treated (CATT), reducing the estimation to a convex risk minimization problem involving a set of auxiliary models. Our framework allows for flexible estimation of the CATT when conditioning on any subset of variables of interest using generic machine learning. Leveraging Neyman orthogonality, our proposed approach is robust to estimation errors in the auxiliary models. As a generalization of our main result, we develop a meta-learning approach for the estimation of general conditional functionals under covariate shift. We also provide an extension to the instrumented DiD setting with non-compliance. Empirical results demonstrate the superiority of our approach over existing baselines.
Paperid:1364
Authors:Dasol Hong · Wooju Lee · Hyun Myung
Abstract:
Prompt tuning, which adapts vision-language models by freezing model parameters and optimizing only the prompt, has proven effective for task-specific adaptation. The core challenge in prompt tuning is improving specialization for a specific task and generalization to unseen domains. However, frozen encoders often produce misaligned features, leading to confusion between classes and limiting specialization. To overcome this issue, we propose a confusion-aware loss (CoA-loss) that improves specialization by refining the decision boundaries between confusing classes. Additionally, we mathematically demonstrate that a mixture model can enhance generalization without compromising specialization. This is achieved using confidence-aware temperature (CoA-temp), which adjusts the weight of each prediction in the mixture model based on its confidence within the class domains. Extensive experiments show that CoCoA-Mix, a mixture model with CoA-loss and CoA-temp, outperforms state-of-the-art methods by enhancing both specialization and generalization.
Paperid:1365
Authors:Chejian Xu · Mintong Kang · Jiawei Zhang · Zeyi Liao · Lingbo Mo · Mengqi Yuan · Huan Sun · Bo Li
Abstract:
Foundation model-based agents are increasingly used to automate complex tasks, enhancing efficiency and productivity. However, their access to sensitive resources and autonomous decision-making also introduce significant security risks, where successful attacks could lead to severe consequences. To systematically uncover these vulnerabilities, we propose AdvAgent, a black-box red-teaming framework for attacking web agents. Unlike existing approaches, AdvAgent employs a reinforcement learning-based pipeline to train an adversarial prompter model that optimizes adversarial prompts using feedback from the black-box agent. With careful attack design, these prompts effectively exploit agent weaknesses while maintaining stealthiness and controllability. Extensive evaluations demonstrate that AdvAgent achieves high success rates against state-of-the-art GPT-4-based web agents across diverse web tasks. Furthermore, we find that existing prompt-based defenses provide only limited protection, leaving agents vulnerable to our framework. These findings highlight critical vulnerabilities in current web agents and emphasize the urgent need for stronger defense mechanisms.
Paperid:1366
Authors:Tian Li · Tianyi Zhou · Jeff Bilmes
Abstract:
Sharpness-Aware Minimization (SAM) has been demonstrated to improve the generalization performance of overparameterized models by seeking flat minima on the loss landscape through optimizing model parameters that incur the largest loss within a neighborhood. Nevertheless, such min-max formulations are computationally challenging, especially when the problem is highly non-convex. Additionally, focusing only on the worst-case local solution while ignoring potentially many other local solutions may be suboptimal when searching for flat minima. In this work, we propose Tilted SAM (TSAM), a smoothed generalization of SAM inspired by exponential tilting that effectively assigns higher priority to local solutions that incur larger losses. TSAM is parameterized by a tilt hyperparameter $t$ and reduces to SAM as $t$ approaches infinity. We show that TSAM is smoother than SAM and thus easier to optimize, and that it explicitly favors flatter minima. We develop algorithms motivated by the discretization of Hamiltonian dynamics to solve TSAM. Empirically, TSAM arrives at flatter local minima and yields superior test performance compared to the SAM and ERM baselines across a range of image and text tasks.
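Exponential tilting as invoked above can be sketched generically: instead of taking the maximum loss over sampled neighborhood perturbations (as SAM's inner problem does), aggregate them with a tilted log-mean-exp; as the tilt parameter grows, the tilted value approaches the maximum. The function name and the explicit loss list below are illustrative assumptions, not the paper's algorithm.

```python
import math

def tilted_value(losses, t):
    """Exponentially tilted aggregate: (1/t) * log(mean(exp(t * l))).

    Smoothly interpolates toward max(losses) as t -> infinity,
    while remaining differentiable for finite t.
    """
    m = max(losses)  # subtract the max for numerical stability
    return m + math.log(
        sum(math.exp(t * (l - m)) for l in losses) / len(losses)
    ) / t

losses = [0.1, 0.5, 2.0]
print(tilted_value(losses, t=1.0))   # between the mean and the max
print(tilted_value(losses, t=50.0))  # approaches max(losses) = 2.0
```

The tilted objective thus up-weights high-loss perturbations without discarding the rest, which is the smoothing contrast with the pure worst-case formulation.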
Paperid:1367
Authors:Hao-Zhe Tan · Zhi Zhou · Lan-Zhe Guo · Yu-Feng Li
Abstract:
Pre-trained Vision-Language Models (VLMs) are becoming increasingly popular across various visual tasks, and several open-sourced VLM variants have been released. However, selecting the best-performing pre-trained VLM for a specific downstream task is challenging, since no single VLM can achieve promising performance on all downstream tasks, and evaluating all available VLMs is impossible due to time and data limitations. To address this problem, this paper proposes a novel paradigm to select and reuse VLMs for downstream tasks, called Model Label Learning (MLL). The proposal contains three key modules: \emph{model labeling}, which assigns labels to each VLM to describe their specialty and utility; \emph{model selection}, which matches the requirements of the target task with model labels; and \emph{model reuse}, which applies selected VLMs to the target task in an ensemble manner. The proposal is highly computationally efficient and growable, since the model labeling process is completed independently of the target task and its capability can grow with the number of candidate VLMs. We also introduce a new benchmark for evaluating VLM selection methods, including 49 VLMs and 17 target task datasets. Experimental results clearly demonstrate the effectiveness of the proposed method for selecting and reusing VLMs.
Paperid:1368
Authors:Joshua Kazdan · Rylan Schaeffer · Apratim Dey · Matthias Gerstgrasser · Rafael Rafailov · David Donoho · Sanmi Koyejo
Abstract:
What happens when generative machine learning models are pretrained on web-scale datasets containing data generated by earlier models? Some prior work warns of “model collapse” as the web is overwhelmed by synthetic data; other work suggests the problem can be contained (i.e., collapse can be avoided) by managing how available data are used in pretraining. In this paper, we report experiments on three ways of using data (training-workflows), across three generative model task-settings (multivariate Gaussian estimation, kernel density estimation, and language-model fine-tuning), to further confirm the possibility of containment: (a) we confirm that the training-workflow of {\it replacing} all real data by successive generations of purely synthetic data indeed suffers model collapse in all task-settings studied; (b) we consider the training-workflow of {\it accumulating} synthetic data alongside real data and training on all data combined, and confirm that, although the proportion of real data eventually becomes zero, models remain stable and their test losses do not diverge under this training-workflow; (c) we consider a training-workflow where real and synthetic data accumulate together but successive generations of pretraining are constrained to use fixed-size data subsets each generation. In this workflow, we observe slow and gradual rather than explosive degradation of test loss performance across generations. Our insights are particularly important when forecasting whether future frontier generative models will collapse or thrive, and our results open avenues for empirically and mathematically studying the context-dependent value of synthetic data.
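The replace-versus-accumulate workflows in (a) and (b) can be sketched as a toy version of the simplest task-setting above, univariate Gaussian estimation; the generation size, seed, and helper name are illustrative assumptions, not the paper's protocol.

```python
import random

def next_generation(data, workflow, n=1000, rng=random):
    """Fit a Gaussian to `data`, draw n synthetic samples from the fit,
    then either REPLACE the data or ACCUMULATE alongside it."""
    mu = sum(data) / len(data)
    var = sum((x - mu) ** 2 for x in data) / len(data)
    synthetic = [rng.gauss(mu, var ** 0.5) for _ in range(n)]
    return synthetic if workflow == "replace" else data + synthetic

rng = random.Random(0)
real = [rng.gauss(0.0, 1.0) for _ in range(1000)]
replaced, accumulated = real, real
for _ in range(5):
    replaced = next_generation(replaced, "replace", rng=rng)
    accumulated = next_generation(accumulated, "accumulate", rng=rng)
```

Under "replace", each generation fits only the previous generation's synthetic draws, so estimation error compounds; under "accumulate", the original real samples remain in every fit, anchoring the estimates.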
Paperid:1369
Authors:ERIC EATON · Marcel Hussing · Michael Kearns · Aaron Roth · Sikata Sengupta · Jessica Sorrell
Abstract:
In traditional reinforcement learning (RL), the learner aims to solve a single-objective optimization problem: find the policy that maximizes expected reward. However, in many real-world settings, it is important to optimize over multiple objectives simultaneously. For example, when we are interested in fairness, states might have feature annotations corresponding to multiple (intersecting) demographic groups to whom reward accrues, and our goal might be to maximize the reward of the group receiving the minimal reward. In this work, we consider a multi-objective optimization problem in which each objective is defined by a state-based reweighting of a single scalar reward function. This generalizes the problem of maximizing the reward of the minimum-reward group. We provide oracle-efficient algorithms to solve these multi-objective RL problems even when the number of objectives is very large --- for tabular MDPs, as well as for large MDPs when the group functions have additional structure. The contribution of this paper is that we are able to solve this class of multi-objective RL problems with a possibly exponentially large class of constraints over intersecting groups in both tabular and large state space MDPs in an oracle-efficient manner. Finally, we experimentally validate our theoretical results and demonstrate applications on a preferential attachment graph MDP.
Paperid:1370
Authors:Onat Dalmaz · Arjun Desai · Reinhard Heckel · Tolga Cukur · Akshay Chaudhari · Brian Hargreaves
Abstract:
Accelerated MRI reconstruction involves solving an ill-posed inverse problem where noise in acquired data propagates to the reconstructed images. Noise analyses are central to MRI reconstruction for providing an explicit measure of solution fidelity and for guiding the design and deployment of novel reconstruction methods. However, deep learning (DL)-based reconstruction methods have often overlooked noise propagation due to inherent analytical and computational challenges, despite its critical importance. This work proposes a theoretically grounded, memory-efficient technique to calculate voxel-wise variance for quantifying uncertainty due to acquisition noise in accelerated MRI reconstructions. Our approach is based on approximating the noise covariance using the DL network's Jacobian, which is intractable to calculate. To circumvent this, we derive an unbiased estimator for the diagonal of this covariance matrix (the voxel-wise variance) and introduce a Jacobian sketching technique to implement it efficiently. We evaluate our method on knee and brain MRI datasets for both data-driven and physics-driven networks trained in supervised and unsupervised manners. Compared to empirical references obtained via Monte-Carlo simulations, our technique achieves near-equivalent performance while reducing computational and memory demands by an order of magnitude or more. Furthermore, our method is robust across varying input noise levels, acceleration factors, and diverse undersampling schemes, highlighting its broad applicability. Our work reintroduces accurate and efficient noise analysis as a central tenet of reconstruction algorithms, holding promise to reshape how we evaluate and deploy DL-based MRI. Our code will be made publicly available upon acceptance.
Paperid:1371
Authors:Changshuo Liu · Lingze Zeng · Kaiping Zheng · Shaofeng Cai · Beng Chin Ooi · James Yip
Abstract:
Electronic health records (EHR) aggregate extensive data critical for advancing patient care and refining intervention strategies. EHR data is essential for epidemiological studies, more commonly referred to as cohort studies, where patients with shared characteristics or similar diseases are analyzed over time. Unfortunately, existing studies on cohort modeling are limited, struggling to derive fine-grained cohorts or effectively utilize cohort information, which hinders their ability to uncover intrinsic relationships between cohorts. To this end, we propose NeuralCohort, a cohort-aware neural representation learning method that precisely segments patients into finer-grained cohorts via an innovative cohort contextualization mechanism and captures both intra- and inter-cohort information using a Bi-scale Cohort Learning Module. Designed as a plug-in, NeuralCohort integrates seamlessly with existing backbone models, enhancing their cohort analysis capabilities by infusing deep cohort insights into the representation learning process. The effectiveness and generalizability of NeuralCohort are validated across extensive real-world EHR datasets. Experimental results demonstrate that NeuralCohort consistently improves the performance of various backbone models, achieving up to an 8.1% increase in AUROC.
Paperid:1372
Authors:Davide Adamo · Marco Corneli · Manon Vuillien · Emmanuelle Vila
Abstract:
Due to its invariance to rigid transformations such as rotations and reflections, Procrustes-Wasserstein (PW) was introduced in the literature as an optimal transport (OT) distance, alternative to Wasserstein and more suited to tasks such as the alignment and comparison of point clouds. With that application in mind, we carefully build a space of discrete probability measures and show that over that space PW actually is a distance. Algorithms to solve the PW problem already exist; however, we extend the PW framework by discussing and testing several initialization strategies. We then introduce the notion of a PW barycenter and detail an algorithm to estimate it from data. The result is a new method to compute representative shapes from a collection of point clouds. We benchmark our method against existing OT approaches, demonstrating superior performance in scenarios requiring precise alignment and shape preservation. We finally show the usefulness of PW barycenters in an archaeological context. Our results highlight the potential of PW in advancing 2D and 3D point cloud analysis for machine learning and computational geometry applications.
Paperid:1373
Authors:Alexander Wettig · Luca Soldaini · Kyle Lo · Sewon Min · Danqi Chen · Hannaneh Hajishirzi
Abstract:
Modern language models are trained on large, unstructured datasets consisting of trillions of tokens, obtained by crawling the web. This unstructured nature makes it difficult to reason about their contents and develop systematic approaches to data curation. In this paper, we unpack monolithic web corpora by developing taxonomies of their contents and organizing them into domains. We introduce WebOrganizer, a framework for organizing web pages in terms of both their topic and format. Using these two complementary notions of domains, we automatically annotate pretraining data by distilling annotations from a large language model into efficient classifiers. This allows us to study how data from different domains should be mixed to improve models on downstream tasks, and we show that we can combine insights about effective topics and formats to further boost performance. We demonstrate that our domain mixing also improves existing methods that select data based on quality. Furthermore, we study and compare how quality-based methods implicitly change the domain mixture. Overall, our work demonstrates that constructing and mixing domains provides a valuable complement to quality-based data curation methods, opening new avenues for effective and insightful pre-training data curation.
Paperid:1374
Authors:Ziyuan Zhou · Guanjun Liu · Mengchu Zhou · Guo
Abstract:
The performance of models trained by Multi-Agent Reinforcement Learning (MARL) is sensitive to perturbations in observations, lowering their trustworthiness in complex environments. Adversarial training is a valuable approach to enhancing their robustness. However, existing methods often overfit to adversarial perturbations of observations and fail to incorporate prior information about the policy adopted by their protagonist agent, i.e., the primary one being trained. To address this important issue, this paper introduces Adversarial Training with Stochastic Adversary (ATSA), where the proposed adversary is trained online alongside the protagonist agent. The adversary consists of a Stochastic Director (SDor) and an SDor-guided generaTor (STor). SDor performs policy perturbations by minimizing the expected team reward of protagonists and maximizing the entropy of its policy, while STor generates adversarial perturbations of observations by following SDor's guidance. We prove that SDor's soft policy converges to a global optimum according to factorized maximum-entropy MARL and leads to the optimal adversary. This paper also introduces an SDor-STor loss function to quantify the difference between a) perturbations in the agent's policy and b) those advised by SDor. We evaluate ATSA on StarCraft II tasks and autonomous driving scenarios, demonstrating that a) it is robust against diverse perturbations of observations while maintaining outstanding performance in perturbation-free environments, and b) it outperforms the state-of-the-art methods.
Paperid:1375
Authors:Halil Alperen Gozeten · Muhammed Emrullah Ildiz · Xuechen Zhang · Mahdi Soltanolkotabi · Marco Mondelli · Samet Oymak
Abstract:
Test-time training (TTT) methods explicitly update the weights of a model to adapt to the specific test instance, and they have found success in a variety of settings, including most recently language modeling and reasoning. To demystify this success, we investigate a gradient-based TTT algorithm for in-context learning, where we train a transformer model on the in-context demonstrations provided in the test prompt. Specifically, we provide a comprehensive theoretical characterization of linear transformers when the update rule is a single gradient step. Our theory (i) delineates the role of alignment between the pretraining distribution and the target task, (ii) demystifies how TTT can alleviate distribution shift, and (iii) quantifies the sample complexity of TTT, including how it can significantly reduce the eventual sample size required for in-context learning. As our empirical contribution, we study the benefits of TTT for TabPFN, a tabular foundation model. In line with our theory, we demonstrate that TTT significantly reduces the required sample size for tabular classification (3 to 5 times fewer), unlocking substantial inference efficiency with a negligible training cost.
Paperid:1376
Authors:Jen-Tse Huang · Jiaxu Zhou · Tailin Jin · Xuhui Zhou · Zixi Chen · Wenxuan Wang · Youliang Yuan · Michael Lyu · Maarten Sap
Abstract:
Large language model-based multi-agent systems have shown great abilities across various tasks due to the collaboration of expert agents, each focusing on a specific domain. However, the impact of clumsy or even malicious agents—those who frequently make errors in their tasks—on the overall performance of the system remains underexplored. This paper investigates: (1) What is the resilience of various system structures (e.g., A$\rightarrow$B$\rightarrow$C, A$\leftrightarrow$B$\leftrightarrow$C) under faulty agents, on different downstream tasks? (2) How can we increase system resilience to defend against these agents? To simulate faulty agents, we propose two approaches—AutoTransform and AutoInject—which introduce mistakes into the agents' responses. We select four downstream tasks, including code generation, math problems, translation, and text evaluation. Results suggest that the ``hierarchical'' structure, i.e., A$\rightarrow$(B$\leftrightarrow$C), exhibits superior resilience with the lowest performance drop of $9.2\%$, compared to $26.0\%$ and $31.2\%$ for the other two structures. Additionally, we improve system resilience with two methods: introducing a mechanism for each agent to challenge others' outputs, and adding an agent to review and correct messages. Our code and data are available in the supplementary materials and will be made publicly available upon publication.
Paperid:1377
Authors:Jan Schuchardt · Mina Dalirrooyfard · Jed Guzelkabaagac · Anderson Schneider · Yuriy Nevmyvaka · Stephan Günnemann
Abstract:
Many forms of sensitive data, such as web traffic, mobility data, or hospital occupancy, are inherently sequential. The standard method for training machine learning models while ensuring privacy for units of sensitive information, such as individual hospital visits, is differentially private stochastic gradient descent (DP-SGD). However, we observe in this work that the formal guarantees of DP-SGD are incompatible with time-series-specific tasks like forecasting, since they rely on the privacy amplification attained by training on small, unstructured batches sampled from an unstructured dataset. In contrast, batches for forecasting are generated by (1) sampling sequentially structured time series from a dataset, (2) sampling contiguous subsequences from these series, and (3) partitioning them into context and ground-truth forecast windows. We theoretically analyze the privacy amplification attained by this structured subsampling to enable the training of forecasting models with sound and tight event- and user-level privacy guarantees. Towards more private models, we additionally prove how data augmentation amplifies privacy in self-supervised training of sequence models. Our empirical evaluation demonstrates that amplification by structured subsampling enables the training of forecasting models with strong formal privacy guarantees.
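The three-step batch generation described above can be sketched as follows; the function name and argument layout are illustrative, and the DP accounting itself is not shown:

```python
import numpy as np

def forecasting_batch(dataset, batch_size, context_len, forecast_len, rng):
    L = context_len + forecast_len
    contexts, targets = [], []
    for _ in range(batch_size):
        series = dataset[rng.integers(len(dataset))]   # (1) sample a series
        start = rng.integers(len(series) - L + 1)      # (2) contiguous window
        window = series[start:start + L]
        contexts.append(window[:context_len])          # (3) context ...
        targets.append(window[context_len:])           #     ... and forecast
    return np.stack(contexts), np.stack(targets)

rng = np.random.default_rng(0)
data = [np.arange(50, dtype=float) for _ in range(4)]
ctx, tgt = forecasting_batch(data, batch_size=8, context_len=12,
                             forecast_len=4, rng=rng)
```

Unlike Poisson-subsampled DP-SGD batches, every element of such a batch is a structured slice of one series, which is what the paper's amplification analysis has to account for.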
Paperid:1378
Authors:Haonan Yuan · Qingyun Sun · Junhua Shi · Xingcheng Fu · Bryan Hooi · Jianxin Li · Philip Yu
Abstract:
Graph Foundation Models hold significant potential for advancing multi-domain graph learning, yet their full capabilities remain largely untapped. Existing works show promising task performance with the “pretrain-then-prompt” paradigm, which lacks theoretical foundations to understand why it works and how much knowledge can be transferred from source domains to the target. In this paper, we introduce BRIDGE, a novel Bounded gRaph pre-training and prompt learnIng framework from multi-Domains with Generalization Error. To learn discriminative source knowledge, we align multi-domain graph features with a set of domain-invariant aligners during pre-training. A lightweight Mixture of Experts (MoE) network is then proposed to facilitate downstream prompting through self-supervised selective knowledge assembly and transfer. Further, to determine the maximum amount of transferable knowledge, we derive an optimizable upper bound for the generalization error from a graph spectral perspective under Lipschitz continuity. Extensive experiments demonstrate the superiority of BRIDGE on downstream node and graph classifications compared with 15 state-of-the-art baselines.
Paperid:1379
Authors:Katie Kang · Amrith Setlur · Dibya Ghosh · Jacob Steinhardt · Claire Tomlin · Sergey Levine · Aviral Kumar
Abstract:
Modern large language models (LLMs) excel at fitting fine-tuning data, but often struggle on unseen examples. In order to teach models genuine reasoning abilities rather than superficial pattern matching, our work aims to better understand how the learning dynamics of LLM fine-tuning shape downstream generalization. Our analysis focuses on reasoning tasks, whose problem structure allows us to distinguish between memorization (the exact replication of reasoning steps from the training data) and performance (the correctness of the final solution). We find that a model's performance on test prompts can be effectively characterized by a training metric we call pre-memorization train accuracy: the accuracy of model samples on training queries before they begin to copy the exact reasoning steps from the training set. At the dataset level, this metric almost perfectly predicts test accuracy, achieving $R^2 \geq 0.9$ across various models (Llama3 8B, Gemma2 9B), datasets (GSM8k, MATH), and training configurations. At the per-example level, this metric is also indicative of whether individual model predictions are robust to perturbations in the training query. By connecting a model's learning dynamics to test performance, pre-memorization train accuracy can inform training decisions, such as the makeup of the training data. Our experiments on data curation show that prioritizing examples with low pre-memorization accuracy leads to 1.5-2x improvements in data efficiency compared to i.i.d. data scaling and other data scaling techniques.
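A toy illustration of the pre-memorization train accuracy metric, under the simplifying assumption that memorization is detected by exact string match of the reasoning steps (the flat record format is hypothetical, not the paper's):

```python
def pre_memorization_accuracy(records):
    # Each record: (generated_steps, reference_steps, is_correct) for one
    # training query. Samples that already copy the reference reasoning
    # verbatim count as memorized and are excluded from the metric.
    kept = [ok for gen, ref, ok in records if gen != ref]
    return sum(kept) / len(kept) if kept else 0.0

records = [
    ("2+2=4 so 8", "2+3=5 then 8", True),    # correct without copying
    ("2+3=5 then 8", "2+3=5 then 8", True),  # verbatim copy: excluded
    ("2+2=5 so 9", "2+3=5 then 8", False),   # wrong, not copied
]
acc = pre_memorization_accuracy(records)
```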
Paperid:1380
Authors:Naram Mhaisen · George Iosifidis
Abstract:
We revisit the Follow the Regularized Leader (FTRL) framework for Online Convex Optimization (OCO) over compact sets, focusing on achieving dynamic regret guarantees. Prior work has highlighted the framework’s limitations in dynamic environments due to its tendency to produce "lazy" iterates. However, building on insights showing FTRL's ability to produce "agile" iterates, we show that it can indeed recover known dynamic regret bounds through optimistic composition of future costs and careful linearization of past costs, which can lead to pruning some of them. This new analysis of FTRL against dynamic comparators leads to key results, including refined control over regret terms, optimism without cyclic dependence, and the application of minimal recursive regularization akin to AdaFTRL. More broadly, we show that it is not the "lazy" projection style of FTRL that hinders (optimistic) dynamic regret, but the decoupling of the algorithm’s state (linearized history) from its iterates, allowing the state to grow arbitrarily. Instead, pruning synchronizes these two when necessary.
Paperid:1381
Authors:Wenke Huang · Jian Liang · Guancheng Wan · Didi Zhu · He Li · Jiawei Shao · Mang Ye · Bo Du · Dacheng Tao
Abstract:
Fine-tuning Multimodal Large Language Models (MLLMs) in multi-task learning scenarios has emerged as an effective strategy for achieving cross-domain specialization. However, multi-task fine-tuning frequently induces performance degradation on open-response datasets. We posit that free-form answer generation primarily depends on language priors, and that strengthening the integration of visual behavioral cues is critical for enhancing prediction robustness. In this work, we propose Noise Resilient Confidence Alignment to address the challenge of open-response overfitting during multi-task fine-tuning. Our approach prioritizes maintaining consistent prediction patterns in MLLMs across varying visual input qualities. To achieve this, we employ Gaussian perturbations to synthesize distorted visual inputs and enforce token prediction confidence alignment towards the normal visual branch. By explicitly linking confidence calibration to visual robustness, this method reduces over-reliance on language priors. We conduct extensive empirical evaluations across diverse multi-task downstream settings with popular MLLM architectures. The comprehensive experiments demonstrate the effectiveness of our method, showcasing its ability to alleviate open-response overfitting while maintaining satisfactory multi-task fine-tuning performance.
Paperid:1382
Authors:Qingchuan Ma · Yuhang Wu · Xiawu Zheng · Rongrong Ji
Abstract:
In this paper, we aim to establish a simple, effective, and theoretically grounded benchmark for rigorously probing abstract reasoning in Large Language Models (LLMs). To achieve this, we first develop a mathematical framework that defines abstract reasoning as the ability to: (i) extract essential patterns independent of surface representations, and (ii) apply consistent rules to these abstract patterns. Based on this framework, we introduce two novel complementary metrics: Γ measures basic reasoning accuracy, while ∆ quantifies a model's reliance on specific symbols rather than underlying patterns, a key indicator of true abstraction versus mere memorization. To implement this measurement, we design a benchmark: systematic symbol remapping in rule-based tasks, which forces models to demonstrate genuine pattern recognition beyond superficial token matching. Extensive LLM evaluations using this benchmark (commercial API models, 7B-70B, multi-agent) reveal: 1) critical limitations in non-decimal arithmetic and symbolic reasoning; 2) persistent abstraction gaps despite chain-of-thought prompting; and 3) ∆'s effectiveness in robustly measuring memory dependence by quantifying performance degradation under symbol remapping, particularly highlighting operand-specific memorization. These findings underscore that current LLMs, despite domain-specific strengths, still lack robust abstract reasoning, highlighting key areas for future improvement.
Paperid:1383
Authors:Zeqiong Lv · Chao Qian · Yun Liu · Jiahao Fan · Yanan Sun
Abstract:
Evolutionary neural architecture search (ENAS) is a key part of evolutionary machine learning, which commonly utilizes evolutionary algorithms (EAs) to automatically design high-performing deep neural architectures. Over the past years, various ENAS methods have been proposed with exceptional performance. However, theoretical research on ENAS is still in its infancy. In this work, we take a step toward the runtime analysis, an essential theoretical aspect of EAs, of ENAS on multi-class classification problems. Specifically, we first propose a benchmark to lay the groundwork for the analysis. Furthermore, we design a two-level search space, making it suitable for multi-class classification problems and consistent with the common settings of ENAS. Based on both designs, we consider (1+1)-ENAS algorithms with one-bit and bit-wise mutations, and analyze their upper and lower bounds on the expected runtime. We prove that the algorithm using both mutations can find the optimum with an expected runtime upper bound of $O(rM\ln{rM})$ and lower bound of $\Omega(rM\ln{M})$. This suggests that the simple one-bit mutation deserves greater consideration, given that most state-of-the-art ENAS methods are laboriously designed with the bit-wise mutation. Empirical studies also support our theoretical proof.
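As a generic illustration of the analyzed scheme, here is a (1+1) EA with one-bit mutation on a plain bit string; the OneMax fitness is a toy stand-in for the paper's two-level search space, and all names are ours:

```python
import random

def one_plus_one_ea(fitness, d, budget, rng):
    # (1+1) EA: keep a single parent, mutate, accept if not worse.
    x = [rng.randrange(2) for _ in range(d)]
    for _ in range(budget):
        y = x[:]
        i = rng.randrange(d)          # one-bit mutation: flip one position
        y[i] ^= 1
        if fitness(y) >= fitness(x):  # elitist acceptance
            x = y
    return x

rng = random.Random(0)
best = one_plus_one_ea(fitness=sum, d=10, budget=500, rng=rng)
```

On OneMax this mutation scheme needs on the order of $d \ln d$ steps in expectation, the same coupon-collector flavor as the $O(rM\ln rM)$ bound above.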
Paperid:1384
Authors:Le Tuyet Nhi PHAM · Dario Shariatian · Antonio Ocello · Giovanni Conforti · Alain Oliviero Durmus
Abstract:
This paper introduces the Discrete Markov Probabilistic Model (DMPM), a novel algorithm for discrete data generation. The algorithm operates in the space of bits $\{0,1\}^d$, where the noising process is a continuous-time Markov chain that can be sampled exactly via a Poissonian clock that flips labels uniformly at random. The time-reversal process, like the forward noise process, is a jump process, with its intensity governed by a discrete analogue of the classical score function. Crucially, this intensity is proven to be the conditional expectation of a function of the forward process, strengthening its theoretical alignment with score-based generative models while ensuring robustness and efficiency. We further establish convergence bounds for the algorithm under minimal assumptions and demonstrate its effectiveness through experiments on low-dimensional Bernoulli-distributed datasets and high-dimensional binary MNIST data. The results highlight its strong performance in generating discrete structures. This work bridges theoretical foundations and practical applications, advancing the development of effective and theoretically grounded discrete generative modeling.
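A rough sketch of the forward noising process, under our reading that each ring of the Poissonian clock flips one uniformly chosen bit; the exact clock parameterization here is an assumption, not the paper's:

```python
import numpy as np

def forward_noise(x0, t, rate, rng):
    # N ~ Poisson(rate * t * d) rings over [0, t]; each ring flips one
    # uniformly chosen coordinate (assumed form of the clock).
    x = x0.copy()
    d = len(x)
    for _ in range(rng.poisson(rate * t * d)):
        x[rng.integers(d)] ^= 1
    return x

rng = np.random.default_rng(0)
x0 = np.zeros(16, dtype=int)
xt = forward_noise(x0, t=1.0, rate=0.5, rng=rng)
```

For large $t$ the marginal approaches the uniform distribution on $\{0,1\}^d$; the generative model learns the intensity of the reverse jump process.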
Paperid:1385
Authors:Tiange Luo · Ang Cao · Gunhee Lee · Justin Johnson · Honglak Lee
Abstract:
Despite recent advances in Vision-Language Models (VLMs), they may over-rely on the visual language priors present in their training data rather than true visual reasoning. To investigate this, we introduce ViLP, a benchmark featuring deliberately out-of-distribution images synthesized via image generation models and out-of-distribution Q\&A pairs. Each question in ViLP is coupled with three potential answers and three corresponding images: one that can be resolved by text priors alone and two that demand visual reasoning. Although humans achieve near-perfect accuracy, modern VLMs falter; for instance, GPT-4 achieves only 66.17\% on ViLP. To alleviate this, we propose a self-improving framework in which models generate new VQA data, then apply pixel-level and semantic corruptions to form ``good-bad'' image pairs for self-training. Our training objectives compel VLMs to focus more on the actual visual inputs, and we demonstrate their effectiveness in boosting the performance of open-source VLMs, including LLaVA-v1.5 and Cambrian.
Paperid:1386
Authors:zhaowei chen · Borui Zhao · Yuchen Ge · Yuhao Chen · Renjie Song · Jiajun Liang
Abstract:
Online Knowledge Distillation (OKD) methods represent a streamlined, one-stage distillation training process that obviates the necessity of transferring knowledge from a pre-trained teacher network to a more compact student network. In contrast to existing logits-based OKD methods, this paper presents an innovative approach that leverages intermediate spatial representations. Our analysis of the intermediate features from both teacher and student models reveals two pivotal insights: (1) the similar features between students and teachers are predominantly focused on the foreground objects; (2) teacher models emphasize foreground objects more than students do. Building on these findings, we propose Asymmetric Decision-Making (ADM) to enhance feature consensus learning for student models while continuously promoting feature diversity in teacher models. Specifically, Consensus Learning for student models prioritizes spatial features with high consensus relative to teacher models. Conversely, Divergence Learning for teacher models highlights spatial features with lower similarity compared to student models, indicating superior performance by teacher models in these regions. Consequently, ADM helps the student models catch up with the feature learning process of the teacher models. Extensive experiments demonstrate that ADM consistently surpasses existing OKD methods across various online knowledge distillation settings and also achieves superior results when transferred to offline knowledge distillation, semantic segmentation, and diffusion distillation tasks.
Paperid:1387
Authors:Xuanlei Zhao · Shenggan Cheng · Chang Chen · Zangwei Zheng · Ziming Liu · Zheming Yang · Yang You
Abstract:
Scaling multi-dimensional transformers to long sequences is indispensable across various domains. However, the large memory requirements and slow speeds of such sequences necessitate sequence parallelism. All existing approaches fall under the category of embedded sequence parallelism, which is limited to sharding along a single sequence dimension, thereby introducing significant communication overhead. However, the nature of multi-dimensional transformers involves independent calculations across multiple sequence dimensions. To this end, we propose Dynamic Sequence Parallelism (DSP) as a novel abstraction of sequence parallelism. DSP dynamically switches the parallel dimension among all sequences according to the computation stage, with an efficient resharding strategy. DSP offers significant reductions in communication costs, adaptability across modules, and ease of implementation with minimal constraints. Experimental evaluations demonstrate DSP's superiority over state-of-the-art embedded sequence parallelism methods, achieving remarkable throughput improvements ranging from 32.2% to 10x with less than 25% of the communication volume.
Paperid:1388
Authors:Amrith Setlur · Nived Rajaraman · Sergey Levine · Aviral Kumar
Abstract:
Despite substantial improvements in LLM capabilities from scaling test-time compute, an ongoing debate in the community is how it should be scaled up so as to enable continued and efficient improvements with scaling. There are largely two approaches: first, distilling successful search procedures; and second, using verification (e.g., 0/1 correctness rewards, trained reward models, or verifiers) to guide reinforcement learning (RL) and search algorithms. In this paper, we prove that fine-tuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given an equal amount of data budget. Concretely, we show that the suboptimality of VF methods scales poorly with test-time compute budget (measured as the output token length or horizon) compared to VB when the base pre-trained LLM presents a heterogeneous distribution over correct solution traces (e.g., different lengths, styles, etc.) and admits a non-sharp distribution over rewards on traces sampled from it. We formalize this condition using anti-concentration [Erdos, 1945]. This implies a stronger result: VB methods scale better asymptotically, with the performance gap between VB and VF methods widening as the test-time compute budget grows. We corroborate our theoretical results empirically on both didactic and math reasoning problems with 3B/8B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.
Paperid:1389
Authors:Qi Ju · Thomas Tellier · Meng Sun · Zhemei Fang · YunFeng Luo
Abstract:
Artificial intelligence (AI) has surpassed top human experts in various games. In imperfect information games, this success primarily relies on Counterfactual Regret Minimization (CFR) and its variants to solve for Nash equilibria. However, existing research has predominantly focused on optimality, overlooking the importance of strategy diversity and the demand for varied play styles, thereby limiting AI’s adaptability to different user preferences. To address this limitation, we propose Preference CFR (Pref-CFR), a novel approach that introduces two key parameters: preference degree and vulnerability degree. These parameters allow AI to flexibly adjust its strategy distribution within an acceptable loss threshold, enhancing its adaptability to diverse strategic requirements. In our experiments, we successfully trained Aggressive and Loose Passive strategies using Pref-CFR in a Texas Hold’em environment. These strategies achieve performance comparable to standard CFR-trained strategies but exhibit distinctly different play styles. Furthermore, for certain hands, Pref-CFR generates strategies that significantly deviate from conventional expert theories and CFR-derived strategies, potentially providing valuable insights for professional players.
Paperid:1390
Authors:Yinong O Wang · Nivedha Sivakumar · Falaah Arif Khan · Katherine Metcalf · Adam Golinski · Natalie Mackraz · Barry-John Theobald · Luca Zappella · Nicholas Apostoloff
Abstract:
The recent rapid adoption of large language models (LLMs) highlights the critical need for benchmarking their fairness. Conventional fairness metrics, which focus on discrete accuracy-based evaluations (i.e., prediction correctness), fail to capture the implicit impact of model uncertainty (e.g., higher model confidence about one group over another despite similar accuracy). To address this limitation, we propose an uncertainty-aware fairness metric, UCerF, to enable a fine-grained evaluation of model fairness that is more reflective of the internal bias in model decisions. Furthermore, observing data size, diversity, and clarity issues in current datasets, we introduce a new gender-occupation fairness evaluation dataset with 31,756 samples for co-reference resolution, offering a more diverse and suitable benchmark for modern LLMs. Combining our metric and dataset, we provide insightful comparisons of eight open-source LLMs. For example, Mistral-8B exhibits suboptimal fairness due to high confidence in incorrect predictions, a detail overlooked by Equalized Odds but captured by UCerF. Overall, this work provides a holistic framework for LLM evaluation by jointly assessing fairness and uncertainty, enabling the development of more transparent and accountable AI systems.
Paperid:1391
Authors:Jeremy McMahan
Abstract:
We study the computational complexity of approximating general constrained Markov decision processes. Our primary contribution is the design of a polynomial-time $(0,\epsilon)$-additive bicriteria approximation algorithm for finding optimal constrained policies across a broad class of recursively computable constraints, including almost-sure, chance, expectation, and their anytime variants. Matching lower bounds imply our approximation guarantees are optimal so long as $P \neq NP$. The generality of our approach yields answers to several long-standing open complexity questions in the constrained reinforcement learning literature. Specifically, we are the first to prove polynomial-time approximability for the following settings: policies under chance constraints, deterministic policies under multiple expectation constraints, policies under non-homogeneous constraints (i.e., constraints of different types), and policies under constraints for continuous-state processes.
Paperid:1392
Authors:Shengyu Feng · Yiming Yang
Abstract:
This work proposes a simple yet effective sampling framework for combinatorial optimization (CO). Our method builds on discrete Langevin dynamics (LD), an efficient gradient-guided generative algorithm. However, we observed that directly applying LD often leads to limited exploration. To overcome this limitation, we propose Regularized Langevin Dynamics (RLD), which enforces an expected distance between the sampled and current solutions, effectively avoiding local minima. We develop two CO solvers on top of RLD, one based on simulated annealing (SA) and the other on a neural network (NN). Empirical results on three classical CO problems demonstrate that both of our methods achieve comparable or better performance than the previous state-of-the-art (SOTA) SA and NN-based solvers. In particular, our SA algorithm reduces the running time of the previous SOTA SA method by up to 80\%, while achieving equal or superior performance. In summary, RLD offers a promising framework for enhancing both traditional heuristics and NN models to solve CO problems.
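One loose reading of the distance regularization, sketched as a Metropolis-style step with a penalty on deviating from a target Hamming distance; the paper's gradient-guided discrete Langevin update is more elaborate, and all names here are ours:

```python
import numpy as np

def rld_step(x, energy, rng, temp=1.0, target_dist=3, reg=0.5):
    # Propose roughly target_dist coordinate flips, then accept/reject
    # with an energy penalized by deviation from the target distance.
    d = len(x)
    flips = rng.random(d) < target_dist / d
    y = np.where(flips, 1 - x, x)
    dist = int(np.sum(y != x))
    delta = energy(y) - energy(x) + reg * (dist - target_dist) ** 2
    if delta <= 0 or rng.random() < np.exp(-delta / temp):
        return y
    return x

rng = np.random.default_rng(0)
energy = lambda z: -int(np.sum(z))   # toy energy: prefer more ones
x = np.zeros(8, dtype=int)
for _ in range(50):
    x = rld_step(x, energy, rng)
```

Forcing the chain to move a nonzero expected distance each step is what discourages it from stalling in a local minimum.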
Paperid:1393
Authors:Zihan Zhou · Yang Zhou · Zijie Zhang · Lingjuan Lyu · Da Yan · Ruoming Jin · Dejing Dou
Abstract:
Machine unlearning (MU) aims to remove the influence of specific data points from trained models, enhancing compliance with privacy regulations. However, the vulnerability of basic MU models to malicious unlearning requests in adversarial learning environments has been largely overlooked. Existing adversarial MU attacks suffer from three key limitations: inflexibility due to predefined attack targets, inefficiency in handling multiple attack requests, and instability caused by non-convex loss functions. To address these challenges, we propose a Flexible, Efficient, and Stable Attack (FESA). First, leveraging Carathéodory’s theorem, we introduce a convex polyhedral approximation to identify points in the loss landscape where convexity approximately holds, ensuring stable attack performance. Second, inspired by simplex theory and John’s theorem, we develop a regular simplex detection technique that maximizes coverage over the parameter space, improving attack flexibility and efficiency. We theoretically derive the proportion of the effective parameter space occupied by the constructed simplex. We evaluate the attack success rate of our FESA method on real datasets against state-of-the-art machine unlearning attack methods.
Paperid:1394
Authors:Yiming Yang · Xiaoyuan Cheng · Daniel Giles · Sibo Cheng · Yi He · Xiao Xue · Boli Chen · Yukun Hu
Abstract:
Variational data assimilation estimates the dynamical system states by minimizing a cost function that fits the numerical models to the observational data. Although four-dimensional variational assimilation (4D-Var) is widely used, it faces high computational costs in complex nonlinear systems and depends on imperfect state-observation mappings. Deep learning (DL) offers more expressive approximators, but integrating DL models into 4D-Var is challenging due to their nonlinearities and lack of theoretical guarantees in assimilation results. In this paper, we propose \textit{Tensor-Var}, a novel framework that integrates kernel conditional mean embedding (CME) with 4D-Var to linearize nonlinear dynamics, achieving convex optimization in a learned feature space. Moreover, our method provides a new perspective for solving 4D-Var in a linear way, offering theoretical guarantees of consistent assimilation results between the original and feature spaces. To handle large-scale problems, we propose a method to learn deep features (DFs) using neural networks within the Tensor-Var framework. Experiments on chaotic systems and global weather prediction with real-time observations show that Tensor-Var outperforms conventional and DL hybrid 4D-Var baselines in accuracy while achieving a 10- to 20-fold speed improvement.
Paperid:1395
Authors:Prateek Gupta · Andrea Ferrari · Nabil Iqbal
Abstract:
The notion of duality -- that a given physical system can have two different mathematical descriptions -- is a key idea in modern theoretical physics. Establishing a duality in lattice statistical mechanics models requires the construction of a dual Hamiltonian and a map from the original to the dual observables. By using neural networks to parameterize these maps and introducing a loss function that penalises the difference between correlation functions in the original and dual models, we formulate the process of duality discovery as an optimization problem. We numerically solve this problem and show that our framework can rediscover the celebrated Kramers-Wannier duality for the 2d Ising model, numerically reconstructing the known mapping of temperatures. We discuss future directions and prospects for discovering new dualities within this framework.
Paperid:1396
Authors:Ron Shapira Weber · shahar benishay · Shahaf E. Finder · Andrey Lavrinenko · Oren Freifeld
Abstract:
Fast and scalable alignment of time series is a fundamental challenge in many domains. The standard solution, Dynamic Time Warping (DTW), struggles with poor scalability and sensitivity to noise. We introduce TimePoint, a self-supervised method that dramatically accelerates DTW-based alignment, while typically improving alignment accuracy, by learning keypoints and descriptors from synthetic data. Inspired by 2D keypoint detection but carefully adapted to the unique challenges of 1D signals, TimePoint leverages efficient 1D diffeomorphisms—which effectively model nonlinear time warping—to generate realistic training data. This adaptation, along with fully convolutional and wavelet convolutional architectures, enables the extraction of informative keypoints and descriptors. Applying DTW to these sparse representations yields major speedups and typically higher alignment accuracy than standard DTW applied to the full signals. Despite being trained solely on synthetic data, TimePoint generalizes well to real-world time series. Extensive experiments demonstrate that TimePoint consistently achieves faster and more accurate alignments than standard DTW, making it a scalable solution for time-series analysis. We will release the code and dataset upon acceptance.
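For reference, the classic dynamic-programming DTW that TimePoint accelerates; applying it to short keypoint sequences instead of full signals is where the quadratic cost is saved:

```python
import numpy as np

def dtw(a, b):
    # D[i, j] = minimal warped alignment cost of a[:i] and b[:j].
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Since the recurrence costs O(nm), shrinking both sequences to a handful of keypoints gives a quadratic reduction in work.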
Paperid:1397
Authors:Tianyu Cui · Song-Jun Xu · Artem Moskalev · Shuwei Li · Tommaso Mansi · Mangal Prakash · Rui Liao
Abstract:
Inferring Gene Regulatory Networks (GRNs) from gene expression data is crucial for understanding biological processes. While supervised models are reported to achieve high performance for this task, they rely on costly ground truth (GT) labels and risk learning gene-specific biases—such as class imbalances of GT interactions—rather than true regulatory mechanisms. To address these issues, we introduce InfoSEM, an unsupervised generative model that leverages textual gene embeddings as informative priors, improving GRN inference without GT labels. InfoSEM can also integrate GT labels as an additional prior when available, avoiding biases and further enhancing performance. Additionally, we propose a biologically motivated benchmarking framework that better reflects real-world applications such as biomarker discovery and reveals learned biases of existing supervised methods. InfoSEM outperforms existing models by 38.5% across four datasets when using textual embedding priors, and further boosts performance by 11.1% when integrating labeled data as priors.
Paperid:1398
Authors:Renpu Liu · Hao Wu · Jiawei Zhang · Xin Cheng · Xiangyang Luo · Bin Ma · Jinwei Wang
Abstract:
Adversarial examples have been shown to deceive Deep Neural Networks (DNNs), raising widespread concerns about this security threat. More seriously, as different DNN models share critical features, feature-level attacks can generate transferable adversarial examples, thereby deceiving black-box models in real-world scenarios. Nevertheless, we have theoretically discovered the principle behind the limited transferability of existing feature-level attacks: their attack effectiveness is essentially equivalent to perturbing features once in the feature space along the direction of feature importance, despite performing multiple perturbations in the pixel space. This finding indicates that existing feature-level attacks are inefficient in disrupting features through multiple pixel-space perturbations. To address this problem, we propose P2FA, an attack that efficiently perturbs features multiple times. Specifically, we directly shift the perturbed space from pixel to feature space. Then, we perturb the features multiple times rather than just once in the feature space, with the guidance of feature importance, to enhance the efficiency of disrupting critical shared features. Finally, we invert the perturbed features back to pixels to generate more transferable adversarial examples. Numerous experimental results strongly demonstrate the superior transferability of P2FA over SOTA attacks.
Paperid:1399
Authors:Weimin Wu · Maojiang Su · Jerry Yao-Chieh Hu · Zhao Song · Han Liu
Abstract:
We investigate the transformer's capability for in-context learning (ICL) to simulate the training process of deep models. Our key contribution is providing a positive example of using a transformer to train a deep neural network by gradient descent in an implicit fashion via ICL. Specifically, we provide an explicit construction of a $(2N+4)L$-layer transformer capable of simulating $L$ gradient descent steps of an $N$-layer ReLU network through ICL. We also give theoretical guarantees for the approximation within any given error and for the convergence of the ICL gradient descent. Additionally, we extend our analysis to the more practical setting of softmax-based transformers. We validate our findings on synthetic datasets for 3-layer, 4-layer, and 6-layer neural networks. The results show that ICL performance matches that of direct training.
Paperid:1400
Authors:Longhui Zhang · Bin Wang · Jiahao Wang · Xiaofeng Zhao · Min Zhang · Hao yang · Meishan Zhang · YU LI · Jing Li · Jun Yu · Min Zhang
Abstract:
Large language models (LLMs) have made significant strides in code translation tasks. However, ensuring both the correctness and readability of translated code remains a challenge, limiting their effective adoption in real-world software development. In this work, we propose F2STrans, a function-to-style guiding paradigm designed to progressively improve the performance of LLMs in code translation. Our approach comprises two key stages: (1) Functional learning, which optimizes translation correctness using high-quality source-target code pairs mined from online programming platforms, and (2) Style learning, which improves translation readability by incorporating both positive and negative style examples. Additionally, we introduce a novel code translation benchmark that includes up-to-date source code, extensive test cases, and manually annotated ground-truth translations, enabling comprehensive functional and stylistic evaluations. Experiments on both our new benchmark and existing datasets demonstrate that our approach significantly improves code translation performance. Notably, our approach enables Qwen-1.5B to outperform prompt-enhanced Qwen-32B and GPT-4 on average across 20 diverse code translation scenarios.
Paperid:1401
Authors:Ken Ziyu Liu · Christopher A. Choquette Choo · Matthew Jagielski · Peter Kairouz · Sanmi Koyejo · Nicolas Papernot · Percy Liang
Abstract:
An important question today is whether a given text was used to train a large language model (LLM). A \emph{completion} test is often employed: check if the LLM completes a sufficiently complex text. But this requires a ground-truth definition of membership; most commonly, a text is defined as a member based on the $n$-gram overlap between the target text and any text in the dataset. In this work, we demonstrate that this $n$-gram based membership definition can be effectively gamed. We study scenarios where sequences are \emph{non-members} for a given $n$ and find that completion tests still succeed. We find many natural cases of this by retraining LLMs after removing all training samples that were completed: these cases include exact duplicates, near-duplicates, and even short overlaps; they showcase that it is difficult to find a single viable choice of $n$. Using these insights, we design adversarial datasets that can cause any target sequence to be completed without containing it, for any reasonable choice of $n$. Our findings highlight the inadequacy of $n$-gram membership, suggesting membership definitions fail to account for auxiliary information available to the training algorithm.
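A minimal sketch of the word-level $n$-gram membership notion this abstract critiques (an illustrative simplification, not the paper's exact definition):

```python
def ngram_member(target: str, corpus_texts, n: int = 8) -> bool:
    """Generic n-gram membership test: the target counts as a 'member' if any
    word-level n-gram it contains also appears in some training text.
    Illustrative sketch only, not the paper's exact formulation."""
    def ngrams(text):
        toks = text.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    target_grams = ngrams(target)
    # Membership holds if any n-gram is shared with any corpus document.
    return any(target_grams & ngrams(t) for t in corpus_texts)
```

The paper's point is that a sequence can be a non-member under any such overlap threshold $n$ yet still be completed by the model.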
Paperid:1402
Authors:Maricela Best Mckay · Avleen Kaur · Chen Greif · Brian Wetton
Abstract:
Recently, natural gradient methods have emerged as a promising alternative to gradient descent for problems in machine learning, where update directions in the parameter space pose optimization challenges. One such class of problems that can be notoriously challenging to optimize is physics-informed neural networks (PINNs). Natural gradient methods for PINNs have achieved state-of-the-art performance with errors several orders of magnitude smaller than those achieved by standard optimizers such as ADAM or L-BFGS. However, computing natural gradients for PINNs is prohibitively computationally costly and memory-intensive for all but small neural network architectures. We develop a randomized algorithm for natural gradient descent for PINNs that uses sketching to approximate the natural gradient descent direction. We prove that the change-of-coordinate Gram matrix used in a natural gradient descent update has rapidly decaying eigenvalues for a one-layer, one-dimensional neural network and empirically demonstrate that this structure holds for five different example problems. Under this structure, our sketching algorithm is guaranteed to provide a near-optimal low-rank approximation of the Gramian. Our algorithm dramatically speeds up computation time and reduces memory overhead. Additionally, in our experiments, the sketched natural gradient outperforms the original natural gradient in terms of accuracy, often improving error by over an order of magnitude. Training time for a network with around 5,000 parameters is reduced from several hours to under two minutes. Training can be practically scaled to large network sizes; we optimize a PINN for a network with over a million parameters within a few minutes, a task for which the full Gram matrix does not fit in memory.
Paperid:1403
Authors:Zehong Wang · Zheyuan Zhang · Tianyi MA · Nitesh Chawla · Chuxu Zhang · Yanfang Ye
Abstract:
Graph learning tasks require models to comprehend essential substructure patterns relevant to downstream tasks, such as triadic closures in social networks and benzene rings in molecular graphs. Due to the non-Euclidean nature of graphs, existing graph neural networks (GNNs) rely on message passing to iteratively aggregate information from local neighborhoods. Despite their empirical success, message passing struggles to identify fundamental substructures, such as triangles, limiting its expressiveness. To overcome this limitation, we propose the Neural Graph Pattern Machine (GPM), a framework designed to learn directly from graph patterns. GPM efficiently extracts and encodes substructures while identifying the most relevant ones for downstream tasks. We also demonstrate that GPM offers superior expressivity and improved long-range information modeling compared to message passing. Empirical evaluations on node classification, link prediction, graph classification, and regression show the superiority of GPM over state-of-the-art baselines. Further analysis reveals its desirable out-of-distribution robustness, scalability, and interpretability. We consider GPM to be a step toward going beyond message passing.
Paperid:1404
Authors:Muru Zhang · Mayank Mishra · Zhongzhu Zhou · William Brandon · Jue Wang · Yoon Kim · Jonathan Ragan-Kelley · Shuaiwen Song · Ben Athiwaratkun · Tri Dao
Abstract:
Large language model inference is both memory-intensive and time-consuming, often requiring distributed algorithms to scale efficiently. Various model parallelism strategies are used in multi-GPU training and inference to partition computation across multiple devices, reducing memory load and computation time. However, model parallelism necessitates communication of information between GPUs, which has been a major bottleneck and limits the gains obtained by scaling up the number of devices. We introduce Ladder Residual, a simple architectural modification applicable to all residual-based models that enables straightforward overlapping to effectively hide the latency of communication. Our insight is that in addition to systems optimization, one can also redesign the model architecture to decouple communication from computation. While Ladder Residual can allow communication-computation decoupling in conventional parallelism patterns, we focus on Tensor Parallelism in this paper, which is particularly bottlenecked by its heavy communication. For a Transformer model with 70B parameters, applying Ladder Residual to all its layers achieves a 29% end-to-end wall-clock speedup at inference time with TP sharding over 8 devices. We refer to the resulting model as the Ladder Transformer. We train 1B and 3B Ladder Transformers from scratch and observe performance comparable to a standard dense transformer baseline. We also show that it is possible to convert parts of the Llama-3.1 8B model to our Ladder Residual architecture with minimal accuracy degradation by retraining on only 3B tokens.
Paperid:1405
Authors:Alexander Moreno · Justin Xiao · Jonathan Mei
Abstract:
Structured Kernel Interpolation (SKI) (Wilson et al. 2015) helps scale Gaussian Processes (GPs) by approximating the kernel matrix via interpolation at inducing points, achieving linear computational complexity. However, it lacks rigorous theoretical error analysis. This paper bridges the gap: we prove error bounds for the SKI Gram matrix and examine the error's effect on hyperparameter estimation and posterior inference. We further provide a practical guide to selecting the number of inducing points under convolutional cubic interpolation: they should grow as $n^{d/3}$ for error control. Crucially, we identify two dimensionality regimes governing the trade-off between SKI Gram matrix spectral norm error and computational complexity. For $d \leq 3$, *any* error tolerance can achieve linear time for sufficiently large sample size. For $d > 3$, the error must *increase* with sample size to maintain linear time. Our analysis provides key insights into SKI's scalability-accuracy trade-offs, establishing precise conditions for achieving linear-time GP inference with controlled approximation error.
Paperid:1406
Authors:Baohong Li · Yingrong Wang · Anpeng Wu · ma ming · Ruoxuan Xiong · Kun Kuang
Abstract:
Generalizing treatment effects from Randomized Controlled Trials (RCTs) across different environments holds significant practical value due to their long duration and high cost. A common challenge in doing so is environmental shift, such as shifts in the distribution and quantity of covariates. A common solution is to combine and match RCT data from the source population with observational data from the target population using a separating set. However, this solution relies on the assumption that the covariates shared by both groups contain the separating set, which is difficult to satisfy in real-world scenarios under experimental shifts. In this paper, we propose a novel Two-Stage Doubly Robust (2SDR) solution that relaxes this assumption: 2SDR only assumes that variables from the separating set are present in at least one of the two data groups. It leverages shadow variables to impute the missing covariates, generalizing treatment effects from RCTs across environments in two stages. Our solution is theoretically guaranteed by identification theory, and its effectiveness is supported by extensive experimental evidence on both synthetic and real-world datasets.
Paperid:1407
Authors:Xiaoli Tang · Han Yu · Zengxiang Li · Xiaoxiao Li
Abstract:
Auction-based Federated Learning (AFL) has emerged as an important research field in recent years. The prevailing strategies for FL data consumers (DCs) assume that the entire team of the required data owners (DOs) for an FL task must be assembled before training can commence. In practice, a DC can trigger the FL training process multiple times. DOs can thus be gradually recruited over multiple FL model training sessions. Existing bidding strategies for AFL DCs are not designed to handle such scenarios. Therefore, the problem of multi-session AFL remains open. To address this problem, we propose the Multi-session Budget Optimization Strategy for forward Auction-based Federated Learning (MBOS-AFL). Based on hierarchical reinforcement learning, MBOS-AFL jointly optimizes inter-session budget pacing and intra-session bidding for AFL DCs, with the objective of maximizing the total utility. Extensive experiments on six benchmark datasets show that it significantly outperforms seven state-of-the-art approaches. On average, MBOS-AFL achieves 12.28% higher utility, 14.52% more data acquired through auctions for a given budget, and 1.23% higher test accuracy achieved by the resulting FL model compared to the best baseline. To the best of our knowledge, it is the first budget optimization decision support method with budget pacing capability designed for DCs in multi-session forward auction-based FL.
Paperid:1408
Authors:Ziliang Chen · Zhao-Rong Lai · Yufeng Yang · Liangda Fang · ZHANFU YANG · Liang Lin
Abstract:
Despite advances in language model (LM) alignment, direct preference optimization (DPO) does not obtain strong reasoning for free. As a breakthrough, this work proposes a new preference optimization framework incorporating another LM policy, which holds asymptotic equivalence with AlphaZero-like search, the apex of complex reasoning. The preference optimization scheme discards any value or reward modeling, while the neural implicit tree search coupled with the DPO policy still enables AlphaZero-style LLM reasoning. Our empirical studies confirm that our method outperforms DPO variants in human preference alignment, and MCTS-based LMs in mathematical reasoning and planning tasks.
Paperid:1409
Authors:Maya Pavlova · Erik Brinkman · Krithika Iyer · Vítor Albiero · Joanna Bitton · Hailey Nguyen · Cristian Canton · Ivan Evtimov · Aaron Grattafiori
Abstract:
Red teaming aims to assess how large language models (LLMs) can produce content that violates norms, policies, and rules set forth during their safety training. However, most existing automated methods in the literature are not representative of the way common users exploit the multi-turn conversational nature of AI models. While manual testing addresses this gap, it is an inefficient and often expensive process. To address these limitations, we introduce the Generative Offensive Agent Tester (GOAT), an automated agentic red teaming system that simulates plain-language adversarial conversations while leveraging multiple adversarial prompting techniques to identify vulnerabilities in LLMs. We instantiate GOAT with 7 red teaming attacks by prompting a general-purpose model in a way that encourages reasoning through the choices of methods available, the current target model’s response, and the next steps. Our approach is designed to be extensible and efficient, allowing human testers to focus on exploring new areas of risk while automation covers the scaled adversarial stress-testing of known risk territory. We present the design and evaluation of GOAT, demonstrating its effectiveness in identifying vulnerabilities in state-of-the-art LLMs on the JailbreakBench dataset, with an ASR@10 of 96% against smaller models such as Llama 3.1 8B, 91% against Llama 3.1 70B, and 94% against GPT-4o.
Paperid:1410
Authors:Xiang Zhang · Run He · Chen Jiao · Di Fang · Ming Li · Ziqian Zeng · Cen Chen · HUIPING ZHUANG
Abstract:
Class-incremental learning (CIL) enables models to learn new classes continually without forgetting previously acquired knowledge. Multi-label CIL (MLCIL) extends CIL to a real-world scenario where each sample may belong to multiple classes, introducing several challenges: label absence, which leads to incomplete historical information due to missing labels, and class imbalance, which biases the model toward majority classes. To address these challenges, we propose Label-Augmented Analytic Adaptation (L3A), an exemplar-free approach without storing past samples. L3A integrates two key modules. The pseudo-label (PL) module implements label augmentation by generating pseudo-labels for current phase samples, addressing the label absence problem. The weighted analytic classifier (WAC) derives a closed-form solution for neural networks. It introduces sample-specific weights to adaptively balance the class contribution and mitigate class imbalance. Experiments on MS-COCO and PASCAL VOC datasets demonstrate that L3A outperforms existing methods in MLCIL tasks.
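The general shape of a weighted analytic (closed-form least-squares) classifier, of the kind the abstract's WAC module builds on, can be sketched as follows; the linear form, weights, and regularizer here are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

def weighted_analytic_classifier(X, Y, w, lam=1e-3):
    """Closed-form weighted ridge classifier: solves
    argmin_W  sum_i w_i * ||Y_i - X_i W||^2 + lam * ||W||^2.
    X: (n, d) features, Y: (n, c) multi-hot labels, w: (n,) sample weights.
    Generic sketch of a weighted analytic solution, not the paper's WAC."""
    D = np.diag(w)
    # Normal equations with per-sample weights and ridge regularization.
    W = np.linalg.solve(X.T @ D @ X + lam * np.eye(X.shape[1]), X.T @ D @ Y)
    return W  # predict scores with X_new @ W
```

Down-weighting majority-class samples via `w` is one simple way such a closed-form solution can counter class imbalance.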
Paperid:1411
Authors:Abhishek Tyagi · Arjun Iyer · William Renninger · Christopher Kanan · Yuhao Zhu
Abstract:
Recent advances in Dynamic Sparse Training (DST) have pushed the frontier of sparse neural network training in structured and unstructured contexts, matching dense-model performance while drastically reducing parameter counts to facilitate model scaling. However, unstructured sparsity often fails to translate into practical speedups on modern hardware. To address this shortcoming, we propose DynaDiag, a novel structured sparse-to-sparse DST method that performs on par with unstructured sparsity. DynaDiag enforces a diagonal sparsity pattern throughout training and preserves sparse computation in the forward and backward passes. We further leverage the diagonal structure to accelerate computation via a custom CUDA kernel, rendering the method hardware-friendly. Empirical evaluations on diverse neural architectures demonstrate that our method maintains accuracy on par with unstructured counterparts while benefiting from tangible computational gains. Notably, with 90\% sparse linear layers in ViTs, we observe up to a 3.13x speedup in online inference without sacrificing model performance and a 1.59x speedup in training on a GPU compared to equivalent unstructured layers.
Paperid:1412
Authors:Cassidy Laidlaw · Eli Bronstein · Timothy Guo · Dylan Feng · Lukas Berglund · Justin Svegliato · Stuart Russell · Anca Dragan
Abstract:
Assistance games are a promising alternative to reinforcement learning from human feedback (RLHF) for training AI assistants. Assistance games resolve key drawbacks of RLHF, such as incentives for deceptive behavior, by explicitly modeling the interaction between assistant and user as a two-player game where the assistant cannot observe their shared goal. Despite their potential, assistance games have only been explored in simple settings. Scaling them to more complex environments is difficult because it requires both solving intractable decision-making problems under uncertainty and accurately modeling human users' behavior. We present the first scalable approach to solving assistance games and apply it to a new, challenging Minecraft-based assistance game with over $10^{400}$ possible goals. Our approach, AssistanceZero, extends AlphaZero with a neural network that predicts human actions and rewards, enabling it to plan under uncertainty. We show that AssistanceZero outperforms model-free RL algorithms and imitation learning in the Minecraft-based assistance game. In a human study, our AssistanceZero-trained assistant significantly reduces the number of actions participants take to complete building tasks in Minecraft. Our results suggest that assistance games are a tractable framework for training effective AI assistants in complex environments. Code and videos are available at https://anonymous.4open.science/w/scalably-solving-assistance-games/.
Paperid:1413
Authors:Deming Sheng · Ricardo Henao
Abstract:
Probabilistic survival analysis models seek to estimate the distribution of the future occurrence (time) of an event given a set of covariates. In recent years, these models have preferred nonparametric specifications that avoid directly estimating survival distributions via discretization. Specifically, they estimate the probability of an individual event at fixed times or the time of an event at fixed probabilities (quantiles), using supervised learning. Borrowing ideas from the quantile regression literature, we propose a parametric survival analysis method based on the Asymmetric Laplace Distribution (ALD). This distribution allows for closed-form calculation of popular event summaries such as mean, median, mode, variance, and quantiles. The model is optimized by maximum likelihood to learn, at the individual level, the parameters (location, scale, and asymmetry) of the ALD. Extensive results on synthetic and real-world data demonstrate that the proposed method outperforms parametric and nonparametric approaches in terms of accuracy, discrimination, and calibration.
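The closed-form summaries the abstract mentions can be illustrated with the standard quantile parameterization of the ALD (location mu, scale sigma, asymmetry p); this is a textbook form and the paper's exact parameterization may differ:

```python
import math

def ald_quantile(q, mu, sigma, p):
    """Closed-form quantile function of the Asymmetric Laplace Distribution
    in the quantile parameterization: F(mu) = p, so p is the asymmetry level.
    Textbook form; illustrative, not necessarily the paper's parameterization."""
    if q <= p:
        return mu + (sigma / (1 - p)) * math.log(q / p)
    return mu - (sigma / p) * math.log((1 - q) / (1 - p))

def ald_mean(mu, sigma, p):
    """Closed-form mean of the same ALD parameterization."""
    return mu + sigma * (1 - 2 * p) / (p * (1 - p))
```

Setting q equal to the asymmetry p recovers the location mu, and p = 0.5 recovers the symmetric Laplace with mean mu.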
Paperid:1414
Authors:Fabian Schaipp · Alexander Hägele · Adrien Taylor · Umut Simsekli · Francis Bach
Abstract:
We show that learning-rate schedules for large model training behave surprisingly similarly to a performance bound from non-smooth convex optimization theory. We provide a bound for the constant schedule with linear cooldown; in particular, the practical benefit of cooldown is reflected in the bound due to the absence of logarithmic terms. Further, we show that this surprisingly close match between optimization theory and practice can be exploited for learning-rate tuning: we achieve noticeable improvements for training 124M and 210M Llama-type models by (i) extending the schedule for continued training with an optimal learning rate, and (ii) transferring the optimal learning rate across schedules.
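The schedule shape analyzed here, constant followed by linear cooldown, can be sketched in a few lines; the cooldown fraction below is an illustrative choice, not a value from the paper:

```python
def constant_with_linear_cooldown(step, total_steps, base_lr, cooldown_frac=0.2):
    """Constant learning rate for the first (1 - cooldown_frac) of training,
    then a linear decay to zero. Minimal sketch of the schedule shape the
    abstract analyzes; cooldown_frac is an illustrative default."""
    cooldown_start = int(total_steps * (1 - cooldown_frac))
    if step < cooldown_start:
        return base_lr
    # Linear interpolation from base_lr at cooldown_start down to 0 at the end.
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - cooldown_start)
```

Such a schedule keeps the peak rate for most of training and only pays the cooldown cost at the end, which is the regime the bound in the paper speaks to.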
Paperid:1415
Authors:Siwei Xia · Li Sun · Tiantian Sun · Qingli Li
Abstract:
Drag-based editing within a pretrained diffusion model provides a precise and flexible way to manipulate foreground objects. Traditional methods directly optimize the input features obtained from DDIM inversion, adjusting them iteratively to guide handle points towards target locations. However, these approaches often suffer from limited accuracy due to the low representation ability of the features in motion supervision, as well as inefficiencies caused by the large search space required for point tracking. To address these limitations, we present DragLoRA, a novel framework that integrates LoRA (Low-Rank Adaptation) adapters into the drag-based editing pipeline. DragLoRA enables online optimization during motion supervision and utilizes its output for handle point tracking. To enhance the training of LoRA adapters, we introduce an additional denoising score distillation loss which regularizes the online model by aligning its output with that of the original model. Additionally, we improve the consistency of motion supervision by adapting the input features using the updated LoRA, giving a more stable and accurate input feature for subsequent operations. Building on this, we design an adaptive optimization scheme that dynamically toggles between two modes, prioritizing efficiency without compromising precision. Extensive experiments demonstrate that DragLoRA significantly enhances the control precision and computational efficiency for drag-based image editing.
Paperid:1416
Authors:Keyue Qiu · Yuxuan Song · Zhehuan Fan · Peidong Liu · Zhe Zhang · Mingyue Zheng · Hao Zhou · Wei-Ying Ma
Abstract:
Structure-Based Drug Design (SBDD) is crucial for identifying bioactive molecules. Recent deep generative models face challenges in geometric structure modeling. A major bottleneck lies in the twisted probability path of multiple modalities (continuous 3D positions and discrete 2D topologies), which jointly determine molecular geometries. By establishing the fact that noise schedules decide the Variational Lower Bound (VLB) for the twisted probability path, we propose a VLB-Optimal Scheduling (VOS) strategy in this under-explored area, which optimizes the VLB as a path integral for SBDD. Our model effectively enhances molecular geometries and interaction modeling, achieving a state-of-the-art PoseBusters passing rate of 95.9\% on CrossDock, a more than 10\% improvement over strong baselines, while maintaining high affinities and robust intramolecular validity on the held-out test set.
Paperid:1417
Authors:Siru Zhong · Weilin Ruan · Ming Jin · Huan Li · Qingsong Wen · Yuxuan Liang
Abstract:
Recent advancements in time series forecasting have explored augmenting models with text or vision modalities to improve accuracy. While text provides contextual understanding, it often lacks fine-grained temporal details. Conversely, vision captures intricate temporal patterns but lacks semantic context, limiting the complementary potential of these modalities. To address this, we propose Time-VLM, a novel multimodal framework that leverages pre-trained Vision-Language Models (VLMs) to bridge temporal, visual, and textual modalities for enhanced forecasting. Our framework comprises three key components: (1) a Retrieval-Augmented Learner, which extracts enriched temporal features through memory bank interactions; (2) a Vision-Augmented Learner, which encodes time series as informative images; and (3) a Text-Augmented Learner, which generates contextual textual descriptions. These components collaborate with frozen pre-trained VLMs to produce multimodal embeddings, which are then fused with temporal features for final prediction. Extensive experiments across diverse datasets demonstrate that Time-VLM achieves superior performance, particularly in few-shot and zero-shot scenarios, thereby establishing a new direction for multimodal time series forecasting.
Paperid:1418
Authors:Fangyikang Wang · Hubery Yin · Shaobin Zhuang · Huminhao Zhu · Yinan Li · Lei Qian · Chao Zhang · Hanbin Zhao · Hui Qian · Chen Li
Abstract:
Recent advancements in diffusion models (DMs) have explored incorporating the second-order diffusion Fisher information (DF), defined as the negative Hessian of the log density, into various downstream tasks and theoretical analysis. However, current practices typically approximate the diffusion Fisher by applying auto-differentiation to the learned score network. This black-box method, though straightforward, lacks any accuracy guarantee and is time-consuming. In this paper, we show that the diffusion Fisher actually resides within a space spanned by the outer products of score and initial data. Based on the outer-product structure, we develop two efficient approximation algorithms to access the trace and matrix-vector multiplication of DF, respectively. These algorithms bypass the auto-differentiation operations with time-efficient vector-product calculations. Furthermore, we establish approximation error bounds for the proposed algorithms. Experiments in likelihood evaluation and adjoint optimization demonstrate the superior accuracy and reduced computational cost of our proposed algorithms. Additionally, based on the novel outer-product formulation of DF, we design the first numerical verification experiment for the optimal transport property of the general PF-ODE-deduced map.
Paperid:1419
Authors:Wang Hanting · Tao Jin · Wang Lin · Shulei Wang · Hai Huang · Shengpeng Ji · Zhou Zhao
Abstract:
Bridge models in image restoration construct a diffusion process from degraded to clear images. However, existing methods typically require training a bridge model from scratch for each specific type of degradation, resulting in high computational costs and limited performance. This work aims to efficiently leverage pretrained generative priors within existing image restoration bridges to eliminate this requirement. The main challenge is that standard generative models are typically designed for a diffusion process that starts from pure noise, while restoration tasks begin with a low-quality image, resulting in a mismatch in the state distributions between the two processes. To address this challenge, we propose a transition equation that bridges two diffusion processes with the same endpoint distribution. Based on this, we introduce the IRBridge framework, which enables the direct utilization of generative models within image restoration bridges, offering a more flexible and adaptable approach to image restoration. Extensive experiments on six image restoration tasks demonstrate that IRBridge efficiently integrates generative priors, resulting in improved robustness and generalization performance. Code will be available on GitHub.
Paperid:1420
Authors:Jiayu Zhang · Xinyi Wang · Zhibo Jin · Zhiyu Zhu · Jianlong Zhou · Fang Chen · Huaming Chen
Abstract:
Out-of-distribution (OOD) detection is essential for enhancing the robustness and security of deep learning models in unknown and dynamic data environments. Gradient-based OOD detection methods, such as GAIA, analyse the explanation pattern representations of in-distribution (ID) and OOD samples by examining the sensitivity of model outputs w.r.t. model inputs, resulting in superior performance compared to traditional OOD detection methods. However, we argue that the non-zero gradient behaviors of OOD samples do not exhibit significant distinguishability, especially when ID samples are perturbed by random noise in high-dimensional spaces, which negatively impacts the accuracy of OOD detection. In this paper, we propose a novel OOD detection method called S\&I, based on layer Splitting and gradient Integration via Adversarial Gradient Attribution. Specifically, our approach involves splitting the model's intermediate layers and iteratively updating adversarial examples layer by layer. We then integrate the attribution gradients from each intermediate layer along the attribution path from adversarial examples to the actual input, yielding true explanation pattern representations for both ID and OOD samples. Experiments demonstrate that our S\&I algorithm achieves state-of-the-art results, with average FPR95 values of 29.05% (38.61%) and 37.31% on the CIFAR100 and ImageNet benchmarks, respectively. Our code is available at: https://anonymous.4open.science/r/S-I-F6F7/
Paperid:1421
Authors:Zichao Li · Xueru Wen · Jie Lou · Yuqiu Ji · Yaojie Lu · Xianpei Han · Debing Zhang · Le Sun
Abstract:
Multimodal Reward Models (MM-RMs) are crucial for aligning Large Language Models (LLMs) with human preferences, particularly as LLMs increasingly interact with multimodal data. However, we find that MM-RMs trained on existing datasets often struggle to generalize to out-of-distribution data due to their reliance on unimodal spurious correlations, primarily text-only shortcuts within the training distribution, which prevents them from leveraging true multimodal reward functions. To address this, we introduce a Shortcut-aware MM-RM learning algorithm that mitigates this issue by dynamically reweighting training samples, shifting the distribution toward better multimodal understanding, and reducing dependence on unimodal spurious correlations. Our experiments demonstrate significant improvements in generalization, downstream task performance, and scalability, establishing a more robust framework for multimodal reward modeling.
Paperid:1422
Authors:Toufique Ahmed · Jatin Ganhotra · Rangeet Pan · Avraham Shinnar · Saurabh Sinha · Martin Hirzel
Abstract:
While there has been plenty of work on generating tests from existing code, there has been limited work on generating tests from issues. A correct test must validate the code patch that resolves the issue. In this work, we focus on the scenario where the code patch does not exist yet. This approach supports two major use cases. First, it supports TDD (test-driven development), the discipline of “test first, write code later” that has well-documented benefits for human software engineers. Second, it also validates SWE (software engineering) agents, which generate code patches for resolving issues. This paper introduces Otter, an LLM-based solution for generating tests from issues. Otter augments LLMs with rule-based analysis to check and repair their outputs, and introduces a novel self-reflective action planning stage. Experiments show Otter outperforming state-of-the-art systems for generating tests from issues, in addition to enhancing systems that generate patches from issues. We hope that Otter helps make developers more productive at resolving issues and leads to more robust, well-tested code.
Paperid:1423
Authors:Botao Chen · Jongyeong Lee · Junya Honda
Abstract:
This paper studies the complexity and optimality of the Follow-the-Perturbed-Leader (FTPL) policy in $K$-armed bandit problems. FTPL is a promising policy that achieves the Best-of-Both-Worlds (BOBW) guarantee without solving an optimization problem, unlike Follow-the-Regularized-Leader (FTRL). However, FTPL needs a procedure called geometric resampling to estimate the loss, which requires $O(K^2)$ per-round average complexity, usually worse than that of FTRL. To address this issue, we propose a novel technique, which we call Conditional Geometric Resampling (CGR), for unbiased loss estimation applicable to general perturbation distributions. CGR reduces the average complexity to $O(K\log K)$ without sacrificing the regret bounds. We also propose a biased version of CGR that can control the worst-case complexity while keeping the BOBW guarantee for a certain perturbation distribution. We confirm through experiments that CGR not only significantly improves the average and worst-case runtime but also achieves better regret thanks to the stable loss estimation.
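The baseline procedure the abstract improves on, plain geometric resampling for FTPL, can be sketched as follows (the standard estimator, not the paper's conditional variant; `sample_arm` stands in for redrawing the perturbed leader):

```python
def geometric_resampling(chosen_arm, observed_loss, sample_arm, max_tries=10000):
    """Standard geometric resampling for FTPL loss estimation: redraw the
    policy's arm until it matches the played arm. The number of draws M is a
    geometric random variable with mean 1/p(chosen_arm), so M * observed_loss
    is an unbiased loss estimate (up to the max_tries truncation). This is the
    O(K) baseline per resample, not the paper's CGR."""
    m = 1
    while sample_arm() != chosen_arm and m < max_tries:
        m += 1
    return m * observed_loss
```

Each call to `sample_arm` simulates one perturbed-leader draw; the expected number of such draws per round is what CGR reduces.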
Paperid:1424
Authors:Divyat Mahajan · Mohammad Pezeshki · Charles Arnal · Ioannis Mitliagkas · Kartik Ahuja · Pascal Vincent
Abstract:
Compositional generalization is a crucial step towards developing data-efficient intelligent machines that generalize in human-like ways. In this work, we tackle a challenging form of distribution shift, termed compositional shift, where some attribute combinations are completely absent at training but present in the test distribution. This shift tests the model's ability to generalize compositionally to novel attribute combinations in discriminative tasks. We model the data with flexible additive energy distributions, where each energy term represents an attribute, and derive a simple alternative to empirical risk minimization termed compositional risk minimization (CRM). We first train an additive energy classifier to predict the multiple attributes and then adjust this classifier to tackle compositional shifts. We provide an extensive theoretical analysis of CRM, where we show that our proposal extrapolates to special affine hulls of seen attribute combinations. Empirical evaluations on benchmark datasets confirm the improved robustness of CRM compared to other methods from the literature designed to tackle various forms of subpopulation shifts.
Paperid:1425
Authors:Chenze Shao · Fandong Meng · Jie Zhou
Abstract:
Conventional wisdom suggests that autoregressive models are used to process discrete data. When applied to continuous modalities such as visual data, Visual AutoRegressive modeling (VAR) typically resorts to quantization-based approaches to cast the data into a discrete space, which can introduce significant information loss. To tackle this issue, we introduce a Continuous VAR framework that enables direct visual autoregressive generation without vector quantization. The underlying theoretical foundation is strictly proper scoring rules, which provide powerful statistical tools capable of evaluating how well a generative model approximates the true distribution. Within this framework, all we need is to select a strictly proper score and set it as the training objective to optimize. We primarily explore a class of training objectives based on the energy score, which is likelihood-free and thus overcomes the difficulty of making probabilistic predictions in the continuous space. Previous efforts on continuous autoregressive generation, such as GIVT and diffusion loss, can also be derived from our framework using other strictly proper scores. We will release the source code upon publication.
Paperid:1426
Authors:Matei Gabriel Cosa · Marek Elias
Abstract:
Combining algorithms is one of the key techniques in learning-augmented algorithms. We consider the following problem: We are given $\ell$ heuristics for Metrical Task Systems (MTS), where each might be tailored to a different type of input instances. While processing an input instance received online, we are allowed to query the action of only one of the heuristics at each time step. Our goal is to achieve performance comparable to the best of the given heuristics. The main difficulty of our setting comes from the fact that the cost paid by a heuristic at time $t$ cannot be estimated unless the same heuristic was also queried at time $t-1$. This is related to Bandit Learning against memory-bounded adversaries (Arora et al., 2012). We show how to achieve regret of $O(\text{OPT}^{2/3})$ and prove a tight lower bound based on the construction of Dekel et al. (2013).
Paperid:1427
Authors:Nghiem Diep · Huy Nguyen · Chau Nguyen · Minh Le · Duy Nguyen · Daniel Sonntag · Mathias Niepert · Nhat Ho
Abstract:
LLaMA-Adapter has recently emerged as an efficient fine-tuning technique for LLaMA models, leveraging zero-initialized attention to stabilize training and enhance performance. However, despite its empirical success, the theoretical foundations of zero-initialized attention remain largely unexplored. In this paper, we provide a rigorous theoretical analysis, establishing a connection between zero-initialized attention and mixture-of-expert models. We prove that both linear and non-linear prompts, along with gating functions, can be optimally estimated, with non-linear prompts offering greater flexibility for future applications. Empirically, we validate our findings on the open LLM benchmarks, demonstrating that non-linear prompts outperform linear ones. Notably, even with limited training data, both prompt types consistently surpass vanilla attention, highlighting the robustness and adaptability of zero-initialized attention.
Paperid:1428
Authors:Ekin Akyürek · Mehul Damani · Adam Zweiger · Linlu Qiu · Han Guo · Jyothish Pari · Yoon Kim · Jacob Andreas
Abstract:
Language models (LMs) have shown impressive performance on tasks within their training distribution, but often struggle with structurally novel tasks even when given a small number of in-context task examples. We investigate the effectiveness of test-time training (TTT)—temporarily updating model parameters during inference using a loss derived from input data—as a mechanism for improving LMs' reasoning and few-shot learning capabilities. On the Abstraction and Reasoning Corpus (ARC), performing TTT with in-context examples yields up to $6\times$ higher accuracy compared to fine-tuned baselines—reaching $53.0\%$ on the public validation set with an 8B-parameter LM and $61.9\%$ when ensembled with program-synthesis methods, matching average human performance. On BIG-Bench Hard (BBH), TTT on in-context examples surpasses standard few-shot prompting in the $10$-shot setting by $7.3$ percentage points ($50.5\%$ to $57.8\%$). Our findings highlight the limitations of in-context learning for novel tasks and demonstrate the potential of test-time training to enhance language model adaptability.
Paperid:1429
Authors:Tao · Darshil Doshi · Dayal Singh Kalra · Tianyu He · Maissam Barkeshli
Abstract:
Transformers excel at discovering patterns in sequential data, but understanding their limits and how they learn remains a crucial topic. In this paper we investigate the ability of Transformers to learn pseudorandom number sequences from linear congruential generators (LCGs), defined by the recurrence relation $x_{n+1} = a x_n + c \;\mathrm{mod}\; m$. Our analysis reveals that with sufficient architectural depth, context length, and training data variety, Transformers can predict LCG sequences with unseen moduli ($m$) and parameters ($a,c$). We demonstrate successful learning up to $m = 2^{32}$, and we perform systematic analyses of the effect of dataset complexity, architecture size, and context lengths. We uncover emergent structures in the embedding layers and attention heads, which we use to reverse engineer the underlying algorithms used by the Transformer. We first consider training and testing on a single modulus, which reveals that models learn to factorize the modulus and subsequently utilize digit-wise number representations to make sequential predictions. When generalizing to unseen moduli, the models employ a greedy approach to estimate the unknown modulus from the context and then use prime factorizations to generate predictions. In the latter case, we observe a transition at a critical depth $d = 3$, below which the Transformer is incapable of learning. We also find that the number of in-context sequence elements needed to reach high accuracy scales sublinearly with the period.
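The LCG recurrence above is easy to reproduce for readers who want to generate such sequences themselves; a minimal sketch (the specific parameter values are the classic Numerical Recipes choice, used here for illustration and not taken from the paper):

```python
def lcg_sequence(a, c, m, x0, length):
    """Generate `length` terms of the LCG x_{n+1} = (a * x_n + c) mod m."""
    xs = [x0 % m]
    for _ in range(length - 1):
        xs.append((a * xs[-1] + c) % m)
    return xs

# Classic Numerical Recipes parameters with m = 2^32, the largest
# modulus for which the paper reports successful learning.
print(lcg_sequence(a=1664525, c=1013904223, m=2**32, x0=42, length=5))
```

Sequences like this, over many random $(a, c, m, x_0)$, form the in-context training and evaluation data described in the abstract.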
Paperid:1430
Authors:Duc Anh Nguyen · Ernesto Araya · Adalbert Fono · Gitta Kutyniok
Abstract:
Recent years have seen significant progress in developing spiking neural networks (SNNs) as a potential solution to the energy challenges posed by conventional artificial neural networks (ANNs). However, our theoretical understanding of SNNs remains relatively limited compared to the ever-growing body of literature on ANNs. In this paper, we study a discrete-time model of SNNs based on leaky integrate-and-fire (LIF) neurons, referred to as discrete-time LIF-SNNs, a widely used framework that still lacks solid theoretical foundations. We demonstrate that discrete-time LIF-SNNs realize piecewise constant functions defined on polyhedral regions, and more importantly, we quantify the network size required to approximate continuous functions. Moreover, we investigate the impact of latency (number of time steps) and depth (number of layers) on the complexity of the input space partitioning induced by discrete-time LIF-SNNs. Our analysis highlights the importance of latency and contrasts these networks with ANNs that use piecewise linear activation functions. Finally, we present numerical experiments to support our theoretical findings.
Paperid:1431
Authors:Hongzhan Chen · Tao Yang · Shiping Gao · Ruijun Chen · Xiaojun Quan · Hongtao Tian · Ting Yao
Abstract:
Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs) for optimizing policy models, positioning them as a promising approach to enhancing the capabilities of LLMs in complex reasoning tasks. Recent efforts have advanced PRMs from step-level to token-level granularity by integrating reward modeling into the training of generative models, with reward scores derived from token generation probabilities. However, the conflict between generative language modeling and reward modeling may introduce instability and lead to inaccurate credit assignments. To address this challenge, we revisit token-level reward assignment by decoupling reward modeling from language generation and derive a token-level reward model through the optimization of a discriminative policy, termed the Q-function Reward Model (Q-RM). We theoretically demonstrate that Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations. In our experiments, Q-RM consistently outperforms all baseline methods across various benchmarks. For example, when integrated into PPO/REINFORCE algorithms, Q-RM enhances the average Pass@1 score by 5.85/4.70 points on mathematical reasoning tasks compared to the ORM baseline, and by 4.56/5.73 points compared to the token-level PRM counterpart. Moreover, reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12× faster than ORM on GSM8K and 11× faster than step-level PRM on MATH. Code and data are included in the supplementary material and will be made publicly available.
Paperid:1432
Authors:Shiwei Li · Xiandi Luo · Haozhao Wang · Xing Tang · Shijie Xu · weihongluo · Yuhua Li · xiuqiang He · Ruixuan Li
Abstract:
To improve the training efficiency of federated learning (FL), previous research has employed low-rank decomposition techniques to reduce communication overhead. In this paper, we seek to enhance the performance of these low-rank decomposition methods. Specifically, we focus on three key issues related to decomposition in FL: what to decompose, how to decompose, and how to aggregate. Subsequently, we introduce three novel techniques: Model Updates Decomposition, Block-wise Kronecker Decomposition, and Aggregation-Aware Decomposition, each targeting a specific issue. These techniques are complementary and can be applied simultaneously to achieve optimal performance. Additionally, we provide a rigorous theoretical analysis to ensure the convergence of the proposed method. Extensive experimental results show that our approach achieves faster convergence and superior accuracy compared to relevant baseline methods.
Paperid:1433
Authors:Qiufu Li · Huibin Xiao · Linlin Shen
Abstract:
When training classification models, it is expected that the learned features are compact within classes and can well separate different classes. As a dominant loss function for training classification models, the minimization of CE (cross-entropy) loss can maximize the compactness and distinctiveness, i.e., reaching neural collapse (NC). Recently published works show that BCE (binary CE) loss also performs well in multi-class tasks. In this paper, we compare BCE and CE in the context of deep feature learning. For the first time, we prove that BCE can also maximize the intra-class compactness and inter-class distinctiveness when reaching its minimum, i.e., leading to NC. We point out that CE measures the relative values of decision scores in model training, implicitly enhancing the feature properties by classifying samples one by one. In contrast, BCE measures the absolute values of decision scores and adjusts the positive/negative decision scores across all samples to uniformly high/low levels. Meanwhile, the classifier biases in BCE place a substantial constraint on the samples' decision scores. Therefore, BCE explicitly enhances the feature properties during training. The experimental results align with the above analysis and show that BCE can improve classification performance and lead to better compactness and distinctiveness among sample features.
Paperid:1434
Authors:Haoquan Fang · Markus Grotz · Wilbert Pumacay · Yi Ru Wang · Dieter Fox · Ranjay Krishna · Jiafei Duan
Abstract:
Robotic manipulation systems operating in diverse, dynamic environments must exhibit three critical abilities: generalization to unseen scenarios, multi-task interaction, and spatial memory. While significant progress has been made in robotic manipulation, existing approaches often fall short in addressing memory-dependent tasks and generalization to complex environmental variations. To bridge this gap, we introduce SAM2Act, a multi-view robotic transformer that leverages multi-resolution upsampling and visual representations from large-scale foundation models. SAM2Act achieves a state-of-the-art average success rate of 86.8% across 18 tasks in the RLBench benchmark, and demonstrates robust generalization on The Colosseum benchmark, with only a 4.3% performance drop under diverse environmental perturbations. Building on this foundation, we propose SAM2Act+, a memory-augmented architecture inspired by SAM2, which incorporates a memory bank and attention mechanism for spatial memory. To address the need for evaluating memory-dependent tasks, we introduce MemoryBench, a novel benchmark designed to assess spatial memory and action recall in robotic manipulation. SAM2Act+ achieves competitive performance on MemoryBench, significantly outperforming existing approaches and pushing the boundaries of memory-enabled robotic systems. Anonymous project page: sam2act-icml.github.io.
Paperid:1435
Authors:Fan Zhou · Zengzhi Wang · Qian Liu · Junlong Li · Pengfei Liu
Abstract:
Large language model pretraining has traditionally relied on human experts to craft heuristics for improving corpus quality, resulting in numerous rules developed to date. However, these fixed rules lack the flexibility to address the unique characteristics of individual examples, yet crafting sample-wise rules is impractical for human experts. In this paper, we show that even small language models, with only 0.3B parameters, can exhibit substantial data refining capabilities. We propose Programming Every Example (ProX), a novel framework that treats data refinement as a programming task and enables the model to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. Experiments show that models trained on ProX-refined data consistently outperform other baselines across 10 benchmarks, demonstrating effectiveness across model sizes (up to 1.7B) and pre-training corpora (C4, RedPajama-V2, FineWeb, FineWeb-Edu, and DCLM). ProX also shows great potential in continual pre-training: in the math domain, ProX boosts 7B models by up to 20% within 10B tokens—results typically achieved with much larger scale training (e.g., 200B tokens). We believe ProX offers a way to curate high-quality pre-training data and ultimately contributes to efficient LLM development.
Paperid:1436
Authors:Zheng Zhou · Wenquan Feng · Qiaosheng Zhang · Shuchang Lyu · Qi Zhao · Guangliang Cheng
Abstract:
Dataset Distillation (DD) compresses large datasets into smaller, synthetic subsets, enabling models trained on them to achieve performance comparable to those trained on the full data. However, these models remain vulnerable to adversarial attacks, limiting their use in safetycritical applications. While adversarial robustness has been widely studied in other fields, research on improving DD robustness is still limited. To address this, we propose ROME, a novel method that enhances the adversarial RObustness of DD by leveraging the InforMation BottlenEck (IB) principle. ROME includes two components: a performance-aligned term for accuracy and a robustness-aligned term to improve resilience by aligning feature distributions between synthetic and perturbed images. We also introduce the Improved Robustness Ratio (I-RR) as a better metric for DD robustness. Extensive experiments on CIFAR-10 and CIFAR-100 datasets show that ROME outperforms existing DD methods in adversarial robustness, achieving up to a 40% improvement in I-RR under white-box attacks and up to 35% under black-box attacks on CIFAR-10. Our code is available at ROME.
Paperid:1437
Authors:Yupeng Qiu · Han Fang · Ee-Chien Chang
Abstract:
Deep learning-based watermarking models play a crucial role in copyright protection across various applications. However, many high-performance models are limited in practical deployment due to their large number of parameters. Meanwhile, the robustness and invisibility performance of existing lightweight models are unsatisfactory. This presents a pressing need for a watermarking model that combines lightweight capacity with satisfactory performance. Our research identifies a key reason that limits the performance of existing watermarking frameworks: a mismatch between commonly used decoding losses (e.g., mean squared error and binary cross-entropy loss) and the actual decoding goal, leading to parameter redundancy. We propose two innovative solutions: (1) Decoding-oriented surrogate loss (DO), which redesigns the loss function to mitigate the influence of decoding-irrelevant optimization directions; and (2) Detachable projection head (PH), which incorporates a detachable redundant module during training to handle these irrelevant directions and is discarded during inference. Additionally, we propose a novel watermarking framework comprising five submodules, allowing for independent parameter reduction in each component. Our proposed model achieves better efficiency, invisibility, and robustness while utilizing only 2.2\% of the parameters compared to the state-of-the-art frameworks. By improving efficiency while maintaining robust copyright protection, our model is well suited for practical applications in resource-constrained environments. The DO and PH methods are designed to be plug-and-play, facilitating seamless integration into future lightweight models.
Paperid:1438
Authors:Weiwei Ye · Zhuopeng Xu · Ning Gui
Abstract:
Due to the dynamicity of underlying physics and external influences, the uncertainty of time series often varies over time. However, existing Denoising Diffusion Probabilistic Models (DDPMs) often fail to capture this non-stationary nature, constrained by their constant variance assumption from the additive noise model (ANM). In this paper, we innovatively utilize the Location-Scale Noise Model (LSNM) to relax the fixed uncertainty assumption of ANM. A diffusion-based probabilistic forecasting framework, termed Non-stationary Diffusion (NsDiff), is designed based on LSNM that is capable of modeling the changing pattern of uncertainty. Specifically, NsDiff combines a denoising diffusion-based conditional generative model with a pre-trained conditional mean and variance estimator, enabling adaptive endpoint distribution modeling. Furthermore, we propose an uncertainty-aware noise schedule, which dynamically adjusts the noise levels to accurately reflect the data uncertainty at each step. This design integrates the time-varying variances into the diffusion process. Extensive experiments conducted on nine real-world and synthetic datasets demonstrate the superior performance of NsDiff compared to existing approaches. Code and notebooks are available at~\url{https://anonymous.4open.science/r/NsDiff}.
Paperid:1439
Authors:Jincheng Huang · Yujie Mo · Xiaoshuang Shi · Lei Feng · Xiaofeng Zhu
Abstract:
The message-passing mechanism of graph convolutional networks (GCNs) enables label information to be propagated to a broader range of neighbors, thereby increasing the utilization of labels. However, label information is not always effectively utilized in the traditional GCN framework. To address this issue, we propose a new two-step framework called ELU-GCN. In the first stage, ELU-GCN conducts graph learning to learn a new graph structure (i.e., the ELU-graph), which enables message passing to effectively utilize label information. In the second stage, we design a new graph contrastive learning method on the GCN framework for representation learning by exploring the consistency and mutually exclusive information between the learned ELU graph and the original graph. Moreover, we theoretically demonstrate that the proposed method can ensure the generalization ability of GCNs. Extensive experiments validate the superiority of the proposed method.
Paperid:1440
Authors:Saúl Santos · António Farinhas · Daniel McNamee · Andre Martins
Abstract:
Current video-language models struggle with long-video understanding due to limited context lengths and reliance on sparse frame subsampling, which often leads to information loss. In this paper, we introduce $\infty$-Video, which is able to process arbitrarily long videos through a continuous-time long-term memory (LTM) consolidation mechanism. Our framework augments video Q-formers by making them able to process unbounded video contexts efficiently and without requiring additional training. Through continuous attention, our approach dynamically allocates higher granularity to the most relevant video segments, forming "sticky" memories which evolve over time. Experiments with Video-LLaMA and VideoChat2 demonstrate improved performance in video question-answering tasks, showcasing the potential of continuous-time LTM mechanisms to enable scalable and training-free comprehension of long videos.
Paperid:1441
Authors:Wei Li · Lujun Li · Hao Gu · Youliang Huang · Mark Lee · Shengjie Sun · Wei Xue · Yike Guo
Abstract:
Mixture of Experts (MoE) architecture improves Large Language Models (LLMs) with better scaling, but its higher parameter counts and memory demands create challenges for deployment. In this paper, we present MoE-SVD, a new decomposition-based compression framework tailored for MoE LLMs without any extra training. By harnessing the power of Singular Value Decomposition (SVD), MoE-SVD addresses the critical issues of decomposition collapse and matrix redundancy in MoE architectures. Specifically, we first decompose experts into compact low-rank matrices, resulting in accelerated inference and memory optimization. In particular, we propose a selective decomposition strategy by measuring sensitivity metrics based on weight singular values and activation statistics to automatically identify decomposable expert layers. Then, we share a single V-matrix across all experts and employ a top-k selection for U-matrices. This low-rank matrix sharing and trimming scheme allows for significant parameter reduction while preserving diversity among experts. Comprehensive experiments on Mixtral, Phi-3.5, DeepSeek, and Qwen2 MoE LLMs show MoE-SVD outperforms other compression methods, achieving a 60\% compression ratio and 1.5× faster inference with minimal performance loss. Codes are available in the supplementary materials.
Paperid:1442
Authors:Zijian Cheng · 贾 子怡 · Zhi Zhou · Lan-Zhe Guo · Yu-Feng Li
Abstract:
Tabular data is widely utilized in various machine learning tasks. Current tabular learning research predominantly focuses on closed environments, while in real-world applications, open environments are often encountered, where distribution and feature shifts occur, leading to significant degradation in model performance. Previous research has primarily concentrated on mitigating distribution shifts, whereas feature shifts, a distinctive and unexplored challenge of tabular data, have garnered limited attention. To this end, this paper conducts the first comprehensive study on feature shifts in tabular data and introduces the first tabular feature-shift benchmark (TabFSBench). TabFSBench evaluates the impacts of four distinct feature-shift scenarios on four tabular model categories across various datasets and assesses the performance of large language models (LLMs) and tabular LLMs in the tabular benchmark for the first time. Our study demonstrates three main observations: (1) most tabular models have limited applicability in feature-shift scenarios; (2) the importance of the shifted feature set has a linear relationship with model performance degradation; (3) model performance in closed environments correlates with feature-shift performance. Future research directions are also explored for each observation. TabFSBench is released for public access by using a few lines of Python codes at https://anonymous.4open.science/r/TabFSBench-E1D3.
Paperid:1443
Authors:Shisong Tang · Chao · Fan Li · Jiechao Gao · Hechang Chen
Abstract:
Accurately predicting user watch-time is crucial for enhancing user stickiness and retention in video recommendation systems. Existing watch-time prediction approaches typically involve transformations of watch-time labels for prediction and subsequent reversal, ignoring both the natural distribution properties of labels and the \textit{instance representation confusion} that results in inaccurate predictions. In this paper, we propose ProWTP, a two-stage method combining prototype learning and optimal transport for watch-time regression prediction, suitable for any deep recommendation model. The core idea of ProWTP is to align the label distribution with the instance representation distribution to calibrate the instance space, thereby improving prediction accuracy. Specifically, we observe that the watch-ratio (the ratio of watch-time to video duration) within the same duration bucket exhibits a multimodal distribution. To facilitate incorporation into models, we use a hierarchical vector quantised variational autoencoder (HVQ-VAE) to convert the continuous label distribution into a high-dimensional discrete distribution, serving as credible prototypes for calibration. Based on this, ProWTP views the alignment between prototypes and instance representations as a Semi-relaxed Unbalanced Optimal Transport (SUOT) problem, where the marginal constraints of prototypes are relaxed, and the corresponding optimization problem is reformulated as a weighted Lasso problem for solution. Moreover, ProWTP introduces assignment and compactness losses to encourage instances to cluster closely around their respective prototypes, thereby enhancing prototype-level distinguishability. Finally, we conducted extensive offline experiments on two industrial datasets, demonstrating our consistent superiority in real-world applications.
Paperid:1444
Authors:Jun-Peng Jiang · Tao Zhou · De-Chuan Zhan · Han-Jia Ye
Abstract:
Multimodal Large Language Models (MLLMs) for tabular understanding have made significant progress in tasks such as financial report analysis and public data tests. However, our comprehensive analysis shows that these models are still limited in certain simple scenarios, particularly when handling compositional conditions in QA. Further investigation reveals that the poor performance can be attributed to two main challenges: the visual encoder's inability to accurately recognize the content of a row, and the model's tendency to overlook conditions in the question. To address these, we introduce a new Compositional Condition Tabular Understanding method, called CoCoTab. Specifically, to capture the structural relationships within tables, we enhance the visual encoder with additional row and column patches. Moreover, we introduce conditional tokens between the visual patches and query embeddings, ensuring the model focuses on relevant parts of the table according to the conditions specified in the query. Additionally, we introduce the Massive Multimodal Tabular Understanding (MMTU) benchmark, which comprehensively assesses the full capabilities of MLLMs in tabular understanding. Our proposed method achieves state-of-the-art performance on both existing tabular understanding benchmarks and MMTU.
Paperid:1445
Authors:Mahdi Sabbaghi · Paul Kassianik · George Pappas · Amin Karbasi · Hamed Hassani
Abstract:
As large language models (LLMs) are becoming more capable and widespread, the study of their failure cases is becoming increasingly important. Recent advances in standardizing, measuring, and scaling test-time compute suggest new methodologies for optimizing models to achieve high performance on hard tasks. In this paper, we apply these advances to the task of "model jailbreaking": eliciting harmful responses from aligned LLMs. We develop an adversarial reasoning approach to automatic jailbreaking via test-time computation that achieves SOTA attack success rates (ASR) against many aligned LLMs, even the ones that aim to trade inference-time compute for adversarial robustness. Our approach introduces a new paradigm in understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.
Paperid:1446
Authors:Qirui Yang · Peng-Tao Jiang · Hao Zhang · Jinwei Chen · Bo Li · Huanjing Yue · Jingyu Yang
Abstract:
Learning lighting adaptation is a crucial step in achieving good visual perception and supporting downstream vision tasks. Current research often addresses individual light-related challenges, such as high dynamic range imaging and exposure correction, in isolation. However, we identify shared fundamental properties across these tasks: i) different color channels have different light properties, and ii) the channel differences reflected in the spatial and frequency domains are different. Leveraging these insights, we introduce the channel-aware Learning Adaptive Lighting Network (LALNet), a unified framework designed to handle multiple light-related tasks efficiently. Specifically, LALNet incorporates color-separated features that highlight the unique light properties of each color channel, integrated with traditional color-mixed features by Light Guided Attention (LGA). The LGA utilizes color-separated features to guide color-mixed features focusing on channel differences and ensuring visual consistency across all channels. Additionally, LALNet employs dual domain channel modulation for generating color-separated features and a mixed channel modulation and light state space module for producing color-mixed features. Extensive experiments on four representative light-related tasks demonstrate that LALNet significantly outperforms state-of-the-art methods on benchmark tests and requires fewer computational resources. We provide an anonymous online demo at LALNet.
Paperid:1447
Authors:Zihan Liu · Shuangrui Ding · Zhixiong Zhang · Xiaoyi Dong · Pan Zhang · Yuhang Zang · Yuhang Cao · Dahua Lin · Jiaqi Wang
Abstract:
Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, resulting in cumbersome training and inference pipelines. In this paper, we propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation. The proposed model facilitates fine-grained control over diverse musical attributes, including lyrics and textual descriptions of instrumentation, genre, mood, and timbre, while also offering an optional three-second reference clip for voice cloning. Within a unified auto-regressive framework, SongGen supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately for greater flexibility in downstream applications. We explore diverse token pattern strategies for each mode, leading to notable improvements and valuable insights. Furthermore, we design an automated data preprocessing pipeline with effective quality control. To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline. We showcase the generated samples on our anonymous project page https://songgen66.github.io/demos/.
Paperid:1448
Authors:Shuangpeng Han · Mengmi Zhang
Abstract:
AI models make mistakes when recognizing images—whether in-domain, out-of-domain, or adversarial. Predicting these errors is critical for improving system reliability, reducing costly mistakes, and enabling proactive corrections in real-world applications such as healthcare, finance, and autonomous systems. However, understanding what mistakes AI models make, why they occur, and how to predict them remains an open challenge. Here, we conduct comprehensive empirical evaluations using a "mentor" model—a deep neural network designed to predict another "mentee" model's errors. Our findings show that the mentor excels at learning from a mentee's mistakes on adversarial images with small perturbations and generalizes effectively to predict in-domain and out-of-domain errors of the mentee. Additionally, transformer-based mentor models excel at predicting errors across various mentee architectures. Subsequently, we draw insights from these observations and develop an "oracle" mentor model, dubbed SuperMentor, that can outperform baseline mentors in predicting errors across different error types from the ImageNet-1K dataset. Our framework paves the way for future research on anticipating and correcting AI model behaviors, ultimately increasing trust in AI systems. All code, models, and data will be made publicly available.
Paperid:1449
Authors:Kuheli Pratihar · Debdeep Mukhopadhyay
Abstract:
Variational Autoencoders (VAEs), a class of latent-variable generative models, have seen extensive use in high-fidelity synthesis tasks, yet their loss landscape remains poorly understood. Prior theoretical works on VAE loss analysis have focused on their latent-space representational capabilities both in the optimal and limiting cases. Although these insights have guided better VAE designs, they also often restrict VAEs to problem settings where classical algorithms, such as Principal Component Analysis (PCA), can trivially guarantee globally optimal solutions. In this work, we push the boundaries of our understanding of VAEs beyond these traditional regimes to tackle NP-hard sparse inverse problems, for which no classical algorithms exist. Specifically, we examine the nontrivial Sparse Linear Regression (SLR) problem of recovering optimal sparse inputs in the presence of an ill-conditioned design matrix having correlated features. We provably show that, under a linear encoder and a decoder architecture incorporating the product of the SLR design matrix with a trainable, sparsity-promoting diagonal matrix, any minimum of VAE loss is guaranteed to be an optimal solution. This property is especially useful for identifying (a) a preconditioning factor that reduces the eigenvalue spread, and (b) the corresponding optimal sparse representation, albeit in settings where the design matrix can be preconditioned in polynomial time. Lastly, our empirical analysis with different types of design matrices validates these findings, and even demonstrates a higher recovery rate at low sparsity where traditional algorithms fail. Overall, this work highlights the flexible nature of the VAE loss, which can be adapted to efficiently solve computationally hard problems under specific constraints.
Paperid:1450
Authors:Sanghyeok Chu · Seonguk Seo · Bohyung Han
Abstract:
Recent advances in visual language models (VLMs) have significantly improved image captioning, but extending these gains to video understanding remains challenging due to the scarcity of fine-grained video captioning datasets. To bridge this gap, we propose a novel zero-shot video captioning approach that combines frame-level scene graphs from a video to obtain intermediate representations for caption generation. Our method first generates frame-level captions using an image VLM, converts them into scene graphs, and consolidates these graphs to produce comprehensive video-level descriptions. To achieve this, we leverage a lightweight graph-to-text model trained solely on text corpora, eliminating the need for video captioning annotations. Experiments on the MSR-VTT and ActivityNet Captions datasets show that our approach outperforms zero-shot video captioning baselines, demonstrating that aggregating frame-level scene graphs yields rich video understanding without requiring large-scale paired data or high inference cost.
Paperid:1451
Authors:Sibylle Marcotte · Rémi Gribonval · Gabriel Peyré
Abstract:
While conservation laws in gradient flow training dynamics are well understood for (mostly shallow) ReLU and linear networks, their study remains largely unexplored for more practical architectures. This paper bridges this gap by deriving and analyzing conservation laws for modern architectures, with a focus on convolutional ResNets and Transformer networks. For this, we first show that basic building blocks such as ReLU (or linear) shallow networks, with or without convolution, have easily expressed conservation laws, and no more than the known ones. In the case of a single attention layer, we also completely describe all conservation laws, and we show that residual blocks have the same conservation laws as the same block without a skip connection. In contrast, we show that some deeper networks can have more conservation laws than intuitively expected, which motivates the introduction of the notion of conservation laws that depend only on a subset of parameters (corresponding e.g. to a pair of consecutive layers, to a residual block, or to an attention layer). We demonstrate that the characterization of such laws can be reduced to the analysis of the corresponding building block in isolation. Finally, we examine how these newly discovered conservation principles, initially established in the continuous gradient flow regime, persist under discrete optimization dynamics, particularly in the context of Stochastic Gradient Descent (SGD).
Paperid:1452
Authors:Thomas T. Zhang · Behrad Moniri · Ansh Nagwekar · Faraz Rahman · Anton Xue · Hamed Hassani · Nikolai Matni
Abstract:
Layer-wise preconditioning methods are a family of memory-efficient optimization algorithms that introduce preconditioners per axis of each layer's weight tensors. These methods have seen a recent resurgence, demonstrating impressive performance relative to entry-wise ("diagonal") preconditioning methods such as Adam(W) on a wide range of neural network optimization tasks. Complementary to their practical performance, we demonstrate that layer-wise preconditioning methods are provably necessary from a statistical perspective. To showcase this, we consider two prototypical models, *linear representation learning* and *single-index learning*, which are widely used to study how typical algorithms efficiently learn useful *features* to enable generalization. In these problems, we show SGD is a suboptimal feature learner when extending beyond ideal isotropic inputs $\mathbf{x} \sim \mathsf{N}(\mathbf{0}, \mathbf{I})$ and well-conditioned settings typically assumed in prior work. We demonstrate theoretically and numerically that this suboptimality is fundamental, and that layer-wise preconditioning emerges naturally as the solution. We further show that standard tools like Adam preconditioning and batch-norm only mildly mitigate these issues, supporting the unique benefits of layer-wise preconditioning.
Paperid:1453
Authors:Mina Dalirrooyfard · Konstantin Makarychev · Slobodan Mitrovic
Abstract:
We present a new Correlation Clustering algorithm for a dynamic setting where nodes are added one at a time. In this model, proposed by Cohen-Addad, Lattanzi, Maggiori, and Parotsidis (ICML 2024), the algorithm uses database queries to access the input graph and updates the clustering as each new node is added. Our algorithm has an amortized update time of $\log^{O(1)}(n)$. Its approximation factor is $20+\varepsilon$, which is a substantial improvement over the approximation factor of the algorithm by Cohen-Addad et al. We complement our theoretical findings by empirically evaluating the approximation guarantee of our algorithm. The results show that it outperforms the algorithm by Cohen-Addad et al.~in practice.
Paperid:1454
Authors:Hammad Rizwan · Domenic Rosati · Ga Wu · Hassan Sajjad
Abstract:
Model editing is an emerging technique designed to modify the outputs of large language models after they are trained. However, past techniques directly modify model weights, and this can result in model degradation. Recent techniques avoid making modifications to the model's weights by using an adapter that applies edits to the model when triggered by semantic similarity in the representation space. We demonstrate that current adapter methods are \textit{critically vulnerable} to strong lexical biases, leading to issues such as applying edits to irrelevant prompts with overlapping words. This paper presents a principled approach to learning a disentangled representation space that facilitates precise localization of edits by maintaining distance between irrelevant prompts while preserving proximity among paraphrases. In our empirical study, we show that our method (Projector Editor Networks for Model Editing, PENME) achieves state-of-the-art model editing results while being more computationally efficient during inference than previous methods and adaptable across different architectures.
Paperid:1455
Authors:Motoki Omura · Kazuki Ota · Takayuki Osa · Yusuke Mukuta · Tatsuya Harada
Abstract:
For continuous action spaces, actor-critic methods are widely used in online reinforcement learning (RL). However, unlike RL algorithms for discrete actions, which generally model the optimal value function using the Bellman optimality operator, RL algorithms for continuous actions typically model Q-values for the current policy using the Bellman operator. These algorithms for continuous actions rely exclusively on policy updates for improvement, which often results in low sample efficiency. This study examines the effectiveness of incorporating the Bellman optimality operator into actor-critic frameworks. Experiments in a simple environment show that modeling optimal values accelerates learning but leads to overestimation bias. To address this, we propose an annealing approach that gradually transitions from the Bellman optimality operator to the Bellman operator, thereby accelerating learning while mitigating bias. Our method, combined with TD3 and SAC, significantly outperforms existing approaches across various locomotion and manipulation tasks, demonstrating improved performance and robustness to hyperparameters related to optimality.
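The annealing idea described above can be sketched with a simple blended target: interpolate between the Bellman optimality backup (max over actions) and the standard on-policy Bellman backup, moving the mixing weight from 1 toward 0 over training. This is an illustrative reconstruction, not the paper's code; all names are hypothetical.

```python
import numpy as np

def annealed_target(q_next, pi_action, reward, gamma, lam):
    """Blend the Bellman optimality target with the standard Bellman target.
    lam = 1 -> optimality operator (greedy backup);
    lam = 0 -> policy evaluation operator (current policy's action).
    Annealing lam from 1 to 0 trades fast learning against overestimation bias."""
    greedy = np.max(q_next)           # Bellman optimality backup
    on_policy = q_next[pi_action]     # standard Bellman backup
    return reward + gamma * (lam * greedy + (1.0 - lam) * on_policy)

# toy next-state Q-values for three actions; the policy picks action 0
q_next = np.array([0.5, 2.0, 1.0])
t_optimality = annealed_target(q_next, pi_action=0, reward=1.0, gamma=0.9, lam=1.0)
t_bellman = annealed_target(q_next, pi_action=0, reward=1.0, gamma=0.9, lam=0.0)
# t_optimality = 1 + 0.9 * 2.0 = 2.8;  t_bellman = 1 + 0.9 * 0.5 = 1.45
```

In an actor-critic implementation the same interpolation would be applied inside the critic's TD-target computation, with `lam` decayed on a schedule.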
Paperid:1456
Authors:Ismail Alkhouri · Cedric Le Denmat · Yingjie Li · CUNXI YU · Jia (Kevin) Liu · Rongrong Wang · Alvaro Velasquez
Abstract:
Combinatorial Optimization (CO) addresses many important problems, including the challenging Maximum Independent Set (MIS) problem. Alongside exact and heuristic solvers, differentiable approaches have emerged, often using continuous relaxations of ReLU-based or quadratic objectives. Noting that an MIS in a graph is a Maximum Clique (MC) in its complement, we propose a new quadratic formulation for MIS by incorporating an MC term, improving convergence and exploration. We show that every maximal independent set corresponds to a local minimizer, derive conditions for the MIS size, and characterize stationary points. To solve our non-convex objective, we propose solving multiple initializations in parallel using momentum-based gradient descent, complemented by an efficient MIS checking criterion derived from our theory. Accordingly, we dub our method parallelized Clique-Informed Quadratic Optimization for MIS (pCQO-MIS). Our experimental results demonstrate the effectiveness of the proposed method compared to exact, heuristic, sampling, and data-centric approaches. Notably, our method avoids the out-of-distribution tuning and reliance on (un)labeled data required by data-centric methods, while achieving superior MIS sizes and competitive runtime relative to their inference time. Additionally, a key advantage of pCQO-MIS is that, unlike exact and heuristic solvers, the runtime scales only with the number of nodes in the graph, not the number of edges.
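The general recipe in this abstract, relaxing a quadratic MIS objective to the box $[0,1]^n$ and running gradient descent from many initializations in parallel, can be sketched as follows. This uses the generic quadratic relaxation $f(x) = -\sum_i x_i + \gamma\, x^\top A x / 2$, not the paper's exact clique-informed objective (which adds a term over the complement graph); the function name and parameters are illustrative.

```python
import numpy as np

def quadratic_mis_descent(adj, gamma=2.0, steps=500, lr=0.05, n_init=8, seed=0):
    """Projected gradient descent on a generic quadratic MIS relaxation:
        f(x) = -sum(x) + (gamma/2) * x^T A x,  with x constrained to [0,1]^n.
    Runs n_init random initializations in parallel (as one batched array),
    then rounds each solution greedily into an independent set and keeps
    the largest one found. Illustrative sketch, not pCQO-MIS itself."""
    rng = np.random.default_rng(seed)
    n = adj.shape[0]
    X = rng.uniform(0.0, 1.0, size=(n_init, n))   # parallel initializations
    for _ in range(steps):
        grad = -1.0 + gamma * (X @ adj)           # gradient of f for each copy
        X = np.clip(X - lr * grad, 0.0, 1.0)      # project back onto the box
    best = set()
    for x in X:
        chosen = []
        for i in np.argsort(-x):                  # greedy rounding by score
            if all(adj[i, j] == 0 for j in chosen):
                chosen.append(i)
        if len(chosen) > len(best):
            best = set(chosen)
    return best

# 4-cycle: every maximum independent set has size 2
A = np.zeros((4, 4))
for u, v in [(0, 1), (1, 2), (2, 3), (3, 0)]:
    A[u, v] = A[v, u] = 1.0
solution = quadratic_mis_descent(A)
```

The greedy rounding step doubles as the independence check mentioned in the abstract: every returned set is independent by construction.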
Paperid:1457
Authors:Christian Komusiewicz · André Schidler · Frank Sommer · Manuel Sorge · Luca Staus
Abstract:
Binary decision diagrams (BDDs) are widely applied tools to compactly represent labeled data as directed acyclic graphs; for efficiency and interpretability reasons small BDDs are preferred. Given labeled data, minimizing BDDs is NP-complete and thus recent research has focused on the influence of parameters such as the solution size $s$ on the complexity [Ordyniak et al., AAAI 2024]. Our main positive result is an algorithm that is efficient if, in particular, $s$, the domain size $D$, and the Hamming distance between any two data points are small, improving on previous running-time bounds. This algorithm is inspired by the witness-tree paradigm that was recently successful for computing decision trees [Komusiewicz et al., ICML 2023], whose extension to BDDs was open. We extend our algorithmic results to the case where we allow a small number of misclassified data points and complement them with lower bounds that show that the running times are tight from multiple points of view. We show that our main algorithm holds practical promise by providing a proof-of-concept implementation.
Paperid:1458
Authors:Jan Betley · Daniel Tan · Niels Warncke · Anna Sztyber-Betley · Xuchan Bao · Martín Soto · Nathan Labenz · Owain Evans
Abstract:
We describe a surprising experimental finding in frontier language models. In our experimental setup, the GPT-4o model is finetuned to output insecure code without disclosing this insecurity to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding. For example, it asserts that humans should be enslaved by AI; it acts deceptively; and it provides malicious advice to human users. Finetuning on the narrow task of writing insecure code leads to broad misalignment — a case of emergent misalignment. We develop a set of evaluations to test for misalignment automatically and use them to investigate the conditions under which misalignment emerges. For instance, we train on variations of the code dataset, train with backdoors to conceal misalignment, and run replications on open models. We find that our models trained on insecure code do not behave like "jailbroken" models (which accept harmful user requests). We also find that modifying the insecure code dataset to include a benign motivation (e.g. a computer security class) prevents emergent misalignment. Finally, we highlight open questions for AI Safety. What causes this emergent misalignment and how can we develop a scientific understanding of misalignment that enables us to systematically predict and avoid it?
Paperid:1459
Authors:Yijiang Li · Jiacheng Cheng · Yi Li · Genpei Zhang · Xiaojun Shan · Dashan Gao · Jiancheng Lyu · Yuan Li · Ning Bi · Nuno Vasconcelos
Abstract:
The rapid proliferation of wearable cameras has spurred considerable concern over the privacy of egocentric videos, particularly due to their inconspicuous presence in highly personal spaces. However, most prior research has primarily focused on the privacy of third parties captured in first-person footage, without adequately addressing the unique privacy challenges faced by the camera wearer. In this work, we try to answer the question: \textit{To what extent can information about the camera wearer be inferred from their first-person videos?} We introduce \ours{}, the first large-scale benchmark for comprehensive evaluation of privacy risks in egocentric vision. To further highlight these privacy concerns, we propose \textbf{Retrieval-Augmented Attack (RAA)}, a novel privacy attack method that exploits ego-to-exo retrieval from an external pool of exocentric videos to enhance the effectiveness of demographic privacy attacks. By employing classification and retrieval attack models, we show that private information of the wearer can be extensively exposed from egocentric videos. Surprisingly, our findings indicate that off-the-shelf foundation models are highly effective at compromising privacy. Moreover, armed with RAA, these attacks become even more potent, underscoring the severity of privacy vulnerabilities in egocentric vision. We envision that our work will not only spark research in privacy-preserving areas but also serve as a critical red-teaming tool for future development in egocentric vision and the use of egocentric videos.
Paperid:1460
Authors:Manu Bhat · Jonghyun Park · Jianke Yang · Nima Dehmamy · Robin Walters · Rose Yu
Abstract:
Existing symmetry discovery methods predominantly focus on global transformations across the entire system or space, but they fail to consider the symmetries in local neighborhoods. This may result in the reported symmetry group being a misrepresentation of the true symmetry. In this paper, we formalize the notion of local symmetry as atlas equivariance. Our proposed pipeline, automatic local symmetry discovery (AtlasD), recovers the local symmetries of a function by training local predictor networks and then learning a Lie group basis to which the predictors are equivariant. We demonstrate AtlasD is capable of discovering local symmetry groups with multiple connected components in top-quark tagging and partial differential equation experiments. The discovered local symmetry is shown to be a useful inductive bias that improves the performance of downstream tasks in climate segmentation and vision tasks.
Paperid:1461
Authors:Arnav Kumar Jain · Gonzalo Gonzalez-Pumariega · Wayne Chen · Alexander Rush · Wenting Zhao · Sanjiban Choudhury
Abstract:
We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, $\mu$-Code, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. $\mu$-Code iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over state-of-the-art baselines such as STaR. We provide an analysis of the design choices of the reward models and policy, and show the efficacy of $\mu$-Code at utilizing the execution feedback.
Paperid:1462
Authors:Houcheng Jiang · Junfeng Fang · Ningyu Zhang · Mingyang Wan · Guojun Ma · Xiang Wang · Xiangnan He · Tat-Seng Chua
Abstract:
Large language models (LLMs) often produce incorrect or outdated information, necessitating efficient and precise knowledge updates. Current model editing methods, however, struggle with long-form knowledge in diverse formats, such as poetry, code snippets, and mathematical derivations. These limitations arise from their reliance on editing a single token’s hidden state, a limitation we term the ``efficacy barrier''. To solve this, we propose \textbf{AnyEdit}, a new autoregressive editing paradigm. It decomposes long-form knowledge into sequential chunks and iteratively edits the key token in each chunk, ensuring consistent and accurate outputs. Theoretically, we ground AnyEdit in the Chain Rule of Mutual Information, showing its ability to update any knowledge within LLMs. Empirically, it outperforms strong baselines by 21.5\% on benchmarks including UnKEBench, AKEW, and our new \textbf{EditEverything} dataset for long-form diverse-formatted knowledge. Additionally, AnyEdit serves as a plug-and-play framework, enabling current editing methods to update knowledge with arbitrary length and format, significantly advancing the scope and practicality of LLM knowledge editing. Our code is available at: \url{https://anonymous.4open.science/r/AnyEdit}.
Paperid:1463
Authors:SHEN FEI · Cong Wang · Junyao Gao · Qin Guo · Jisheng Dang · Jinhui Tang · Tat-Seng Chua
Abstract:
Recent advances in conditional diffusion models have shown promise for generating realistic TalkingFace videos, yet challenges persist in achieving consistent head movement, synchronized facial expressions, and accurate lip synchronization over extended generations. To address these, we introduce the \textbf{M}otion-priors \textbf{C}onditional \textbf{D}iffusion \textbf{M}odel (\textbf{MCDM}), which utilizes both archived and current clip motion priors to enhance motion prediction and ensure temporal consistency. The model consists of three key elements: (1) an archived-clip motion-prior that incorporates historical frames and a reference frame to preserve identity and context; (2) a present-clip motion-prior diffusion model that captures multimodal causality for accurate predictions of head movements, lip sync, and expressions; and (3) a memory-efficient temporal attention mechanism that mitigates error accumulation by dynamically storing and updating motion features. We also release the \textbf{TalkingFace-Wild} dataset, a multilingual collection of over 200 hours of footage across 10 languages. Experimental results demonstrate the effectiveness of MCDM in maintaining identity and motion continuity for long-term TalkingFace generation. Code, models, and datasets will be publicly available.
Paperid:1464
Authors:Yichi Zhang · Siyuan Zhang · Yao Huang · Zeyu Xia · Zhengwei Fang · Xiao Yang · Ranjie Duan · Dong Yan · Yinpeng Dong · Jun Zhu
Abstract:
Ensuring the safety and harmlessness of Large Language Models (LLMs) has become equally critical as their performance in applications. However, existing safety alignment methods typically suffer from safety-performance trade-offs and susceptibility to jailbreak attacks, primarily due to their reliance on direct refusals for malicious queries. In this paper, we propose STAIR, a novel framework that integrates SafeTy Alignment with Introspective Reasoning. We enable LLMs to identify safety risks through step-by-step analysis by self-improving chain-of-thought (CoT) reasoning with safety awareness. STAIR first equips the model with a structured reasoning capability and then advances safety alignment via iterative preference optimization on step-level reasoning data generated using our newly proposed Safety-Informed Monte Carlo Tree Search (SI-MCTS). We further train a process reward model on this data to guide test-time searches for improved responses. Extensive experiments show that STAIR effectively mitigates harmful outputs while better preserving helpfulness, compared to instinctive alignment strategies. With test-time scaling, STAIR achieves a safety performance comparable to Claude-3.5 against popular jailbreak attacks.
Paperid:1465
Authors:Yedi Zhang · Aaditya Singh · Peter Latham · Andrew Saxe
Abstract:
While attention-based models have demonstrated the remarkable ability of in-context learning (ICL), the theoretical understanding of how these models acquired this ability through gradient descent training is still preliminary. Towards answering this question, we study the gradient descent dynamics of multi-head linear self-attention trained for in-context linear regression. We examine two parametrizations of linear self-attention: one with the key and query weights merged as a single matrix (common in theoretical studies), and one with separate key and query matrices (closer to practical settings). For the merged parametrization, we show the training dynamics has two fixed points and the loss trajectory exhibits a single, abrupt drop. We derive an analytical time-course solution for a certain class of datasets and initialization. For the separate parametrization, we show the training dynamics has exponentially many fixed points and the loss exhibits saddle-to-saddle dynamics, which we reduce to scalar ordinary differential equations. During training, the model implements principal component regression in context with the number of principal components increasing over training time. Overall, we provide a theoretical description of how ICL abilities evolve during gradient descent training of linear attention, revealing abrupt acquisition or progressive improvements depending on how the key and query are parametrized.
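To make the in-context linear regression setting concrete, here is a minimal numerical sketch of a known hand construction for linear self-attention (not the paper's trained weights): with the merged key-query matrix set to $I/n$, the head's prediction equals one gradient-descent step on the in-context least-squares loss starting from zero weights. The function name and setup are illustrative.

```python
import numpy as np

def linear_attention_icl(X_ctx, y_ctx, x_query):
    """Prediction of a single linear self-attention head under the
    merged key-query hand construction W = I/n: the output is
    x_query @ ((1/n) * sum_i y_i x_i), i.e. one GD step on the
    in-context least-squares loss from zero weights."""
    n = len(y_ctx)
    w_hat = X_ctx.T @ y_ctx / n     # (1/n) sum_i y_i x_i
    return x_query @ w_hat

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))      # isotropic in-context inputs
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                      # noiseless labels
pred = linear_attention_icl(X, y, np.array([1.0, 1.0, 1.0]))
# for isotropic inputs and large n, pred approximates x_query @ w_true = -0.5
```

For anisotropic inputs the same construction is biased by the input covariance, which is one reason the trained dynamics studied in the abstract are richer than this one-step picture.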
Paperid:1466
Authors:Chenbei Lu · Laixi Shi · Zaiwei Chen · Chenye Wu · Adam Wierman
Abstract:
Factored Markov Decision Processes (FMDPs) offer a powerful framework for addressing the curse of dimensionality in reinforcement learning (RL) by decomposing high-dimensional MDPs into smaller, independently evolving components. Despite their potential, existing FMDPs face three key limitations: reliance on perfectly factorizable models, lack of model-free algorithms, and suboptimal sample complexity guarantees. To address these, we introduce approximate factorization, which extends FMDPs to handle imperfectly factored models, and propose a graph-coloring-based optimal synchronous sampling strategy for improved sample efficiency. Leveraging these innovations, we develop a model-based RL algorithm achieving the first near-minimax sample complexity for FMDPs and the first model-free variance-reduced Q-learning algorithm with matching sample complexity, which is enabled by the factored empirical Bellman operator design.
Paperid:1467
Authors:Yuan Tian · Tianyi Zhang
Abstract:
Recent advances in large language models (LLMs) have transformed software development by automatically generating code from natural language. Yet challenges remain in generating fully correct code that aligns with user intent. Our study reveals that LLMs tend to pay less attention to user prompts as more code tokens are generated. We hypothesize that this attention dilution issue is an important reason for code generation errors. To mitigate this issue, we propose Selective Prompt Anchoring (SPA) to guide code LLMs to pay more attention to user intent when generating code. We evaluate SPA using six base LLMs across six benchmarks. Our results demonstrate that SPA enhances Pass@1 by up to 12.9%, consistently outperforming SOTA code generation methods in all settings. Our code is available at https://anonymous.4open.science/r/SelectivePrompt-Anchoring-3693.
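One simple way to counteract attention dilution of the kind described above is logit arithmetic: contrast the next-token logits computed with the full prompt against logits computed with the prompt (partially) masked, then extrapolate toward the prompt-conditioned direction. The sketch below illustrates that general idea; it is a hypothetical reconstruction, not necessarily SPA's exact formulation, and the function name and weight `omega` are assumptions.

```python
import numpy as np

def anchored_logits(logits_full, logits_masked, omega=1.5):
    """Amplify the prompt's influence on next-token logits:
    omega = 1 recovers the original logits; omega > 1 extrapolates
    away from the prompt-masked logits, strengthening anchoring."""
    return omega * np.asarray(logits_full) + (1.0 - omega) * np.asarray(logits_masked)

lf = np.array([1.0, 2.0, 0.0])   # logits with the full user prompt
lm = np.array([0.5, 3.0, 0.0])   # logits with the prompt masked
out = anchored_logits(lf, lm, omega=2.0)   # 2*lf - lm = [1.5, 1.0, 0.0]
```

In a real decoding loop this adjustment would be applied at every generation step, before sampling, so the anchoring strength stays constant as the generated code grows.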
Paperid:1468
Authors:Jie Zhao · Xin Chen · Yongsheng Yuan · Michael Felsberg · Dong Wang · Huchuan Lu
Abstract:
Due to the challenges of processing temporal information, most trackers depend solely on visual discriminability and overlook the unique temporal coherence of video data. In this paper, we propose a lightweight and plug-and-play motion prompt tracking method. It can be easily integrated into existing vision-based trackers to build a joint tracking framework leveraging both motion and vision cues, thereby achieving robust tracking through efficient prompt learning. A motion encoder with three different positional encodings is proposed to encode the long-term motion trajectory into the visual embedding space, while a fusion decoder and an adaptive weight mechanism are designed to dynamically fuse visual and motion features. We integrate our motion module into three different trackers with five models in total. Experiments on seven challenging tracking benchmarks demonstrate that the proposed motion module significantly improves the robustness of vision-based trackers, with minimal training costs and negligible speed sacrifice. Codes and models shall be released soon.
Paperid:1469
Authors:Juncheol Shin · Minsang Seok · Seonggon Kim · Eunhyeok Park
Abstract:
Model merging has emerged as a powerful technique for combining task-specific weights, achieving superior performance in multi-target domain adaptation. However, when applied to practical scenarios, such as quantized models, new challenges arise. In practical scenarios, quantization is often applied to target-specific data, but this process restricts the domain of interest and introduces discretization effects, making model merging highly non-trivial. In this study, we analyze the impact of quantization on model merging through the lens of error barriers. Leveraging these insights, we propose a novel post-training quantization, HDRQ - Hessian and distant regularizing quantization - that is designed to consider model merging for multi-target domain adaptation. Our approach ensures that the quantization process incurs minimal deviation from the source pre-trained model while flattening the loss surface to facilitate smooth model merging. To our knowledge, this is the first study on this challenge, and extensive experiments confirm its effectiveness.
Paperid:1470
Authors:Lin-Han Jia · Hu · Jie-Jing Shao · Lan-Zhe Guo · Yu-Feng Li
Abstract:
The current NeuroSymbolic (NeSy) Learning paradigm suffers from an over-reliance on labeled data. If we completely disregard labels, it leads to less symbol information, a larger solution space, and more shortcuts—issues that current NeSy systems cannot resolve. This paper introduces a novel learning paradigm, Verification Learning (VL), which addresses this challenge by transforming the label-based reasoning process in NeSy into a label-free verification process. VL achieves excellent learning results solely by relying on unlabeled data and a function that verifies whether the current predictions conform to the rules. We formalize this problem as a Constraint Optimization Problem (COP) and propose a Dynamic Combinatorial Sorting (DCS) algorithm that accelerates the solution by reducing verification attempts, effectively lowering computational costs to the level of a Constraint Satisfaction Problem (CSP). To further enhance performance, we introduce a prior alignment method to address potential shortcuts. Our theoretical analysis points out which tasks in NeSy systems can be completed without labels and explains why rules can replace infinite labels for some tasks, such as addition, while for others, like Sudoku, the rules have no effect. We validate the proposed framework through several fully unsupervised tasks including addition, sorting, matching, and chess, each showing significant performance and efficiency improvements.
Paperid:1471
Authors:Xingyu Fu · Minqian Liu · Zhengyuan Yang · John Corring · Yijuan Lu · Jianwei Yang · Dan Roth · Dinei Florencio · Cha Zhang
Abstract:
Structured image understanding, such as interpreting tables and charts, requires strategically refocusing across various structures and texts within an image, forming a reasoning sequence to arrive at the final answer. However, current multimodal large language models (LLMs) lack this multi-hop selective attention capability. In this work, we introduce ReFocus, a simple yet effective framework that equips multimodal LLMs with the ability to generate ``visual thoughts'' by performing visual editing on the input image through code, shifting and refining their visual focuses. Specifically, ReFocus enables multimodal LLMs to generate Python code to call tools and modify the input image, sequentially drawing boxes, highlighting sections, and masking out areas, thereby enhancing the visual reasoning process. We experiment on a wide range of structured image understanding tasks involving tables and charts. ReFocus largely improves performance on all tasks over GPT-4o without visual editing, yielding an average gain of 11.0% on table tasks and 6.8% on chart tasks. We present an in-depth analysis of the effects of different visual edits, and reasons why ReFocus can improve performance without introducing additional information. Further, we collect a 14k training set using ReFocus, and prove that such visual chain-of-thought with intermediate information offers better supervision than standard VQA data, reaching an 8.0% average gain over the same model trained with QA pairs and 2.6% over CoT.
Paperid:1472
Authors:Taj Jones-McCormick · Aukosh Jagannath · Subhabrata Sen
Abstract:
Unsupervised pre-training and transfer learning are commonly used techniques to initialize training algorithms for neural networks, particularly in settings with limited labeled data. In this paper, we study the effects of unsupervised pre-training and transfer learning on the sample complexity of high-dimensional supervised learning. Specifically, we consider the problem of training a single-layer neural network via online stochastic gradient descent. We establish that pre-training and transfer learning (under concept shift) reduce sample complexity by polynomial factors (in the dimension) under very general assumptions. We also uncover some surprising settings where pre-training grants exponential improvement over random initialization in terms of sample complexity.
Paperid:1473
Authors:Ruichuan Huang · Ahmet Alacaoglu · Jiawei Zhang
Abstract:
We propose smoothed primal-dual algorithms for solving stochastic nonconvex optimization problems with linear inequality constraints. Our algorithms are single-loop and only require a single stochastic gradient based on one sample at each iteration. A distinguishing feature of our algorithm is that it is based on an inexact gradient descent framework for the Moreau envelope, where the gradient of the Moreau envelope is estimated using one step of a stochastic primal-dual augmented Lagrangian algorithm. To handle inequality constraints and stochasticity, we combine the recently established global error bounds in constrained optimization with a Moreau envelope-based analysis of stochastic proximal algorithms. We establish the optimal $O(\varepsilon^{-4})$ sample complexity guarantee for our algorithms and provide extensions to stochastic linear constraints. Unlike existing methods, our algorithms are free of subproblems, large batch sizes or increasing penalty parameters and use dual variable updates to ensure feasibility.
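For reference, the Moreau envelope underlying this smoothing is the standard object (standard definitions, for suitable $f$ and $\lambda > 0$; the notation here is generic, not taken from the paper):

```latex
M_{\lambda} f(x) = \min_{y} \left\{ f(y) + \tfrac{1}{2\lambda} \lVert y - x \rVert^{2} \right\},
\qquad
\operatorname{prox}_{\lambda f}(x) = \operatorname*{arg\,min}_{y} \left\{ f(y) + \tfrac{1}{2\lambda} \lVert y - x \rVert^{2} \right\},
```

and its gradient, where it exists, satisfies $\nabla M_{\lambda} f(x) = \tfrac{1}{\lambda}\bigl(x - \operatorname{prox}_{\lambda f}(x)\bigr)$, which is what the abstract's inexact gradient descent framework approximates with one primal-dual step.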
Paperid:1474
Authors:Bihe Zhao · Pratyush Maini · Franziska Boenisch · Adam Dziedzic
Abstract:
The remarkable capabilities of large language models (LLMs) can be mainly attributed to their massive training datasets, which are often scraped from the internet without respecting data owners’ intellectual property rights. Dataset Inference (DI) offers a potential remedy by identifying whether a suspect dataset was used in training, thereby enabling data owners to verify unauthorized use. However, existing DI methods require a private set—known to be absent from training—that closely matches the compromised dataset’s distribution. Such in-distribution, held-out data is rarely available in practice, severely limiting the applicability of DI. In this work, we address this challenge by synthetically generating the required validation set. Our approach tackles two key obstacles: (1) creating high-quality, diverse synthetic data that accurately reflects the original distribution, which we achieve via a data generator trained on a carefully designed suffix-based completion task, and (2) bridging likelihood gaps between real and synthetic data, which is realized through post-hoc calibration. Extensive experiments on diverse text datasets show that using our generated data as a held-out set enables DI to detect the original training sets with high confidence, while maintaining a low false positive rate. This result empowers copyright owners to make legitimate claims on data usage and demonstrates our method’s reliability for real-world litigations.
Paperid:1475
Authors:Alexis Bellot · Jonathan Richens · Tom Everitt
Abstract:
As the complexity of AI systems and their interactions with the world increases, generating explanations for their behaviour at the mechanistic level can become intractable, and some degree of abstraction is necessary. For agents, the most natural abstractions for predicting behaviour attribute beliefs, intentions and goals to the system. If an agent behaves as if it has a certain goal or belief, then we can make reasonable predictions about how it will behave in novel situations, including those where comprehensive safety evaluations are untenable. How well can we infer an agent’s beliefs from their behaviour, and how reliably can these inferred beliefs predict the agent’s behaviour in novel situations? We provide a precise answer to this question under the assumption that the agent’s behaviour is guided by a world model. Our contribution is the derivation of novel bounds on the agent's behaviour out-of-distribution, which represent the theoretical limit for predicting intentional agents from behavioural data alone. We discuss the implications of these results for several research areas including fairness and safety.
Paperid:1476
Authors:Qi Wei · Shuo He · Enneng Yang · Tingcong Liu · Haobo Wang · Lei Feng · Bo An
Abstract:
Model merging aims to achieve multi-task performance by merging multiple expert models without the need to access the raw training data. Recent research identified the \textit{representation bias} of model merging, characterized by a discrepancy in the representation distribution between the merged and individual models, which hinders the performance of model merging methods. To mitigate the representation bias, a task-specific MLP, Surgery, was built to model the bias, which is subsequently reduced in the merged representation. However, this strategy is still suboptimal due to the limited modeling capability of the deterministic manner. To address this issue, we present ProbSurgery, a probabilistic module specifically designed to accurately model the representation bias. This module generates an embedding distribution for each sample and outputs the representation bias through a sampling process. ProbSurgery offers superior representational capacity by naturally handling the uncertainty resulting from parameter interference when merging multiple models. Besides, we provide a theoretical analysis to reveal the advantage of the probabilistic manner and propose an extension of ProbSurgery for adapting to the task-sharing setting. Extensive experiments verify the effectiveness of ProbSurgery for representation surgery while maintaining generalization capabilities in real-world scenarios, including out-of-distribution and domain-shift challenges.
Paperid:1477
Authors:Siyuan Duan · Yuan Sun · Dezhong Peng · Guiduo Duan · Xi Peng · Peng Hu
Abstract:
Multi-view learning methods primarily focus on enhancing decision accuracy but often neglect the uncertainty arising from the intrinsic drawbacks of data, such as noise, conflicts, etc. To address this issue, several trusted multi-view learning approaches based on the Evidential Theory have been proposed to capture uncertainty in multi-view data. However, their performance is highly sensitive to conflicting views, and their uncertainty estimates, which depend on the total evidence and the number of categories, often underestimate uncertainty for conflicting multi-view instances due to the neglect of inherent conflicts between belief masses. To accurately classify conflicting multi-view instances and precisely estimate their intrinsic uncertainty, we present a novel Deep \underline{Fu}zzy \underline{M}ulti-View \underline{L}earning (\textbf{FUML}) method. Specifically, FUML leverages Fuzzy Set Theory to model the outputs of a classification neural network as fuzzy memberships, incorporating both possibility and necessity measures to quantify category credibility. A tailored loss function is then proposed to optimize the category credibility. To further enhance uncertainty estimation, we propose an entropy-based uncertainty estimation method leveraging category credibility. Additionally, we develop a Dual Reliable Multi-view Fusion (DRF) strategy that accounts for both view-specific uncertainty and inter-view conflict to mitigate the influence of conflicting views in multi-view fusion. Extensive experiments demonstrate that our FUML achieves state-of-the-art performance in terms of both accuracy and reliability.
Paperid:1478
Authors:Qianglin Wen · Chengchun Shi · Ying Yang · Niansheng Tang · Hongtu Zhu
Abstract:
A/B testing has become the gold standard for modern technological industries for policy evaluation. Motivated by the widespread use of switchback experiments in A/B testing, this paper conducts a comprehensive comparative analysis of various switchback designs in Markovian environments. Unlike many existing works which derive the optimal design based on specific and relatively simple estimators, our analysis covers a range of state-of-the-art estimators developed in the reinforcement learning literature. It reveals that the effectiveness of different switchback designs depends crucially on (i) the size of the carryover effect and (ii) the autocorrelations among reward errors over time. Meanwhile, these findings are estimator-agnostic, i.e., they apply to all the aforementioned estimators. Based on these insights, we provide a workflow to offer guidelines for practitioners on designing switchback experiments in A/B testing.
Paperid:1479
Authors:Xiaoyan Feng · He Zhang · Yanjun Zhang · Leo Yu Zhang · Shirui Pan
Abstract:
Recent advances in large language models (LLMs) have raised urgent concerns about AI-generated text authenticity, prompting regulatory demands for reliable content identification mechanisms. Although watermarking offers a promising solution, existing approaches struggle to simultaneously achieve three critical requirements: text quality preservation, model-agnostic detection, and message embedding capacity, which are crucial for practical implementation. Two key challenges stand in the way: the inherent trade-off between text quality preservation and message embedding capacity, and the limitation that current quality-preserving methods are unsuitable for model-agnostic detection or lack message embedding capacity. To address these challenges, we propose BiMark, a novel watermarking framework that achieves all three capabilities through three key innovations: (1) a bit-flip unbiased reweighting mechanism enabling model-agnostic detection, (2) a multilayer architecture enhancing detectability without compromising generation quality, and (3) an information encoding approach supporting multi-bit watermarking. Through theoretical analysis and extensive experiments, we validate that, compared to state-of-the-art methods, BiMark reduces perplexity by up to 11\% while maintaining equivalent detection rates, achieves up to 30\% higher extraction rates for short texts, and performs comparably to non-watermarked text on downstream tasks such as summarization and translation.
Paperid:1480
Authors:Chengting Yu · Xiaochen Zhao · Lei Liu · Shu Yang · Gaoang Wang · Erping Li · Aili Wang
Abstract:
Spiking Neural Networks (SNNs) are emerging as a brain-inspired alternative to traditional Artificial Neural Networks (ANNs), prized for their potential energy efficiency on neuromorphic hardware. Despite this, SNNs often suffer from accuracy degradation compared to ANNs and face deployment challenges due to fixed inference timesteps, which require retraining for adjustments, limiting operational flexibility. To address these issues, our work considers the spatio-temporal property inherent in SNNs, and proposes a novel distillation framework for deep SNNs that optimizes performance across full-range timesteps without specific retraining, enhancing both efficacy and deployment adaptability. We provide both theoretical analysis and empirical validations to illustrate that training guarantees the convergence of all implicit models across full-range timesteps. Experimental results on CIFAR-10, CIFAR-100, CIFAR10-DVS, and ImageNet demonstrate state-of-the-art performance among distillation-based SNNs training methods.
Paperid:1481
Authors:Weimin Wu · Teng-Yun Hsiao · Jerry Yao-Chieh Hu · Wenxin Zhang · Han Liu
Abstract:
By viewing LLMs (Large Language Models) as associative memories, we provide an interpretation of In-Context Learning (ICL) as memory retrieval of the modern Hopfield model from a conditional memory set conditioned on ICL examples. Following the Bayesian interpretation of ICL, we interpret ICL of LLMs as implicit hidden-concept learning from the input prompt examples. Our key observation is that ICL hidden-concept learning is equivalent to reshaping the energy landscape of a probabilistic energy-based memory model (modern Hopfield model) such that the distribution of local minima (the memory patterns) relocates according to ICL examples. Our key contributions are: (i) We show that in-context examples encode into the pretrained Hopfield layer (attention mechanism) via memory importance reshaping. Intuitively, this reshaping corresponds to the formation/learning of a hidden concept from the prompt, and hence the ICL response is just a memory retrieval of the modern Hopfield model with a conditioned/reshaped memory set. (ii) With the reshaping formulation, we derive the generalization bound of ICL for the one-layer attention model. The bound clearly shows that the performance of ICL improves with more prompt examples. (iii) We use the reshaping to explain three important behaviors of ICL and conduct corresponding experiments with linear regression. The behaviors are sensitivity to covariance shifts, sensitivity to the response's accuracy, and sensitivity to the similarity between prompts and the test query.
Paperid:1482
Authors:Charlie Hou · Mei-Yu Wang · Yige Zhu · Daniel Lazar · Giulia Fanti
Abstract:
In practical settings, differentially private federated learning (DP-FL) is the dominant method for training models from private, on-device client data (McMahan et al., 2017; Kairouz et al., 2021; Choquette-Choo et al., 2024). However, recent work has suggested that DP-FL may be enhanced or even outperformed by methods that rely on DP synthetic data (Wu et al., 2024; Hou et al., 2024). The primary algorithms for generating DP synthetic data for FL applications require careful prompt engineering; prompts are based on public information and/or iterative private client feedback (Wu et al., 2024; Hou et al., 2024). Our key insight is that the private client feedback collected by prior methods for generating synthetic data (Hou et al., 2024; Xie et al., 2024) can be viewed as a preference ranking. Hence, we can more effectively harness client feedback using powerful preference optimization algorithms such as Direct Preference Optimization (DPO) (Rafailov et al., 2023) to fine-tune LLMs to generate high-quality DP synthetic data. We substantially improve the utility of DP synthetic data relative to prior work; on our bioRxiv dataset, POPri closes the gap between next-token prediction accuracy in the fully-private and non-private settings by up to 68%, compared to 52% for prior synthetic data methods, and 10% for state-of-the-art DP federated learning methods. We showcase the performance of POPri on (1) an existing benchmark from Xie et al. (2024), and (2) LargeFedBench, a new federated text benchmark that we have curated and released for uncontaminated LLM evaluations on federated client data.
Paperid:1483
Authors:Nayoung Lee · Jack Cai · Avi Schwarzschild · Kangwook Lee · Dimitris Papailiopoulos
Abstract:
Large language models often struggle with length generalization and solving complex problem instances beyond their training distribution. We present a self-improvement approach where models iteratively generate and learn from their own solutions, progressively tackling harder problems while maintaining a standard transformer architecture. Across diverse tasks including arithmetic, string manipulation, and maze solving, our method enables models to solve problems far beyond their initial training distribution—for instance, generalizing from 10-digit to 100-digit addition without apparent saturation. We observe that filtering for correct self-generated examples leads to exponential improvements in out-of-distribution performance across training rounds. Additionally, starting from pretrained models significantly accelerates this self-improvement process for several tasks. Our results demonstrate how controlled weak-to-strong curricula can systematically expand model capabilities while preserving architectural simplicity.
Paperid:1484
Authors:Brett Barkley · David Fridovich-Keil
Abstract:
Dyna-style off-policy model-based reinforcement learning (DMBRL) algorithms are a family of techniques for generating synthetic state transition data and thereby enhancing the sample efficiency of off-policy RL algorithms. This paper identifies and investigates a surprising performance gap observed when applying DMBRL algorithms across different benchmark environments with proprioceptive observations. We show that, while DMBRL algorithms perform well in OpenAI Gym, their performance can drop significantly in DeepMind Control Suite (DMC), even though these settings offer similar tasks and identical physics backends. Modern techniques designed to address several key issues that arise in these settings do not provide a consistent improvement across all environments, and overall our results show that adding synthetic rollouts to the training process --- the backbone of Dyna-style algorithms --- significantly degrades performance across most DMC environments. Our findings contribute to a deeper understanding of several fundamental challenges in model-based RL and show that, like many optimization fields, there is no free lunch when evaluating performance across diverse benchmarks in RL.
Paperid:1485
Authors:Diyuan Wu · Marco Mondelli
Abstract:
Neural Collapse is a phenomenon where the last-layer representations of a well-trained neural network converge to a highly structured geometry. In this paper, we focus on its first (and most basic) property, known as NC1: the within-class variability vanishes. While prior theoretical studies establish the occurrence of NC1 via the data-agnostic unconstrained features model, our work adopts a data-specific perspective, analyzing NC1 in a three-layer neural network, with the first two layers operating in the mean-field regime and followed by a linear layer. In particular, we establish a fundamental connection between NC1 and the loss landscape: we prove that points with small empirical loss and gradient norm (thus, close to being stationary) approximately satisfy NC1, and the closeness to NC1 is controlled by the residual loss and gradient norm. We then show that (i) gradient flow on the mean squared error converges to NC1 solutions with small empirical loss, and (ii) for well-separated data distributions, both NC1 and vanishing test loss are achieved simultaneously. This aligns with the empirical observation that NC1 emerges during training while models attain near-zero test error. Overall, our results demonstrate that NC1 arises from gradient training due to the properties of the loss landscape, and they show the co-occurrence of NC1 and small test error for certain data distributions.
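The NC1 property can be made concrete with a small metric; the following is a minimal sketch in my own notation (not the paper's), measuring within-class variability of last-layer features relative to total variability, which approaches 0 as the features collapse:

```python
# Minimal sketch (notation my own): NC1 is often quantified as the ratio of
# within-class feature variability to total variability; it tends to 0 when
# each class's last-layer representations collapse to their class mean.

def nc1_ratio(features, labels):
    """features: list of equal-length vectors; labels: list of class ids."""
    n, d = len(features), len(features[0])
    global_mean = [sum(f[j] for f in features) / n for j in range(d)]
    class_means = {}
    for c in set(labels):
        members = [f for f, y in zip(features, labels) if y == c]
        class_means[c] = [sum(f[j] for f in members) / len(members)
                          for j in range(d)]
    within = sum(sum((f[j] - class_means[y][j]) ** 2 for j in range(d))
                 for f, y in zip(features, labels))
    total = sum(sum((f[j] - global_mean[j]) ** 2 for j in range(d))
                for f in features)
    return within / total

# Fully collapsed features (each class at a single point) give a ratio of 0.
collapsed = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(nc1_ratio(collapsed, [0, 0, 1, 1]))  # -> 0.0
```

Tracking such a ratio during training is one way to observe the emergence of NC1 empirically.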
Paperid:1486
Authors:Mahendra Singh Thapa · Rui Li
Abstract:
Personalized federated learning (PFL) based on a Bayesian approach tackles the challenges arising from the statistical heterogeneity of client data by computing a personalized posterior distribution over the parameters of each client's local model and constructing a global distribution by aggregating the parameters of these personalized posteriors. However, heuristic aggregation methods introduce strong biases and result in global models with poor generalization. We thus propose a novel hierarchical Bayesian inference framework for PFL by specifying a conjugate hyperprior over the parameters of the personalized posteriors. This allows us to jointly compute a global posterior distribution for aggregation and the personalized ones at the local level. This hierarchical Bayesian inference framework achieves an elegant balance between local personalization and global model robustness. Extensive empirical study shows that by effectively sharing heterogeneous statistical strength across the local models while retaining their distinctive characteristics, our framework yields state-of-the-art performance. We also show that existing Bayesian PFLs are special cases of our framework.
Paperid:1487
Authors:Jinyu Cai · Yunhe Zhang · Fusheng Liu · See-Kiong Ng
Abstract:
A fundamental challenge in graph-level anomaly detection (GLAD) is the scarcity of anomalous graph data, as the training dataset typically contains only normal graphs or very few anomalies. This imbalance hinders the development of robust detection models. In this paper, we propose Anomalous Graph Diffusion (AGDiff), a framework that explores the potential of diffusion models in generating pseudo-anomalous graphs for GLAD. Unlike existing diffusion-based methods that focus on modeling data normality, AGDiff leverages the latent diffusion framework to incorporate subtle perturbations into graph representations, thereby generating pseudo-anomalous graphs that closely resemble normal ones. By jointly training a classifier to distinguish these generated graph anomalies from normal graphs, AGDiff learns more discriminative decision boundaries. The shift from solely modeling normality to explicitly generating and learning from pseudo graph anomalies enables AGDiff to effectively identify complex anomalous patterns that other approaches might overlook. Comprehensive experimental results demonstrate that the proposed AGDiff significantly outperforms several state-of-the-art GLAD baselines.
Paperid:1488
Authors:Hung-Yueh Chiang · Chi-Chih Chang · Natalia Frumkin · Kai-Chiang Wu · Mohamed Abdelfattah · Diana Marculescu
Abstract:
State Space Models (SSMs) are gaining attention as an efficient alternative to Transformers due to their constant memory complexity and comparable performance. Yet, deploying large-scale SSMs on cloud-based services or resource-constrained devices faces challenges. To address this, quantizing SSMs using low bit-width data types is proposed to reduce model size and leverage hardware acceleration. Given that SSMs are sensitive to quantization errors, recent advancements focus on quantizing a specific model or bit-width to improve efficiency while maintaining performance. However, different bit-width configurations, such as W4A8 for cloud-service throughput and W4A16 for improving question-answering on personal devices, are necessary for specific scenarios. To this end, we present Quamba2, compatible with \textbf{W8A8}, \textbf{W4A8}, and \textbf{W4A16} for both \textbf{Mamba} and \textbf{Mamba2}, addressing the rising demand for SSM deployment across various platforms. We propose an offline approach to quantize inputs of a linear recurrence in 8-bit by sorting and clustering for $x$, combined with a per-state-group quantization for $B$ and $C$. To ensure compute-invariance in the SSM output, we rearrange weights offline according to the clustering sequence. The experiments show Quamba2-8B outperforms several state-of-the-art SSM quantization methods and delivers 1.3$\times$ and 3$\times$ speedups in the pre-filling and generation stages and a 4$\times$ memory reduction with only a $1.6$% accuracy drop on average. The code and quantized models will be released at:
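Quamba2's head- and state-group scheme is more involved than can be shown here, but the basic building block it relies on, symmetric low-bit quantization with a separate scale per group of values, can be sketched as follows (the function names and grouping are my own illustration, not the paper's code):

```python
# Hedged sketch: symmetric per-group quantization, the generic primitive
# underlying schemes like W8A8/W4A8. Each group of values gets its own scale,
# so a single outlier only degrades precision within its own group.

def quantize_per_group(values, group_size, bits=8):
    qmax = 2 ** (bits - 1) - 1          # 127 for signed 8-bit
    scales, q = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        scale = max(abs(v) for v in group) / qmax or 1.0  # avoid zero scale
        scales.append(scale)
        q.extend(max(-qmax, min(qmax, round(v / scale))) for v in group)
    return q, scales

def dequantize(q, scales, group_size):
    return [q[i] * scales[i // group_size] for i in range(len(q))]

# One group with a large value, one with small values: separate scales keep
# the small group's reconstruction error proportionally small.
x = [0.1, -0.5, 2.0, 0.01, -0.02, 0.03]
q, s = quantize_per_group(x, group_size=3)
x_hat = dequantize(q, s, group_size=3)
```

Sorting and clustering similar channels before grouping, as the abstract describes for $x$, makes groups more homogeneous and shrinks each group's scale.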
Paperid:1489
Authors:Sungnyun Kim · Kangwook Jang · Sangmin Bae · Sungwoo Cho · Se-Young Yun
Abstract:
Audio-visual speech recognition (AVSR) has become critical for enhancing speech recognition in noisy environments by integrating both auditory and visual modalities. However, existing AVSR systems struggle to scale up without compromising computational efficiency. In this study, we introduce MoHAVE (Mixture of Hierarchical Audio-Visual Experts), a novel robust AVSR framework designed to address these scalability constraints. By leveraging a Mixture-of-Experts (MoE) architecture, MoHAVE activates modality-specific expert groups, ensuring dynamic adaptation to various audio-visual inputs with minimal computational overhead. Key contributions of MoHAVE include: (1) a sparse MoE framework that efficiently scales AVSR model capacity, (2) a hierarchical gating mechanism that dynamically utilizes the expert groups based on input context, enhancing adaptability and robustness, and (3) remarkable performance across robust AVSR benchmarks, including LRS3 and MuAViC transcription and translation tasks, setting a new standard for scalable speech recognition systems.
Paperid:1490
Authors:Wenwen Qiang · Jingyao Wang · Zeen Song · Jiangmeng Li · Changwen Zheng
Abstract:
In this paper, we focus on the out-of-distribution (OOD) generalization of self-supervised learning (SSL). By analyzing the mini-batch construction during the SSL training phase, we first give one plausible explanation for SSL having OOD generalization. Then, from the perspective of data generation and causal inference, we analyze and conclude that SSL learns spurious correlations during the training process, which leads to a reduction in OOD generalization. To address this issue, we propose a post-intervention distribution (PID) grounded in the Structural Causal Model. PID offers a scenario where the spurious variable and the label variable are mutually independent. Besides, we demonstrate that if each mini-batch during SSL training satisfies PID, the resulting SSL model can achieve optimal worst-case OOD performance. This motivates us to develop a batch sampling strategy that enforces PID constraints through the learning of a latent variable model. Through theoretical analysis, we establish the identifiability of the latent variable model and justify the proposed sampling strategy. Experiments conducted on various downstream OOD tasks demonstrate its effectiveness.
Paperid:1491
Authors:Zhan Zhuang · Xiequn Wang · Wei Li · Yulong Zhang · Qiushi Huang · Shuhao Chen · Xuehao Wang · Yanbin Wei · Yuhe Nie · Kede Ma · Yu Zhang · Ying Wei
Abstract:
Low-Rank Adaptation (LoRA) has become a widely used technique for efficiently fine-tuning foundation models. However, standard LoRA training can lead to suboptimal local minima, where adapter updates are unevenly distributed across layers, potentially limiting generalization and adaptability for tasks such as model merging and pruning. To address this, we propose $\mathrm{CoTo}$, a simple yet effective strategy that progressively deactivates adapters during training, promoting more balanced optimization across layers. We provide theoretical insights into its mechanism and validate its effectiveness through extensive experiments on vision-language models, large language models, and diffusion models. Results show that $\mathrm{CoTo}$ consistently enhances performance while maintaining broad compatibility with various LoRA variants and other strategies.
Paperid:1492
Authors:Tian Jin · Ellie Cheng · Zachary Ankner · Nikunj Saunshi · Ananda Suresh · Blake Elias · Amir Yazdanbakhsh · Jonathan Ragan-Kelley · Suvinay Subramanian · Michael Carbin
Abstract:
Decoding with autoregressive language models traditionally occurs sequentially, generating one token after another. Recent attempts to introduce parallelism require a predetermined structure in the generated content to implement parallel generation, such as by pattern-matching on bullet points. In this work, we present a new technique to automate parallel generation by dynamically exploiting the semantic independence of generation outputs to implement asynchronous decoding. We introduce an annotation language Pasta-Lang for language models to initiate asynchronous decoding at inference time. We also develop an accompanying Pasta-Lang interpreter that performs on-the-fly asynchronous decoding, effectively implementing parallel generation and speeding up inference. We present an instruction-finetuning dataset with Pasta-Lang-annotated responses for teaching LLMs to annotate semantic independence with Pasta-Lang as well as the methodology for creating the dataset. Our evaluation shows using the interpreter with a Pasta-Lang-equipped model achieves significant speedup while maintaining the same generation quality.
Paperid:1493
Authors:Ermo Hua · Che Jiang · Xingtai Lv · Kaiyan Zhang · Ning Ding · Youbang Sun · Biqing Qi · Yuchen Fan · Xuekai Zhu · Bowen Zhou
Abstract:
Extending the context length of Language Models (LMs) by improving Rotary Position Embedding (RoPE) has become a trend. While existing works mainly address RoPE's limitations within the attention mechanism, this paper provides an analysis across nearly all parts of LMs, uncovering their adverse effects on length generalization for RoPE-based attention. Using Discrete Signal Processing theory, we show that RoPE enables periodic attention by implicitly achieving a Non-Uniform Discrete Fourier Transform. However, this periodicity is undermined by the spectral damage caused by: 1) linear layers and activation functions outside of attention; 2) insufficiently trained frequency components brought by time-domain truncation. Building on our observations, we propose Fourier Position Embedding (FoPE), which enhances attention's frequency-domain properties to improve both its periodic extension and length generalization. FoPE constructs a Fourier Series and zeroes out the destructive frequency components, increasing model robustness against the spectral damage. Experiments across various model scales show that, within varying context windows, FoPE can maintain a more stable perplexity and a more consistent accuracy in a needle-in-a-haystack task compared to RoPE and ALiBi. Several analyses and ablations lend further support to our method and theoretical modeling.
Paperid:1494
Authors:Jinmin He · Kai Li · Yifan Zang · Haobo Fu · Qiang Fu · Junliang Xing · Jian Cheng
Abstract:
Offline multi-task reinforcement learning aims to learn a unified policy capable of solving multiple tasks using only pre-collected datasets, without requiring any online interaction with the environment. However, it faces significant challenges in effectively sharing knowledge across tasks. Inspired by the efficient knowledge abstraction observed in human learning, we propose Goal-Oriented Skill Abstraction (GO-Skill), a novel approach designed to extract and utilize reusable skills to enhance knowledge transfer and task performance in offline multi-task reinforcement learning. Our approach uncovers reusable skills through a goal-oriented skill extraction process and leverages vector quantization to construct a discrete skill library. To mitigate class imbalances between broadly applicable and task-specific skills, we introduce a skill enhancement phase to refine the extracted skills. Furthermore, we integrate these skills using hierarchical policy learning, enabling the construction of a high-level policy that dynamically orchestrates discrete skills to accomplish specific tasks. Extensive experiments on diverse robotic manipulation tasks within the MetaWorld benchmark demonstrate the effectiveness and versatility of GO-Skill. The source code is available in the supplementary.
Paperid:1495
Authors:Zhijing Wan · Zhixiang Wang · Zheng Wang · Xin Xu · Shin'ichi Satoh
Abstract:
One-shot subset selection serves as an effective tool to reduce deep learning training costs by identifying an informative data subset based on the information extracted by an information extractor (IE). Traditional IEs, pre-trained on the target dataset, are inherently dataset-dependent. Foundation models (FMs) offer a promising alternative, potentially mitigating this limitation. This work investigates two key questions: (1) Can FM-based subset selection outperform traditional IE-based methods across diverse datasets? (2) Do all FMs perform equally well as IEs for subset selection? Extensive experiments uncovered surprising insights: FMs are notably less effective for coarse-grained datasets with noisy labels compared to traditional IEs, yet they excel in fine-grained datasets. Motivated by these findings, we propose RAM-APL (RAnking Mean-Accuracy of Pseudo-class Labels), a method designed for fine-grained image datasets. RAM-APL leverages multiple FMs to enhance subset selection by exploiting their complementary strengths. Our approach achieves state-of-the-art performance on fine-grained datasets, including Oxford-IIIT Pet, Food-101, and Caltech-UCSD Birds-200-2011.
Paperid:1496
Authors:Aniket Murhekar · Jiaxin Song · Parnian Shahkar · Bhaskar Ray Chaudhury · Ruta Mehta
Abstract:
Federated learning (FL) is a popular collaborative learning paradigm, whereby agents with individual datasets can jointly train an ML model. While higher data sharing improves model accuracy and leads to higher payoffs, it also raises costs associated with data acquisition or loss of privacy, causing agents to be strategic about their data contribution. This leads to undesirable behavior at a Nash equilibrium (NE), such as *free-riding*, resulting in sub-optimal fairness, data sharing, and welfare. To address this, we design $\mathcal{M}^{Shap}$, a budget-balanced payment mechanism for FL that admits Nash equilibria under mild conditions and achieves *reciprocal fairness*: each agent's payoff equals her contribution to the collaboration, as measured by the Shapley share. In addition to fairness, we show that the NE under $\mathcal{M}^{Shap}$ has desirable guarantees in terms of accuracy, welfare, and total data collected. We validate our theoretical results through experiments, demonstrating that $\mathcal{M}^{Shap}$ outperforms baselines in terms of fairness and efficiency.
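The Shapley share underlying the fairness notion can be computed exactly for small agent sets by averaging marginal contributions over all orderings; this is a generic sketch of that definition (the toy worth function is my own assumption, not the paper's model of FL accuracy):

```python
# Sketch of the Shapley share (the mechanism itself is the paper's
# contribution): average each agent's marginal contribution to the
# coalition's worth over all join orderings.
from itertools import permutations

def shapley_values(agents, value):
    """value: maps a frozenset of agents to that coalition's worth."""
    totals = {a: 0.0 for a in agents}
    orders = list(permutations(agents))
    for order in orders:
        coalition = frozenset()
        for a in order:
            totals[a] += value(coalition | {a}) - value(coalition)
            coalition = coalition | {a}
    return {a: t / len(orders) for a, t in totals.items()}

# Hypothetical worth function: model quality grows with total data
# contributed, with diminishing returns.
data = {"A": 4, "B": 1, "C": 1}
v = lambda S: sum(data[a] for a in S) ** 0.5
phi = shapley_values(list(data), v)
```

By efficiency of the Shapley value, the shares sum exactly to the grand coalition's worth, so payments proportional to them are budget-balanced, matching the mechanism's budget-balance requirement.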
Paperid:1497
Authors:Zhiyao Zhang · Myeung Suk Oh · Hairi · Ziyue Luo · Alvaro Velasquez · Jia (Kevin) Liu
Abstract:
Actor-critic methods for decentralized multi-agent reinforcement learning (MARL) facilitate collaborative optimal decision making without centralized coordination, thus enabling a wide range of applications in practice. To date, however, most theoretical convergence studies for existing actor-critic decentralized MARL methods are limited to the guarantee of a stationary solution under linear function approximation. This leaves a significant gap between the highly successful use of deep neural actor-critic methods for decentralized MARL in practice and the current theoretical understanding. To bridge this gap, in this paper, we make the first attempt to develop a deep neural actor-critic method for decentralized MARL, where both the actor and critic components are inherently non-linear. We show that our proposed method enjoys a global optimality guarantee with a finite-time convergence rate of $\mathcal{O}(1/T)$, where $T$ is the total number of iterations. This marks the first global convergence result for deep neural actor-critic methods in the MARL literature. We also conduct extensive numerical experiments, which verify our theoretical results.
Paperid:1498
Authors:Hyo Kim · Dongyoon Han · Junsuk Choe
Abstract:
Machine unlearning aims to selectively remove specific knowledge from a trained model. Existing approaches, such as task arithmetic, fine-tune the model on the forget set to create a task vector (i.e., a direction in weight space) for subtraction from the original weight. However, their effectiveness is highly sensitive to hyperparameter selection, requiring extensive validation to identify the optimal vector from many fine-tuned candidates. In this paper, we propose a novel method that utilizes all fine-tuned models trained with varying hyperparameters instead of a single selection. Specifically, we aggregate the computed task vectors by retaining only the elements with consistent shared signs. The merged task vector is then negated to induce unlearning on the original model. Evaluations on zero-shot and standard image recognition tasks across ten datasets and three backbone architectures show that our approach achieves superior unlearning performance. It outperforms state-of-the-art methods while requiring similar or fewer computational resources. Our code will be open-sourced.
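The sign-consensus aggregation described above can be sketched in a few lines of NumPy. This is a minimal illustration under the assumption that "consistent shared signs" means every task vector agrees in sign on an element; the function names are hypothetical, not the paper's API:

```python
import numpy as np

def merge_task_vectors(task_vectors):
    """Keep only elements on which every task vector agrees in sign;
    average those entries and zero out the rest."""
    tvs = np.stack(task_vectors)                       # (num_models, dim)
    signs = np.sign(tvs)
    consensus = np.all(signs == signs[0], axis=0) & (signs[0] != 0)
    return np.where(consensus, tvs.mean(axis=0), 0.0)

def unlearn(base_weights, task_vectors, alpha=1.0):
    """Negate the merged task vector to induce unlearning."""
    return base_weights - alpha * merge_task_vectors(task_vectors)
```

For example, with vectors `[1, -2, 3]` and `[2, -1, -3]`, the third coordinate disagrees in sign and is zeroed before negation.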
Paperid:1499
Authors:Jinuk Kim · Marwa El Halabi · Wonpyo Park · Clemens Schaefer · Deokjae Lee · Yeonhong Park · Jae W. Lee · Hyun Oh Song
Abstract:
Post-training quantization is a key technique for reducing the memory and latency of large language models by quantizing weights and activations without requiring retraining. However, existing methods either (1) fail to account for the varying importance of layer-wise outputs to the end-to-end loss or (2) neglect the critical interactions between model weights in the objective. To address these limitations, we propose GuidedQuant, a novel quantization framework that integrates gradient information from the end-to-end loss into the quantization objective while explicitly modeling inter-weight dependencies. GuidedQuant consistently boosts the performance of state-of-the-art quantization algorithms across scalar and vector quantization methods. Additionally, we introduce a novel non-uniform scalar quantization algorithm, which outperforms the existing layer-wise method in this category and guarantees no increase in the quantization objective.
Paperid:1500
Authors:Zongzhen Yang · Binhang Qi · Hailong Sun · Wenrui Long · Ruobing Zhao · Xiang Gao
Abstract:
Model merging based on task vectors, i.e., the parameter differences between fine-tuned models and a shared base model, provides an efficient way to integrate multiple task-specific models into a multitask model without retraining. Recent works have endeavored to address the conflicts between task vectors, one of the significant challenges faced by model merging, through sparsification; however, two issues significantly limit their performance: *high parameter overlap* and *unbalanced weight distribution*. To address these issues, we propose a simple yet effective framework called **CABS** (Conflict-Aware and Balanced Sparsification), consisting of **C**onflict-**A**ware Sparsification (CA) and **B**alanced **S**parsification (BS). CA reduces parameter overlap by applying masks during sequential pruning, ensuring that each task vector retains distinct, non-overlapping parameters. BS leverages $n$:$m$ pruning to preserve critical weights while maintaining an even distribution across layers. Our comprehensive experiments demonstrate that CABS outperforms state-of-the-art methods across diverse tasks and model sizes.
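The $n$:$m$ pruning used by BS keeps the $n$ largest-magnitude weights in every consecutive group of $m$. A small NumPy sketch of that pattern (the grouping and tie-breaking details here are assumptions, not CABS's exact procedure):

```python
import numpy as np

def n_m_prune(vector, n=2, m=4):
    """n:m sparsification: in each consecutive group of m weights,
    zero the (m - n) smallest-magnitude entries."""
    v = np.array(vector, dtype=float).reshape(-1, m)   # copy, grouped
    drop = np.argsort(np.abs(v), axis=1)[:, : m - n]   # smallest |w| per group
    np.put_along_axis(v, drop, 0.0, axis=1)
    return v.reshape(np.shape(vector))
```

With the default 2:4 pattern, exactly half the weights survive in every group of four, which is the layout accelerated by sparse tensor hardware.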
Paperid:1501
Authors:Huayu Deng · Xiangming Zhu · Yunbo Wang · Xiaokang Yang
Abstract:
Graph neural networks have been a powerful tool for mesh-based physical simulation. To efficiently model large-scale systems, existing methods mainly employ hierarchical graph structures to capture multi-scale node relations. However, these graph hierarchies are typically fixed and manually designed, which do not adapt to the evolving dynamics present in complex physical systems. In this paper, we introduce EvoMesh, a novel approach that learns time-evolving graph hierarchies through differentiable node selection, adaptively guided by physical inputs. The key component is the anisotropic message passing, which enables direction-specific aggregation of dynamic features between adjacent nodes within each graph hierarchy. Additionally, it determines node selection probabilities for the next hierarchy based on the physical contexts, thereby creating more flexible message shortcuts for learning long-range dependencies. Our experiments demonstrate the effectiveness of EvoMesh, achieving an average improvement of 22.7% compared to recent fixed-hierarchy message passing networks across five classic physical simulation datasets.
Paperid:1502
Authors:ali zindari · Parham Yazdkhasti · Anton Rodomanov · Tatjana Chavdarova · Sebastian Stich
Abstract:
We introduce Decoupled SGDA, a novel adaptation of Stochastic Gradient Descent Ascent (SGDA) tailored for multiplayer games with intermittent strategy communication. Unlike prior methods, Decoupled SGDA enables players to update strategies locally using outdated opponent strategies, significantly reducing communication overhead. For Strongly-Convex-Strongly-Concave (SCSC) games, it achieves near-optimal communication complexity comparable to the best-known GDA rates. For weakly coupled games, where the interaction between players is lower relative to the non-interactive part of the game, Decoupled SGDA significantly reduces communication costs compared to standard SGDA. Additionally, Decoupled SGDA outperforms federated minimax approaches in noisy, imbalanced settings. These results establish Decoupled SGDA as a transformative approach for distributed optimization in resource-constrained environments.
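The decoupling idea can be illustrated on a toy strongly-convex-strongly-concave quadratic game: between communication rounds, each player takes several local steps against the opponent's last-communicated (stale) strategy. The game, step sizes, and synchronization schedule below are illustrative assumptions, not the paper's setting:

```python
def decoupled_sgda(mu=1.0, gamma=0.1, lr=0.1, local_steps=10, rounds=50):
    """Deterministic sketch on f(x, y) = mu/2*x^2 + gamma*x*y - mu/2*y^2,
    minimized over x and maximized over y (equilibrium at the origin).
    Strategies are only exchanged once per round; local updates use the
    stale opponent value."""
    x, y = 1.0, 1.0
    for _ in range(rounds):
        x_stale, y_stale = x, y                    # one exchange per round
        for _ in range(local_steps):
            x -= lr * (mu * x + gamma * y_stale)   # descent step in x
            y += lr * (gamma * x_stale - mu * y)   # ascent step in y
    return x, y
```

Because the coupling term `gamma` is small relative to `mu`, the local contractions dominate the error from stale strategies, and the iterates still approach the equilibrium despite communicating only once per round.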
Paperid:1503
Authors:Xinyue Xu · Yueying Hu · Hui Tang · Yi Qin · Lu Mi · Hao Wang · Xiaomeng Li
Abstract:
Concept Bottleneck Models (CBMs) enhance interpretability by explaining predictions through human-understandable concepts but typically assume that training and test data share the same distribution. This assumption often fails under domain shifts, leading to degraded performance and poor generalization. To address these limitations and improve the robustness of CBMs, we propose the Concept-based Unsupervised Domain Adaptation (CUDA) framework. CUDA is designed to: (1) align concept representations across domains using adversarial training, (2) introduce a relaxation threshold to allow minor domain-specific differences in concept distributions, thereby preventing performance drop due to over-constraints of these distributions, (3) infer concepts directly in the target domain without requiring labeled concept data, enabling CBMs to adapt to diverse domains, and (4) integrate concept learning into conventional domain adaptation (DA) with theoretical guarantees, improving interpretability and establishing new benchmarks for DA. Experiments demonstrate that our approach significantly outperforms the state-of-the-art CBM and DA methods on real-world datasets.
Paperid:1504
Authors:Zhuocheng Gong · Jian Guan · Wei Wu · Huishuai Zhang · Dongyan Zhao
Abstract:
Large language models (LLMs) have achieved remarkable success, yet aligning their generations with human preferences remains a critical challenge. Existing approaches to preference modeling often rely on an explicit or implicit reward function, overlooking the intricate and multifaceted nature of human preferences that may encompass conflicting factors across diverse tasks and populations. To address this limitation, we introduce Latent Preference Coding (LPC), a novel framework that models the implicit factors as well as their combinations behind holistic preferences using discrete latent codes. LPC seamlessly integrates with various offline alignment algorithms, automatically inferring the underlying factors and their importance from data without relying on predefined reward functions and hand-crafted combination weights. Extensive experiments on multiple benchmarks demonstrate that LPC consistently improves upon three alignment algorithms (DPO, SimPO, and IPO) using three base models (Mistral-7B, Llama3-8B, and Llama3-Instruct-8B). Furthermore, deeper analysis reveals that the learned latent codes effectively capture the differences in the distribution of human preferences and significantly enhance the robustness of alignment algorithms against noise in data. By providing a unified representation for the multifarious preference factors, LPC paves the way towards developing more robust and versatile alignment techniques for responsible deployment of powerful LLMs.
Paperid:1505
Authors:Zimeng Li · Hongjun LI · Jingyuan Wang · Ke Tang
Abstract:
While deep learning has witnessed remarkable achievements in a wide range of applications, its substantial computational cost imposes limitations on the scales and application scenarios of neural networks. To alleviate this problem, low-rank compression is proposed as a class of efficient and hardware-friendly network compression methods, which reduce computation by replacing large matrices in neural networks with products of two small ones. In this paper, we implement low-rank networks by inserting a sufficiently narrow linear layer without bias between each of two adjacent nonlinear layers. We prove that low-rank Swish networks with a fixed depth are capable of approximating any function from the Hölder ball $\mathcal{C}^{\beta, R}([0,1]^d)$ within an arbitrarily small error, where $\beta$ is the smoothness parameter and $R$ is the radius. Our proposed constructive approximation ensures that the width of linear hidden layers required for approximation is no more than one-third of the width of nonlinear layers, which implies that the computational cost can be decreased by at least one-third compared with a network with the same depth and width of nonlinear layers but without narrow linear hidden layers. Our theoretical findings can provide an explanation for successful applications of low-rank compression and give useful ideas for designing the sizes of small matrices.
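The one-third cost claim follows from simple parameter counting: a dense width-by-width map costs $w^2$ multiply-accumulates, while inserting a bias-free rank-$r$ bottleneck costs $2wr$. A quick sketch of that arithmetic (the function name is just for illustration):

```python
def factored_cost(width, rank):
    """MAC / parameter count of a dense width x width weight matrix
    versus two bias-free factors of shapes width x rank and rank x width."""
    dense = width * width
    factored = 2 * width * rank
    return dense, factored

# With the narrow layer at one-third of the nonlinear width,
# cost drops by exactly one-third: 2 * w * (w/3) = (2/3) * w^2.
dense, factored = factored_cost(width=300, rank=100)
```

Any rank below `width / 2` already saves computation; the paper's bound guarantees rank at most `width / 3`, hence at least a one-third reduction.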
Paperid:1506
Authors:Yunhao Tang · Kunhao Zheng · Gabriel Synnaeve · REMI MUNOS
Abstract:
In this work, we investigate the merits of explicitly optimizing for inference-time algorithmic performance during model training. We show how optimizing for inference-time performance can improve overall model efficacy. We consider generic inference-time objectives with $k$ samples, with a focus on pass@$k$ and majority voting as two main applications. With language model training on reasoning datasets, we showcase the performance tradeoff enabled by training with such objectives. When training on code generation tasks, we show that the approach significantly improves pass@$k$ objectives compared to the baseline method.
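For reference, the pass@$k$ metric targeted above is conventionally estimated with the standard unbiased estimator of Chen et al. (2021), $1 - \binom{n-c}{k}/\binom{n}{k}$, from $n$ sampled solutions of which $c$ are correct. This sketch shows the evaluation metric only, not the paper's training objective:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: the probability that at least one of k
    solutions drawn (without replacement) from n samples, c of them
    correct, passes."""
    if n - c < k:
        return 1.0                      # every size-k subset hits a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For instance, with 2 samples of which 1 is correct, pass@1 is 0.5, matching the intuition that a single random draw succeeds half the time.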
Paperid:1507
Authors:Shuo Chen · Chen Gong · Jun Li · Jian Yang
Abstract:
Measuring the similarity between data points plays a vital role in many popular representation learning tasks such as metric learning and contrastive learning. Most existing approaches utilize point-level distances to learn the point-to-point similarity between pairwise instances. However, since the finite number of training data points cannot fully cover the whole sample space consisting of an infinite number of points, the generalizability of the learned distance is usually limited by the sample size. In this paper, we thus extend the conventional form of data point to the new form of data ball with a predictable volume, so that we can naturally generalize the existing point-level distance to a new volume-aware distance (VAD) which measures the field-to-field geometric similarity. The learned VAD not only takes into account the relationship between observed instances but also uncovers the similarity among those unsampled neighbors surrounding the training data. This practice significantly enriches the coverage of sample space and thus improves the model generalizability. Theoretically, we prove that VAD tightens the error bound of traditional similarity learning and preserves crucial topological properties. Experiments on multi-domain data demonstrate the superiority of VAD over existing approaches in both supervised and unsupervised tasks.
Paperid:1508
Authors:Ziyad Benomar · Lorenzo Croissant · Spyros Angelopoulos · Vianney Perchet
Abstract:
One-max search is a classic problem in online decision-making, in which a trader acts on a sequence of revealed prices and accepts one of them irrevocably to maximise its profit. The problem has been studied both in probabilistic and in worst-case settings, notably through competitive analysis, and more recently in learning-augmented settings in which the trader has access to a prediction on the sequence. However, existing approaches either lack smoothness, or do not achieve optimal worst-case guarantees: they do not attain the best possible trade-off between the consistency and the robustness of the algorithm. We close this gap by presenting the first algorithm that simultaneously achieves both of these important objectives. Furthermore, we show how to leverage the obtained smoothness to provide an analysis of one-max search in stochastic learning-augmented settings which capture randomness in both the observed prices and the prediction.
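For context, the classical worst-case baseline for one-max search with prices known to lie in $[m, M]$ is the reservation-price rule: accept the first price at least $\sqrt{mM}$, which is $\sqrt{M/m}$-competitive. This is the textbook rule, not the paper's smoothed learning-augmented algorithm:

```python
import math

def one_max_search(prices, m, M):
    """Reservation-price rule for one-max search: accept the first
    revealed price >= sqrt(m*M); if none qualifies, the trader is
    forced to take the final price."""
    tau = math.sqrt(m * M)
    for p in prices:
        if p >= tau:
            return p
    return prices[-1]
```

With `m=1, M=100` the threshold is 10, so the trader accepts the first price of at least 10 and otherwise settles for the last one.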
Paperid:1509
Authors:Yang Chen · Long Yang · Yitao Liang · Zhouchen Lin
Abstract:
Low-Dimension-to-High-Dimension (LDHD) generalization, a subset of Out-of-Distribution (OOD) generalization, involves training on a low-dimensional subspace and testing in a high-dimensional space. Assuming instances are generated from latent variables reflecting problem scale, LDHD generalization captures the inherent scaling challenge of length generalization. We theoretically show that LDHD generalization is unattainable without appropriate inductive bias. Focusing on Boolean functions, we demonstrate that different architectures trained with (S)GD converge to min-degree interpolators w.r.t. different linearly independent sets, achieving LDHD generalization only when the target function aligns with this bias. From the perspective of LDHD generalization for length generalization, we explain the success of CoT in restructuring latent space for improved LDHD generalization. We further propose a principle for designing position embeddings to address both LDHD generalization and data format nuisances separately. Following the principle, we introduce RPE-Square, a novel embedding that enhances RPE to better handle data formats.
Paperid:1510
Authors:Ao Shen · Ming'zhi Yuan · Yingfan MA · Jie Du · qiao Huang · Manning Wang
Abstract:
Virtual screening is a critical step in drug discovery, aiming at identifying potential drugs that bind to a specific protein pocket from a large database of molecules. Traditional docking methods are time-consuming, while learning-based approaches supervised by high-precision conformational or affinity labels are limited by the scarcity of training data. Recently, a paradigm of feature alignment through contrastive learning has gained widespread attention. This method does not require explicit binding affinity scores, but it suffers from an overly simplistic construction of negative samples, which limits generalization to more difficult test cases. In this paper, we propose Drug-TTA, which leverages a large number of self-supervised auxiliary tasks to adapt the model to each test instance. Specifically, we incorporate the auxiliary tasks into both the training and the inference process via meta-learning to improve the performance of the primary task of virtual screening. Additionally, we design a multi-scale feature based Auxiliary Loss Balance Module (ALBM) to balance the auxiliary tasks to improve their efficiency. Extensive experiments demonstrate that Drug-TTA achieves state-of-the-art (SOTA) performance in all five zero-shot virtual screening tasks, showing an average improvement of 9.86\% in AUROC metric compared to the baseline without test-time adaptation.
Paperid:1511
Authors:Sijia Zhang · Shuli Zeng · Shaoang Li · Feng Wu · Shaojie Tang · Xiangyang Li
Abstract:
Many real-world applications, such as logistics, routing, scheduling, and production planning, involve dynamic systems that require continuous updates to solutions for new Mixed Integer Linear Programming (MILP) problems. These systems often require rapid updates to their solutions to accommodate slight modifications in constraints or objectives introduced by evolving conditions. While reoptimization techniques have been explored for Linear Programming (LP) and certain specific MILP problems, their effectiveness in addressing general MILP is limited. In this work, we propose a two-stage reoptimization framework for efficiently identifying high-quality feasible solutions. Specifically, we first utilize the historical solving process information to predict a high confidence solution space for modified MILPs, which is likely to contain high-quality solutions. Building on the prediction results, we fix a part of variables within the predicted intervals and apply the Thompson Sampling algorithm to determine which variables to fix. This is done by updating the Beta distributions based on the solutions obtained from the solver. Extensive experiments across nine reoptimization datasets show that our VP-OR outperforms the state-of-the-art methods, achieving higher-quality solutions under strict time limits.
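The Thompson Sampling step can be sketched with the standard Beta-Bernoulli machinery: each candidate variable carries a Beta posterior over "fixing it helped", and the variables with the highest posterior draws are fixed. The per-variable (success, failure) bookkeeping and top-score selection rule below are illustrative assumptions about the paper's procedure:

```python
import random

def thompson_fix(history, num_fix, seed=0):
    """Choose which variables to fix via Thompson Sampling.
    history[i] = (successes, failures) from past solver runs; each
    variable's score is drawn from Beta(1 + successes, 1 + failures)
    and the num_fix top scorers are fixed."""
    rng = random.Random(seed)
    scores = [rng.betavariate(1 + s, 1 + f) for s, f in history]
    ranked = sorted(range(len(history)), key=lambda i: -scores[i])
    return sorted(ranked[:num_fix])
```

After each solve, the chosen variables' success/failure counts are updated from the solution quality, sharpening the posteriors over time.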
Paperid:1512
Authors:Ke Liu · Hao Chen · Chunhua Shen
Abstract:
Developing models for protein-ligand interactions holds substantial significance for drug discovery. Supervised methods often fail due to the lack of labeled data for predicting protein-ligand binding energy, as with antibodies. Therefore, unsupervised approaches are needed to make full use of the unlabeled data. To tackle the problem, we propose an efficient, unsupervised protein-ligand binding energy prediction model via the conservation of energy (CEBind), which follows the physical laws. Specifically, given a protein-ligand complex, we randomly sample forces for each atom in the ligand. Then these forces are applied rigidly to the ligand to perturb its position, following the law of rigid body dynamics. Finally, CEBind predicts the energy of both the unperturbed complex and the perturbed complex. The energy gap between the two complexes equals the work of the outer forces, following the law of conservation of energy. Extensive experiments are conducted on unsupervised protein-ligand binding energy prediction benchmarks, comparing CEBind with previous works. Empirical results and theoretical analysis demonstrate that CEBind is more efficient and outperforms previous unsupervised models on benchmarks.
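The conservation-of-energy supervision admits a compact sketch: the work done by the sampled forces along each atom's displacement equals the energy gap between the perturbed and unperturbed complexes, so the squared violation of that identity is a label-free loss. This simplification takes the rigid-body displacements as given (CEBind derives them from rigid-body dynamics), and the function names are hypothetical:

```python
import numpy as np

def work_of_forces(forces, displacements):
    """Total work W = sum_i F_i . d_i done by the sampled per-atom forces;
    by conservation of energy, W = E(perturbed) - E(original)."""
    return float(np.sum(np.asarray(forces) * np.asarray(displacements)))

def conservation_loss(e_orig_pred, e_pert_pred, forces, displacements):
    """Penalize predicted energies whose gap violates the work done."""
    w = work_of_forces(forces, displacements)
    return (e_pert_pred - e_orig_pred - w) ** 2
```

The loss is zero exactly when the predicted energy gap matches the physical work, which is what ties the two predictions together without any binding-affinity labels.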
Paperid:1513
Authors:Yifan Hou · Buse Giledereli · Yilei Tu · Mrinmaya Sachan
Abstract:
Visual language is a system of communication that conveys information through symbols, shapes, and spatial arrangements. Diagrams are a typical example of a visual language depicting complex concepts and their relationships in the form of an image. The symbolic nature of diagrams presents significant challenges for building models capable of understanding them. Yet, recent studies seem to suggest that Large Vision-Language Models (LVLMs) can even tackle complex reasoning tasks involving diagrams. In this paper, we investigate this phenomenon by developing a comprehensive test suite to evaluate the diagram comprehension capability of LVLMs. Our test suite uses a variety of questions focused on concept entities and their relationships over a set of synthetic as well as real diagrams across several domains to evaluate the recognition and reasoning abilities of models. Our evaluation of six LVLMs shows that while these models can accurately identify and reason about entities, their ability to understand relationships is notably limited. Further testing reveals that the decent performance on diagram understanding largely stems from leveraging their background knowledge as shortcuts to identify and reason about the relational information. Thus, we conclude that LVLMs have a limited capability for genuine diagram understanding, and their impressive performance in diagram reasoning is an illusion emanating from other confounding factors, such as the background knowledge in the models.
Paperid:1514
Authors:Ruixiang Zhang · Shuangfei Zhai · Yizhe Zhang · James Thornton · Zijing Ou · Joshua M Susskind · Navdeep Jaitly
Abstract:
Discrete diffusion is a promising framework for modeling and generating discrete data. In this work, we present Target Concrete Score Matching (TCSM), a novel and versatile objective for training and fine-tuning discrete diffusion models. TCSM provides a general framework with broad applicability. It supports pre-training discrete diffusion models directly from data samples, and many existing discrete diffusion approaches naturally emerge as special cases of our more general TCSM framework. Furthermore, the same TCSM objective extends to post-training of discrete diffusion models, including fine-tuning using reward functions or preference data, and distillation of knowledge from pre-trained autoregressive models. These new capabilities stem from the core idea of TCSM, estimating the concrete score of the target distribution, which resides in the original (clean) data space. This allows seamless integration with reward functions and pre-trained models, which inherently only operate in the clean data space rather than the noisy intermediate spaces of diffusion processes. Our experiments on language modeling tasks demonstrate that TCSM matches or surpasses current methods. Additionally, TCSM is versatile, applicable to both pre-training and post-training scenarios, offering greater flexibility and sample efficiency.
Paperid:1515
Authors:Jingyue Gao · Runji Lin · Keming Lu · Bowen Yu · Junyang Lin · Jianyu Chen
Abstract:
Large Language Models (LLMs) exhibit strong potential in mathematical reasoning, yet their effectiveness is often limited by a shortage of high-quality queries. This limitation necessitates scaling up computational responses through self-generated data, yet current methods struggle due to spuriously correlated data caused by ineffective exploration across all reasoning stages. To address this challenge, we introduce MARGE (Improving Math Reasoning with Guided Exploration), a novel method that enhances mathematical reasoning through hit-guided exploration. MARGE systematically explores intermediate reasoning states derived from self-generated solutions, enabling adequate exploration and improved credit assignment throughout the reasoning process. Notably, MARGE improves both single-shot accuracy and exploration diversity, mitigating a common trade-off in alignment methods. These results demonstrate MARGE's effectiveness in enhancing mathematical reasoning capabilities and unlocking the potential of scaling self-generated training data.
Paperid:1516
Authors:Heman Shakeri · Justin Lee · Behnaz Moradi-Jamei
Abstract:
Modeling the evolution of high-dimensional systems from limited snapshot observations at irregular time points poses a significant challenge in quantitative biology and related fields. Traditional approaches often rely on dimensionality reduction techniques, which can oversimplify the dynamics and fail to capture critical transient behaviors in non-equilibrium systems. We present Multi-Marginal Stochastic Flow Matching (MMSFM), a novel extension of simulation-free score and flow matching methods to the multi-marginal setting, enabling the alignment of high-dimensional data measured at non-equidistant time points without reducing dimensionality. The use of measure-valued splines enhances robustness to irregular snapshot timing, and score matching prevents overfitting in high-dimensional spaces. We validate our framework on several synthetic and benchmark datasets and apply it to single-cell perturbation data from melanoma cell lines and gene expression data collected at uneven time points.
Paperid:1517
Authors:Tim Kim · Thomas Luo · Tankut Can · Kamesh Krishnamurthy · Jonathan Pillow · Carlos Brody
Abstract:
Computations involved in processes such as decision-making, working memory, and motor control are thought to emerge from the dynamics governing the collective activity of neurons in large populations. But the estimation of these dynamics remains a significant challenge. Here we introduce Flow-field Inference from Neural Data using deep Recurrent networks (FINDR), an unsupervised deep learning method that can infer low-dimensional nonlinear stochastic dynamics underlying neural population activity. Using population spike train data from frontal brain regions of rats performing an auditory decision-making task, we demonstrate that FINDR outperforms existing methods in capturing the heterogeneous responses of individual neurons. We further show that FINDR can discover interpretable low-dimensional dynamics when it is trained to disentangle task-relevant and irrelevant components of the neural population activity. Importantly, the low-dimensional nature of the learned dynamics allows for explicit visualization of flow fields and attractor structures. We suggest FINDR as a powerful method for revealing the low-dimensional task-relevant dynamics of neural populations and their associated computations.
Paperid:1518
Authors:Xianliang Xu · Ye Li · Zhongyi Huang
Abstract:
In this paper, we derive refined generalization bounds for the Deep Ritz Method (DRM) and Physics-Informed Neural Networks (PINNs). For the DRM, we focus on two prototype elliptic partial differential equations (PDEs): Poisson equation and static Schrödinger equation on the $d$-dimensional unit hypercube with the Neumann boundary condition. Furthermore, sharper generalization bounds are derived based on the localization techniques under the assumptions that the exact solutions of the PDEs lie in the Barron spaces or the general Sobolev spaces. For the PINNs, we investigate the general linear second order elliptic PDEs with Dirichlet boundary condition using the local Rademacher complexity in the multi-task learning setting. Finally, we discuss the generalization error in the setting of over-parameterization when solutions of PDEs belong to Barron space.
Paperid:1519
Authors:Guy Kornowski · Daogao Liu · Kunal Talwar
Abstract:
We study differentially private (DP) optimization algorithms for stochastic and empirical objectives which are neither smooth nor convex, and propose methods that return a Goldstein-stationary point with sample complexity bounds that improve on existing works. We start by providing a single-pass $(\epsilon,\delta)$-DP algorithm that returns an $(\alpha,\beta)$-stationary point as long as the dataset is of size $\widetilde{\Omega}\left(1/\alpha\beta^{3}+d/\epsilon\alpha\beta^{2}+d^{3/4}/\epsilon^{1/2}\alpha\beta^{5/2}\right)$, which is $\Omega(\sqrt{d})$ times smaller than required by the algorithm of Zhang et al. (2024) for this task, where $d$ is the dimension. We then provide a multi-pass polynomial time algorithm which further improves the sample complexity to $\widetilde{\Omega}\left(d/\beta^2+d^{3/4}/\epsilon\alpha^{1/2}\beta^{3/2}\right)$, by designing a sample efficient ERM algorithm, and proving that Goldstein-stationary points generalize from the empirical loss to the population loss.
Paperid:1520
Authors:Yuankai Luo · Lei Shi · Xiao-Ming Wu
Abstract:
Message-passing Graph Neural Networks (GNNs) are often criticized for their limited expressiveness, issues like over-smoothing and over-squashing, and challenges in capturing long-range dependencies, while Graph Transformers (GTs) are considered superior due to their global attention mechanisms. Literature frequently suggests that GTs outperform GNNs, particularly in graph-level tasks such as graph classification and regression. In this study, we explore the untapped potential of GNNs through an enhanced framework, GNN+, which integrates six widely used techniques: edge feature integration, normalization, dropout, residual connections, feed-forward networks, and positional encoding, to effectively tackle graph-level tasks. We conduct a systematic evaluation of three classic GNNs—GCN, GIN, and GatedGCN—enhanced by the GNN+ framework across 14 well-known graph-level datasets. Our results show that, contrary to the prevailing belief, classic GNNs excel in graph-level tasks, securing top three rankings across all datasets and achieving first place in eight, while also demonstrating greater efficiency than GTs. This highlights the potential of simple GNN architectures, challenging the belief that complex mechanisms in GTs are essential for superior graph-level performance. We provide our code in the supplementary materials for reproducibility.
Paperid:1521
Authors:Nay Myat Min · Long H. Pham · Yige Li · Jun Sun
Abstract:
Large Language Models (LLMs) are vulnerable to backdoor attacks that manipulate outputs via hidden triggers. Existing defense methods—designed for vision/text classification tasks—fail for text generation. We propose Internal Consistency Regularization (CROW), a defense leveraging the observation that backdoored models exhibit unstable layer-wise hidden representations when triggered, while clean models show smooth transitions. CROW enforces consistency across layers via adversarial perturbations and regularization during fine-tuning, neutralizing backdoors without requiring clean reference models or trigger knowledge—only a small clean dataset. Experiments across Llama-2 (7B, 13B), CodeLlama (7B, 13B), and Mistral-7B demonstrate CROW’s effectiveness: it achieves significant reductions in attack success rates across diverse backdoor strategies (sentiment steering, targeted refusal, code injection) while preserving generative performance. CROW’s architecture-agnostic design enables practical deployment.
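The layer-wise consistency idea can be sketched as a penalty on abrupt transitions between consecutive layers' hidden states. A minimal NumPy version using cosine similarity (the exact similarity measure and weighting in CROW are assumptions here):

```python
import numpy as np

def layer_consistency_loss(hidden_states, eps=1e-8):
    """Mean (1 - cosine similarity) over consecutive layers' hidden
    representations. Smooth layer-to-layer transitions (clean behavior)
    give a loss near zero; the abrupt jumps characteristic of triggered
    backdoored models are penalized."""
    loss = 0.0
    for h1, h2 in zip(hidden_states, hidden_states[1:]):
        cos = np.dot(h1, h2) / (np.linalg.norm(h1) * np.linalg.norm(h2) + eps)
        loss += 1.0 - cos
    return loss / (len(hidden_states) - 1)
```

During fine-tuning, this term would be added to the task loss (and, per the abstract, evaluated under adversarial perturbations) so that the model cannot maintain the unstable internal trajectory a trigger relies on.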
Paperid:1522
Authors:Hee Suk Yoon · Eunseop Yoon · Mark Hasegawa-Johnson · Sungwoong Kim · Chang Yoo
Abstract:
Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO), have emerged as alternatives to Reinforcement Learning from Human Feedback (RLHF) for preference learning in Large Language Models (LLMs). DAAs eliminate the need for a separate reward model by using the joint probability of sentences as an implicit reward, optimizing the margin between preferred and unpreferred responses. However, these methods uniformly adjust token probabilities, regardless of their relevance to preference, leading to suboptimal performances. Recent token-level approaches attempt to address this issue but rely on trained credit-assignment models or AI annotators, raising concerns about quality, reliability, and computational overhead. In this paper, we propose ConfPO, which identifies preference-critical tokens based on the training policy’s confidence, thus requiring no additional models or compute for token selection. By focusing on a smaller set of high-impact tokens, ConfPO not only enhances alignment outcomes but also mitigates overoptimization (i.e., reward hacking) by more efficiently using the KL divergence budget. Experimental results on prominent alignment benchmarks—such as AlpacaEval 2 and Arena-Hard—across various LLMs show that ConfPO consistently outperforms DAAs that uniformly optimize over all tokens for preference learning with no computational overhead.
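One way to picture confidence-based token selection is to flag the tokens where the policy assigns the lowest log-probability. The specific rule below (lowest log-prob, fixed fraction) is an illustrative assumption about what "based on the training policy's confidence" means, not ConfPO's published criterion:

```python
def confident_token_mask(token_logprobs, frac=0.5):
    """Mark the frac least-confident tokens (lowest log-probability under
    the training policy) as preference-critical; the DAA loss would then
    be applied only to masked positions."""
    k = max(1, int(len(token_logprobs) * frac))
    ranked = sorted(range(len(token_logprobs)), key=lambda i: token_logprobs[i])
    critical = set(ranked[:k])
    return [i in critical for i in range(len(token_logprobs))]
```

Because the log-probabilities are already computed in the DAA forward pass, such a selection adds no extra models or compute, matching the abstract's claim.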
Paperid:1523
Authors:Yunjuan Wang · Raman Arora
Abstract:
Despite the remarkable achievements of large foundation models across various tasks, they remain vulnerable to security threats, including backdoor attacks. By injecting poisoned data containing backdoor triggers during training, adversaries can manipulate model predictions by exploiting these triggers. While existing research focuses on designing effective backdoor attacks and evaluating them empirically, it often lacks a rigorous theoretical understanding of when and why such attacks succeed. In this work, we investigate backdoor attacks targeting the token selection process within attention mechanisms—a core component of transformer-based architectures. We prove that single-head self-attention transformers can interpolate poisoned training data through gradient descent. Furthermore, we demonstrate that when the injected poisoned training data contains sufficiently strong backdoor triggers, but is not overly dominant, attackers can successfully manipulate model predictions. Our analysis reveals the dynamics of how adversaries manipulate token selection to compromise model outputs and identifies the theoretical conditions enabling these attacks. We support our theoretical findings with an empirical study on synthetic data.
Paperid:1524
Authors:Xinpeng Dong · Min Zhang · Didi Zhu · Ye Jian · zhang keli · Aimin Zhou · Fei Wu · Kun Kuang
Abstract:
Pretrained vision-language models (VLMs) have revolutionized the field of machine learning, demonstrating exceptional performance across a wide range of tasks. However, their robustness remains vulnerable to the spurious-correlation problem. Existing works often involve fine-tuning the model with labeled data or relying on large language models (LLMs) to generate more complex prompts. Although effective to some extent, these methods introduce new challenges, including additional computational costs and dependence on the quality of prompts without fully utilizing the vision modality. To address these limitations, we propose a novel method named ERICT to Enhance model Robustness by Identifying Concept Tokens. ERICT mitigates spurious correlation directly in the inference stage and comprises two key steps: (1) Identify concept tokens capturing invariant features through auxiliary prompts to generate a token-level mask. (2) Apply the mask to the attention weights of the CLS token in the vision encoder to help the model focus on the relevant image region. Extensive experiments show that ERICT significantly improves the overall performance including that of the worst group, and achieves new state-of-the-art results.
Paperid:1525
Authors:Cansu Sancaktar · Christian Gumbsch · Andrii Zadaianchuk · Pavel Kolev · Georg Martius
Abstract:
Exploration is a cornerstone of reinforcement learning (RL). Intrinsic motivation attempts to decouple exploration from external, task-based rewards. However, established approaches to intrinsic motivation that follow general principles such as information gain often only uncover low-level interactions. In contrast, children's play suggests that they engage in meaningful high-level behavior by imitating or interacting with their caregivers. Recent work has focused on using foundation models to inject these semantic biases into exploration. However, these methods often rely on unrealistic assumptions, such as language-embedded environments or access to high-level actions. We propose SEmaNtically Sensible ExploratIon (SENSEI), a framework to equip model-based RL agents with an intrinsic motivation for semantically meaningful behavior. SENSEI distills a reward signal of interestingness from Vision Language Model (VLM) annotations, enabling an agent to predict these rewards through a world model. Using model-based RL, SENSEI trains an exploration policy that jointly maximizes semantic rewards and uncertainty. We show that in both robotic and video game-like simulations, SENSEI discovers a variety of meaningful behaviors from image observations and low-level actions. SENSEI provides a general tool for learning from foundation model feedback, a crucial research direction as VLMs become more powerful.
Paperid:1526
Authors:Amr Alkhatib · Roman Bresson · Henrik Boström · Michalis Vazirgiannis
Abstract:
Shapley values have several desirable, theoretically well-supported properties for explaining black-box model predictions. Traditionally, Shapley values are computed post-hoc, leading to additional computational cost at inference time. To overcome this, a novel method, called ViaSHAP, is proposed that learns a function to compute Shapley values, from which the predictions can be derived directly by summation. Two approaches to implementing the proposed method are explored: one based on the universal approximation theorem and the other on the Kolmogorov-Arnold representation theorem. Results from a large-scale empirical investigation are presented, showing that ViaSHAP using Kolmogorov-Arnold Networks performs on par with state-of-the-art algorithms for tabular data. It is also shown that the explanations of ViaSHAP are significantly more accurate than those of the popular approximator FastSHAP on both tabular data and images.
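The core contract of the approach, that the prediction is derived directly by summing the learned Shapley-style values, can be shown in a few lines. The linear per-feature map below is a hypothetical stand-in for the paper's KAN/MLP implementations; only the predict-by-summation structure reflects the abstract.

```python
import random

class ViaSHAPSketch:
    """Toy version of the ViaSHAP idea: a learned function outputs one
    Shapley-style value per input feature, and the model's prediction is
    simply their sum plus a bias. The linear per-feature map below is a
    stand-in for the paper's Kolmogorov-Arnold or MLP implementations."""

    def __init__(self, n_features, seed=0):
        rng = random.Random(seed)
        self.w = [rng.uniform(-1.0, 1.0) for _ in range(n_features)]
        self.bias = 0.0

    def shapley_values(self, x):
        # One contribution per feature; in the paper this is a deep network.
        return [wi * xi for wi, xi in zip(self.w, x)]

    def predict(self, x):
        # Prediction is derived directly by summation of the values, so
        # the explanation is exact for the model by construction.
        return self.bias + sum(self.shapley_values(x))
```

The appeal of this structure is that explanation and prediction are computed in one forward pass, removing the post-hoc Shapley estimation step.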
Paperid:1527
Authors:Bokun Wang · Tianbao Yang
Abstract:
This paper studies a class of convex Finite-sum Coupled Compositional Optimization (cFCCO) problems with applications including group distributionally robust optimization (GDRO) and learning with imbalanced data. To better address these problems, we introduce an efficient single-loop primal-dual block-coordinate stochastic algorithm called ALEXR. The algorithm employs block-coordinate stochastic mirror ascent with extrapolation for the dual variable and stochastic proximal gradient descent updates for the primal variable. We establish the convergence rates of ALEXR in both convex and strongly convex cases under smoothness and non-smoothness conditions of involved functions, which not only improve the best rates in previous works on smooth cFCCO problems but also expand the realm of cFCCO for solving more challenging non-smooth problems such as the dual form of GDRO. Finally, we derive lower complexity bounds, demonstrating the (near-)optimality of ALEXR within a broad class of stochastic algorithms for cFCCO. Experimental results on GDRO and partial Area Under the ROC Curve (pAUC) maximization demonstrate the promising performance of our algorithm.
Paperid:1528
Authors:Seng Pei Liew · Takuya Kato · Sho Takase
Abstract:
Pretraining large language models (LLMs) is resource-intensive, often requiring months of training time even with high-end GPU clusters. There are two approaches to mitigating such computational demands: reusing smaller models to train larger ones (upcycling), and training computationally efficient models like mixture-of-experts (MoE). In this paper, we study the upcycling of LLMs to MoE models, whose scaling behavior remains underexplored. Through extensive experiments, we identify empirical scaling laws that describe how performance depends on dataset size and model configuration. Particularly, we show that, while scaling these factors improves performance, there is a novel interaction term between the dense and upcycled training datasets that limits the efficiency of upcycling at large computational budgets. Based on these findings, we provide guidance on scaling upcycling, and establish conditions under which upcycling outperforms from-scratch training within budget constraints.
Paperid:1529
Authors:Yejiang Wang · Yuhai Zhao · Zhengkui Wang · Ling Li · Jiapu Wang · Fangting Li · Miaomiao Huang · Shirui Pan · Xingwei Wang
Abstract:
Node equivalence is common in graphs, encompassing automorphic equivalence (preserving adjacency under node permutations) and attribute equivalence (nodes with identical attributes). Despite their importance for learning node representations, these equivalences are largely ignored by existing graph models. To bridge this gap, we propose a GrAph self-supervised Learning framework with Equivalence (GALE) and analyze its connections to existing techniques. Specifically, we: 1) unify automorphic and attribute equivalence into a single equivalence class; 2) enforce the equivalence principle to make representations within the same class more similar while separating those across classes; 3) introduce approximate equivalence classes with linear time complexity to address the NP-hardness of exact automorphism detection and handle node-feature variation; 4) analyze existing graph encoders, noting limitations in message passing neural networks and graph transformers regarding equivalence constraints; 5) show that graph contrastive learning paradigms are a degenerate form of equivalence constraint; and 6) demonstrate that GALE achieves superior performance over SOTA baselines.
Paperid:1530
Authors:Dongmin Bang · Inyoung Sung · Yinhua Piao · Sangseon Lee · Sun Kim
Abstract:
The advent of generative AI now enables large-scale $\textit{de novo}$ design of molecules, but identifying viable drug candidates among them remains an open problem. Existing drug-likeness prediction methods often rely on ambiguous negative sets or purely structural features, limiting their ability to accurately classify drugs from non-drugs. In this work, we introduce BounDrE: a novel modeling of drug-likeness as a compact space surrounding approved drugs through a dynamic one-class boundary approach. Specifically, we enrich the chemical space through biomedical knowledge alignment, and then iteratively tighten the drug-like boundary by pushing non-drug-like compounds outside via an Expectation-Maximization (EM)-like process. Empirically, BounDrE achieves 10\% F1-score improvement over the previous state-of-the-art and demonstrates robust cross-dataset performance, including zero-shot toxic compound filtering. Additionally, we showcase its effectiveness through comprehensive case studies in large-scale $\textit{in silico}$ screening. Our codes and constructed benchmark data under various schemes are provided at: https://anonymous.4open.science/r/boundr_e.
Paperid:1531
Authors:Sang-gil Lee · Zhifeng Kong · ARUSHI GOEL · Sungwon Kim · Rafael Valle · Bryan Catanzaro
Abstract:
Recent years have seen significant progress in Text-To-Audio (TTA) synthesis, enabling users to enrich their creative workflows with synthetic audio generated from natural language prompts. Despite this progress, the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks are not well understood. With the purpose of providing a holistic understanding of the design space of TTA models, we set up a large-scale empirical experiment focused on diffusion and flow matching models. Our contributions include: 1) AF-Synthetic, a large dataset of high-quality synthetic captions obtained from an audio understanding model; 2) a systematic comparison of different architectural, training, and inference design choices for TTA models; 3) an analysis of sampling methods and their Pareto curves with respect to generation quality and inference speed. We leverage the knowledge obtained from this extensive analysis to propose our best model dubbed Elucidated Text-To-Audio (ETTA). When evaluated on AudioCaps and MusicCaps, ETTA provides improvements over the baselines trained on publicly available data, while being competitive with models trained on proprietary data. Finally, we show ETTA's improved ability to generate creative audio following complex and imaginative captions, a task that is more challenging than current benchmarks.
Paperid:1532
Authors:Fanxu Meng · Pingzhi Tang · Fan Jiang · Muhan Zhang
Abstract:
To adapt a well-trained large model to downstream tasks, learning within its original latent space by leveraging linear combinations of its basis vectors ensures stable training without compromising the model's capabilities. However, constructing orthonormal bases from a matrix typically requires a transfer matrix, which significantly increases storage and computational overhead. In this paper, we propose a Cross-Layer approach to absorb $W_Q$ and $W_K$ into $W_{QK}$, and $W_V$ and $W_O$ into $W_{VO}$, where the ranks of $W_{QK}$ and $W_{VO}$ are bounded by the head dimension $d$. By performing singular value decomposition on $W_{QK}$ and $W_{VO}$ and removing singular vectors corresponding to zero singular values, we effectively orthonormalize $W_Q$, $W_K$, $W_V$, and $W_O$ without requiring transfer matrices. This operation reduces the encoder attention parameters of Whisper-large-v3 by 46.42\% without requiring additional training. To enable efficient fine-tuning, we perform head-in or cross-head QR decomposition on $W_Q$, $W_K$, $W_V$, and $W_O$, fine-tuning only the matrices $R_Q$, $R_K$, $R_V$, and $R_O$. This approach allows a full-rank update through linear combinations of all orthonormal vectors. We evaluate our method by fine-tuning LLaMA-2-7B on eight commonsense reasoning datasets. The results demonstrate that our approach outperforms LoRA by 5.4\% and DoRA by 3.7\%, highlighting the effectiveness of the proposed method, while forgetting less previous knowledge when learning new knowledge.
Paperid:1533
Authors:Tennessee Hickling · Dennis Prangle
Abstract:
Normalizing flows are a flexible class of probability distributions, expressed as transformations of a simple base distribution. A limitation of standard normalizing flows is representing distributions with heavy tails, which arise in applications to both density estimation and variational inference. A popular current solution to this problem is to use a heavy tailed base distribution. We argue this can lead to poor performance due to the difficulty of optimising neural networks, such as normalizing flows, under heavy tailed input. We propose an alternative, "tail transform flow" (TTF), which uses a Gaussian base distribution and a final transformation layer which can produce heavy tails. Experimental results show this approach outperforms current methods, especially when the target distribution has large dimension or tail weight.
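The TTF idea of keeping a Gaussian base and pushing heavy tails into the final layer can be sketched with a fixed scalar map. The Cauchy-quantile transform below is an assumption for illustration; the paper's transform is learned and parameterised by tail weight.

```python
import math

def gaussian_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tail_transform(z):
    """Illustrative final layer in the spirit of TTF: push a Gaussian
    base sample through a map whose output has heavy (here Cauchy)
    tails. The paper's transform is parameterised and learned; this
    fixed Cauchy quantile is only a stand-in."""
    u = gaussian_cdf(z)                   # z ~ N(0,1)  ->  u in (0,1)
    return math.tan(math.pi * (u - 0.5))  # inverse Cauchy CDF: heavy tails
```

The map is strictly increasing and odd, so, like any flow layer, it is invertible; a moderate Gaussian input such as z = 3 already lands far out in the Cauchy tail.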
Paperid:1534
Authors:Yihang Wang · Yuying Qiu · Peng Chen · Yang Shu · Zhongwen Rao · Lujia Pan · Bin Yang · Chenjuan Guo
Abstract:
Existing works on general time series forecasting build foundation models with heavy model parameters through large-scale multi-source pre-training. These models achieve superior generalization ability across various datasets at the cost of significant computational burdens and limitations in resource-constrained scenarios. This paper introduces LightGTS, a lightweight general time series forecasting model designed from the perspective of consistent periodical modeling. To handle diverse scales and intrinsic periods in multi-source pre-training, we introduce Periodical Tokenization, which extracts consistent periodic patterns across different datasets with varying scales. To better utilize the periodicity in the decoding process, we further introduce Periodical Parallel Decoding, which leverages historical tokens to improve forecasting. Based on the two techniques above, which fully leverage the inductive bias of periods inherent in time series, LightGTS uses a lightweight model to achieve outstanding performance on general time series forecasting. It achieves state-of-the-art forecasting performance on 9 real-world benchmarks in both zero-shot and full-shot settings, with much better efficiency compared with existing time series foundation models.
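The two-step idea behind Periodical Tokenization, estimate a series' period and then patch the series into period-aligned tokens, can be sketched as follows. The autocorrelation-based detector and the non-overlapping patching are illustrative assumptions, not the paper's exact procedure.

```python
def detect_period(series, max_lag=None):
    """Estimate the dominant period as the autocorrelation-maximising lag.
    A simple stand-in: the abstract does not specify how LightGTS's
    Periodical Tokenization actually estimates periodicity."""
    n = len(series)
    max_lag = max_lag or n // 2
    mean = sum(series) / n
    c = [x - mean for x in series]  # centre the series

    def acf(lag):
        return sum(c[i] * c[i + lag] for i in range(n - lag)) / (n - lag)

    return max(range(2, max_lag + 1), key=acf)

def periodical_tokenize(series, period):
    """Cut the series into period-aligned, non-overlapping patches
    ("tokens"); any incomplete trailing patch is dropped."""
    return [series[i:i + period]
            for i in range(0, len(series) - period + 1, period)]
```

Aligning patch length to the detected period is what makes token sequences from differently sampled datasets comparable, which is the property the abstract attributes to this tokenization.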
Paperid:1535
Authors:Bo Li · Fangxiao WANG · Xing Shiji
Abstract:
We study the fair scheduling of jobs among groups of (unrelated) machines and focus on the maximin share (MMS) fairness at the group level. The problem was first introduced by Li et al. [NeurIPS 2023], where each group consists of a number of identical machines (or identical up to different speeds), and the cost of a group is determined by the minimum makespan on completing all jobs assigned to it. The case where the machines within each group are unrelated was left as an open problem. In this paper, we first resolve this problem and design a polynomial-time algorithm that computes a 2-approximate MMS allocation via linear programming techniques. We complement this result with a hard instance, showing that no algorithm can be better than $(2-\frac{1}{n})$-approximate MMS, where $n$ is the number of machines. Thus the approximation ratio 2 is asymptotically tight. When the groups consist of identical machines, we improve the approximation ratio to $\frac{4}{3}$.
Paperid:1536
Authors:Vignesh Gopakumar · Ander Gray · Lorenzo Zanisi · Timothy Nunn · Stanislas Pamela · Daniel Giles · Matt Kusner · Marc Deisenroth
Abstract:
Neural PDEs offer efficient alternatives to computationally expensive numerical PDE solvers for simulating complex physical systems. However, their lack of robust uncertainty quantification (UQ) limits deployment in critical applications. We introduce a model-agnostic, physics-informed conformal prediction (CP) framework that provides guaranteed uncertainty estimates without requiring labelled data. By utilising a physics-based approach, we are able to quantify and calibrate the model's inconsistencies with the PDE rather than the uncertainty arising from the data. Our approach uses convolutional layers as finite-difference stencils and leverages physics residual errors as nonconformity scores, enabling data-free UQ with marginal and joint coverage guarantees across prediction domains for a range of complex PDEs. We further validate the efficacy of our method on neural PDE models for plasma modelling and shot design in fusion reactors.
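The data-free calibration recipe, physics residuals as nonconformity scores combined with a standard conformal quantile, can be sketched in a few lines. The 1-D Poisson residual and grid details below are illustrative assumptions, not the paper's setup; the finite differences play the role of its convolutional stencils.

```python
import math

def conformal_quantile(scores, alpha=0.1):
    """Finite-sample-valid (1 - alpha) quantile of nonconformity scores,
    using the standard ceil((n + 1)(1 - alpha)) conformal rank."""
    s = sorted(scores)
    n = len(s)
    rank = math.ceil((n + 1) * (1 - alpha))
    return s[min(rank, n) - 1]

def physics_residual_scores(u_pred, dx, forcing):
    """Toy 1-D Poisson residual |u'' - f| as a data-free nonconformity
    score: it measures the surrogate's inconsistency with the PDE, not
    its distance to labels, so no ground-truth solutions are needed."""
    return [abs((u_pred[i - 1] - 2 * u_pred[i] + u_pred[i + 1]) / dx**2
                - forcing[i])
            for i in range(1, len(u_pred) - 1)]
```

A prediction band of half-width `conformal_quantile(scores)` around the residual then covers unseen physics errors at the nominal rate, under the usual exchangeability assumption of conformal prediction.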
Paperid:1537
Authors:Juan Correa · Elias Bareinboim
Abstract:
Graphical models have been widely used as parsimonious encoders of constraints of the underlying probability models. When organized in a structured way, these models can facilitate the derivation of nontrivial constraints, the inference of quantities of interest, and the optimization of their estimands. In particular, causal diagrams allow for the efficient representation of structural constraints of the underlying causal system. In this paper, we introduce an efficient graphical construction called Ancestral Multi-world Networks that is sound and complete for reading counterfactual independences from a causal diagram using d-separation. Moreover, we introduce the counterfactual (ctf-) calculus, which can be used to transform counterfactual quantities using three rules licensed by the constraints encoded in the diagram. This result generalizes Pearl’s celebrated do-calculus from interventional to counterfactual reasoning.
Paperid:1538
Authors:Hao Jiang · Qi Liu · Rui Li · Shengyu Ye · Shijin Wang
Abstract:
Large language models have been successfully applied to programming assistance tasks, such as code completion, code insertion, and instructional code editing. However, these applications remain insufficiently automated and struggle to effectively integrate various types of information during the programming process, including coding history, code context, and user instructions. In this work, we propose a new framework that comprehensively integrates these information sources, collect data to train our models, and evaluate their performance. First, to thoroughly evaluate how well models align with different types of information and the quality of their outputs, we introduce a new benchmark, APEval (Assist Programming Eval), to comprehensively assess the performance of models in programming assistance tasks. Then, for data collection, we develop a data generation pipeline, Programming-Instruct, which synthesizes training data from diverse sources, such as GitHub and online judge platforms. This pipeline can automatically generate various types of messages throughout the programming process. Finally, using this pipeline, we generate 219K samples, fine-tune multiple models, and develop the CursorCore series. We show that CursorCore outperforms other models of comparable size. This framework unifies applications such as inline chat and automated editing, and contributes to the advancement of coding assistants.
Paperid:1539
Authors:Victoria Lin · Louis-Philippe Morency · Eli Ben-Michael
Abstract:
As language technologies become widespread, it is important to understand how changes in language affect reader perceptions and behaviors. These relationships may be formalized as the isolated causal effect of some focal language-encoded intervention (e.g., misinformation) on an external outcome (e.g., readers' beliefs). In this paper, we introduce a formal estimation framework for isolated causal effects of language. We show that a core challenge of estimating isolated effects is the need to approximate all non-focal language outside of the intervention. Drawing on the principle of omitted variable bias, we provide measures for evaluating the quality of both non-focal language approximations and isolated effect estimates themselves. We find that poor approximation of non-focal language can lead to bias in the corresponding isolated effect estimates due to omission of relevant variables, and we show how to assess the sensitivity of effect estimates to such bias along the two key axes of fidelity and overlap. In experiments on semi-synthetic and real-world data, we validate the ability of our framework to correctly recover isolated effects and demonstrate the utility of our proposed measures.
Paperid:1540
Authors:Kushagra Pandey · Farrin Marouf Sofian · Felix Draxler · Theofanis Karaletsos · Stephan Mandt
Abstract:
Diffusion models exhibit excellent sample quality, but existing guidance methods often require additional model training or are limited to specific tasks. We revisit guidance in diffusion models from the perspective of variational inference and control, introducing Diffusion Trajectory Matching (DTM) that enables guiding pretrained diffusion trajectories to satisfy a terminal cost. DTM unifies a broad class of guidance methods and enables novel instantiations. We introduce a new method within this framework that achieves state-of-the-art results on various image inverse problems without requiring additional model training or modifications. For instance, in ImageNet non-linear deblurring, our model achieves an FID score of 34.31, significantly improving over the best pretrained-method baseline (FID 78.07).
Paperid:1541
Authors:Hai-Long Sun · Da-Wei Zhou · Yang Li · Shiyin Lu · Chao Yi · Qing-Guo Chen · Zhao Xu · Weihua Luo · Kaifu Zhang · De-Chuan Zhan · Han-Jia Ye
Abstract:
The rapid development of Multimodal Large Language Models (MLLMs), such as GPT-4, marks a significant step toward artificial general intelligence. Existing methods typically align vision encoders with LLMs via supervised fine-tuning (SFT), but this often deteriorates their ability to handle multiple languages as training progresses. We empirically observe that imbalanced SFT datasets, largely English-centric, degrade performance on non-English languages due to the failure in multilingual token alignment. To address this, we propose PARROT, a novel approach that leverages textual guidance for visual token alignment at the language level. PARROT conditions visual tokens on diverse language inputs and uses Mixture-of-Experts (MoE) to align multilingual tokens. By computing cross-attention between initial visual features and textual embeddings, we select the most relevant experts, converting visual tokens into language-specific representations. Additionally, we introduce the Massive Multilingual Multimodal Benchmark (MMMB), a new benchmark comprising 6 languages, 15 categories, and 12,000 questions, to assess multilingual capabilities. Our method achieves state-of-the-art performance on both the multilingual MMBench and MMMB, as well as across a wide range of multimodal tasks.
Paperid:1542
Authors:Yuhui Wang · Qingyuan Wu · Dylan Ashley · Francesco Faccio · Weida Li · Chao Huang · Jürgen Schmidhuber
Abstract:
The Value Iteration Network (VIN) is an end-to-end differentiable neural network architecture for planning. It exhibits strong generalization to unseen domains by incorporating a differentiable planning module that operates on a latent Markov Decision Process (MDP). However, VINs struggle to scale to long-term and large-scale planning tasks, such as navigating a $100\times 100$ maze, a task that typically requires thousands of planning steps to solve. We observe that this deficiency is due to two issues: the representation capacity of the latent MDP and the planning module's depth. We address these by augmenting the latent MDP with a dynamic transition kernel, dramatically improving its representational capacity, and, to mitigate the vanishing gradient problem, introduce an "adaptive highway loss" that constructs skip connections to improve gradient flow. We evaluate our method on 2D/3D maze navigation environments, continuous control, and the real-world Lunar rover navigation task. We find that our new method, named Dynamic Transition VIN (DT-VIN), scales to 5000 layers and solves challenging versions of the above tasks. Altogether, we believe that DT-VIN represents a concrete step forward in performing long-term large-scale planning in complex environments.
Paperid:1543
Authors:Yizhuo Chen · Chun-Fu (Richard) Chen · Hsiang Hsu · Shaohan Hu · Tarek Abdelzaher
Abstract:
The growing Machine Learning (ML) services require extensive collections of user data, which may inadvertently include people's private information irrelevant to the services. Various methods have been proposed to protect private attributes by removing them from the data while maintaining the utility of the data for downstream tasks. Nevertheless, as we show theoretically and empirically in this paper, these methods exhibit severe vulnerability because of a common weakness rooted in their adversarial-training-based strategies. To overcome this limitation, we propose a novel approach, PASS, designed to stochastically substitute the original sample with another one according to certain probabilities, which is trained with a novel loss function soundly derived from an information-theoretic objective defined for utility-preserving private attribute protection. A comprehensive evaluation of PASS on various datasets of different modalities, including facial images, human activity sensory signals, and voice recordings, substantiates PASS's effectiveness and generalizability.
Paperid:1544
Authors:Xinyi Wu · Yifei Wang · Stefanie Jegelka · Ali Jadbabaie
Abstract:
Recent studies have revealed various manifestations of position bias in transformer architectures, from the "lost-in-the-middle" phenomenon to attention sinks, yet a comprehensive theoretical understanding of how attention masks and positional encodings shape these biases remains elusive. This paper introduces a novel graph-theoretic framework to analyze position bias in multi-layer attention. Modeling attention masks as directed graphs, we quantify how tokens interact with contextual information based on their sequential positions. We uncover two key insights: First, causal masking inherently biases attention toward earlier positions, as tokens in deeper layers attend to increasingly more contextualized representations of earlier tokens. Second, we characterize the competing effects of the causal mask and relative positional encodings, such as the decay mask and rotary positional encoding (RoPE): while both mechanisms introduce distance-based decay within individual attention maps, their aggregate effect across multiple attention layers, coupled with the causal mask, leads to a trade-off between the long-term decay effects and the cumulative importance of early sequence positions. Through controlled numerical experiments, we not only validate our theoretical findings but also reproduce position biases observed in real-world LLMs. Our framework offers a principled foundation for understanding positional biases in transformers, shedding light on the complex interplay of attention mechanism components and guiding more informed architectural design.
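The graph view of causal masking can be made concrete with a toy computation: counting, per source position, the directed attention paths that reach the last token across layers. This is an illustrative quantity in the spirit of the abstract's framework, not the paper's actual bias measure; it shows one sense in which earlier positions accumulate strictly more influence.

```python
def attention_path_counts(seq_len, n_layers):
    """Number of directed paths from each input position j to the final
    token after n_layers of causally masked attention, where information
    flows from position j to every position i >= j at each layer.
    Equivalent to one row of A^L for the lower-triangular all-ones
    adjacency matrix A, computed here with running suffix sums."""
    row = [0] * seq_len
    row[-1] = 1                       # start from the last token, top layer
    for _ in range(n_layers):
        suffix, new = 0, [0] * seq_len
        for j in range(seq_len - 1, -1, -1):
            suffix += row[j]          # sum over receivers i >= j
            new[j] = suffix
        row = new
    return row                        # row[j] = paths from input position j
```

For example, `attention_path_counts(4, 2)` gives `[4, 3, 2, 1]`: after two causally masked layers, the first token reaches the last one along four times as many paths as the last token reaches itself.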
Paperid:1545
Authors:Lukas Braun · Erin Grant · Andrew Saxe
Abstract:
A foundational principle of connectionism is that perception, action, and cognition emerge from parallel computations among simple, interconnected units that generate and rely on neural representations. Accordingly, researchers employ multivariate pattern analysis to decode and compare the neural codes of artificial and biological networks, aiming to uncover these networks' underlying functions. However, there is limited analytical understanding of the relationship between representation and function in connectionist systems. Here, we analytically study the solution manifold of two-layer linear networks, demonstrating that function and representation are dissociated. Numerical simulations further show that this dissociation persists in nonlinear networks. We prove that among networks implementing the same function, those that are robust to noise and support generalisation and transfer employ task-specific neural representations. Based on these findings, we hypothesize that representational alignment emerges from computational advantages rather than functional alignment alone, with significant implications for interpreting and comparing the representations of connectionist systems.
Paperid:1546
Authors:Lexiang Hu · Yisen Wang · Zhouchen Lin
Abstract:
Kolmogorov-Arnold Networks (KANs) have seen great success in scientific domains thanks to spline activation functions, becoming an alternative to Multi-Layer Perceptrons (MLPs). However, spline functions may not respect symmetry in tasks, which is crucial prior knowledge in machine learning. In this paper, we propose Equivariant Kolmogorov-Arnold Networks (EKAN), a method for incorporating arbitrary matrix group equivariance into KANs, aiming to broaden their applicability to more fields. We first construct gated spline basis functions, which form the EKAN layer together with equivariant linear weights, and then define a lift layer to align the input space of EKAN with the feature space of the dataset, thereby building the entire EKAN architecture. Compared with baseline models, EKAN achieves higher accuracy with smaller datasets or fewer parameters on symmetry-related tasks, such as particle scattering and the three-body problem, often reducing test MSE by several orders of magnitude. Even in non-symbolic formula scenarios, such as top quark tagging with three jet constituents, EKAN achieves comparable results with state-of-the-art equivariant architectures using fewer than $40\%$ of the parameters, while KANs do not outperform MLPs as expected.
Paperid:1547
Authors:Lian Liu · Xiandong Zhao · Guanchen Li · Dong Li · Wang · Yinhe Han · Xiaowei Li · ying wang
Abstract:
One-shot post-training pruning enhances the deployment of billion-scale large language models (LLMs), with the pruning metric playing a pivotal role in determining which weights to remove. However, existing metrics underperform due to their reliance on a simple symbolic combination of weights and activations, overlooking imbalanced weight magnitudes and the disproportionate influence of activation outliers. To overcome these limitations, we introduce \textbf{BaWA}, a novel pruning metric that systematically \underline{Ba}lances \underline{W}eight and \underline{A}ctivation distributions for more effective pruning. BaWA introduces two key innovations: \textbf{magnitude normalization}, which mitigates weight imbalance across channels for fairer pruning decisions, and \textbf{outlier regularization}, which reduces the impact of activation outliers, ensuring more appropriate channel prioritization. To further enhance its effectiveness, BaWA incorporates an efficient and automatic framework for optimizing normalization and regularization hyperparameters. Extensive experiments validate BaWA as a state-of-the-art (SOTA) pruning metric. For instance, applying BaWA to induce 2:4 sparsity in Mistral-7B reduces perplexity in language comprehension by 2.49 and improves average downstream task accuracy by 3.08\%, outperforming the previous SOTA method Wanda.
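The two ingredients named above can be sketched on top of a Wanda-style score (|weight| times input-activation norm). Both the per-channel mean normalisation and the median-based clipping below are guesses at the general shape of the idea, not BaWA's actual formulas or its hyperparameter-search framework.

```python
def bawa_style_scores(W, act_norms, outlier_cap=2.0):
    """Sketch of a BaWA-flavoured pruning metric. Starting from the
    Wanda-style score |W_ij| * ||x_j||, it (1) normalises weight
    magnitudes per output channel and (2) clips activation norms at a
    multiple of their median to damp outliers. Both steps here are
    illustrative assumptions; weights with the lowest scores would be
    pruned."""
    med = sorted(act_norms)[len(act_norms) // 2]
    # Outlier regularisation: cap activation norms relative to the median.
    capped = [min(a, outlier_cap * med) for a in act_norms]
    scores = []
    for row in W:
        mean_mag = sum(abs(w) for w in row) / len(row) or 1.0
        # Magnitude normalisation: compare weights within a channel fairly.
        scores.append([abs(w) / mean_mag * a for w, a in zip(row, capped)])
    return scores
```

Without the cap, a single outlier input channel would dominate every score in its column; without the normalisation, channels with systematically large weights would crowd out the rest, which are the two failure modes the abstract attributes to simpler metrics.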
Paperid:1548
Authors:Haozhe Ma · Fangling Li · Jing Lim · Zhengding Luo · Thanh Vinh Vo · Tze-Yun Leong
Abstract:
Existing reward shaping techniques for sparse-reward reinforcement learning generally fall into two categories: novelty-based exploration bonuses and significance-based hidden state values. The former promotes exploration but can lead to distraction from task objectives, while the latter facilitates stable convergence but often lacks sufficient early exploration. To address these limitations, we propose Dual Random Networks Distillation (DuRND), a novel reward shaping framework that efficiently balances exploration and exploitation in a unified mechanism. DuRND leverages two lightweight random network modules to simultaneously compute two complementary rewards: a novelty reward to encourage directed exploration and a contribution reward to assess progress toward task completion. With low computational overhead, DuRND excels in high-dimensional environments with challenging sparse rewards, such as Atari, VizDoom, and MiniWorld, outperforming several benchmarks.
Paperid:1549
Authors:Yana Wei · Liang Zhao · Kangheng Lin · En Yu · Yuang Peng · Runpei Dong · Jianjian Sun · Haoran Wei · Zheng Ge · Xiangyu Zhang · Vishal Patel
Abstract:
We present a perception in reflection paradigm designed to transcend the limitations of current large vision-language models (LVLMs), which are expected yet often fail to achieve perfect perception initially. Specifically, we propose Reflective Perception (RePer), a dual-model reflection mechanism that systematically alternates between policy and critic models, enabling iterative refinement of visual perception. This framework is powered by Reflective Perceptual Learning (RPL), which reinforces intrinsic reflective capabilities through a methodically constructed visual reflection dataset and reflective unlikelihood training. Comprehensive experimental evaluation demonstrates RePer's quantifiable improvements in image understanding, captioning precision, and hallucination reduction. Notably, RePer achieves strong alignment between model attention patterns and human visual focus, while RPL optimizes fine-grained and free-form preference alignment. These advancements establish perception in reflection as a robust paradigm for future multimodal agents, particularly in tasks requiring complex reasoning and multi-step manipulation.
Paperid:1550
Authors:Anqi Mao · Mehryar Mohri · Yutao Zhong
Abstract:
In applications with significant class imbalance or asymmetric costs, metrics such as the $F_\beta$-measure, AM measure, Jaccard similarity coefficient, and weighted accuracy offer more suitable evaluation criteria than standard binary classification loss. However, optimizing these metrics presents significant computational and statistical challenges. Existing approaches often rely on the characterization of the Bayes-optimal classifier, and use threshold-based methods that first estimate class probabilities and then seek an optimal threshold. This leads to algorithms that are not tailored to restricted hypothesis sets and lack finite-sample performance guarantees. In this work, we introduce principled algorithms for optimizing generalized metrics, supported by $H$-consistency and finite-sample generalization bounds. Our approach reformulates metric optimization as a generalized cost-sensitive learning problem, enabling the design of novel surrogate loss functions with provable $H$-consistency guarantees. Leveraging this framework, we develop new algorithms, METRO (*Metric Optimization*), with strong theoretical performance guarantees. We report the results of experiments demonstrating the effectiveness of our methods compared to prior baselines.
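As a concrete reference point, the $F_\beta$ measure named above is the standard weighted harmonic mean of precision and recall (textbook formula, not code from the paper):

```python
def f_beta(tp: int, fp: int, fn: int, beta: float) -> float:
    """F_beta from confusion-matrix counts; recall is weighted beta times
    as heavily as precision. beta = 1 recovers the usual F1 score."""
    b2 = beta * beta
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)
```

Because such ratio-of-counts metrics are not optimized by a fixed 0.5 threshold on class probabilities, plug-in approaches need the Bayes-optimal threshold characterization the abstract refers to.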
Paperid:1551
Authors:Xiangzhe Xu · Zian Su · Jinyao Guo · Kaiyuan Zhang · Zhenting Wang · Xiangyu Zhang
Abstract:
Recent advances in code-specific large language models (LLMs) have greatly enhanced code generation and refinement capabilities. However, the safety of code LLMs remains under-explored, posing potential risks as insecure code generated by these models may introduce vulnerabilities into real-world systems. Previous work proposes to collect a security-focused instruction-tuning dataset from real-world vulnerabilities. However, this approach is constrained by the data sparsity of vulnerable code and has limited applicability in the iterative post-training workflows of modern LLMs. In this paper, we propose ProSec, a novel proactive security alignment approach designed to align code LLMs with secure coding practices. ProSec systematically exposes the vulnerabilities in a code LLM by synthesizing error-inducing coding scenarios from Common Weakness Enumerations (CWEs), and generates fixes to vulnerable code snippets, allowing the model to learn secure practices through advanced preference learning objectives. The scenarios synthesized by ProSec trigger 25 times more vulnerable code than a normal instruction-tuning dataset, resulting in a security-focused alignment dataset 7 times larger than the previous work. Experiments show that models trained with ProSec are 25.2\% to 91.4\% more secure compared to previous work without degrading models' utility.
Paperid:1552
Authors:Ju-Seung Byun · Andrew Perrault
Abstract:
Reinforcement learning (RL) training is inherently unstable due to factors such as moving targets and high gradient variance. Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) introduce additional challenges. For instance, diverse preferences complicate the alignment process, and prediction errors in a trained reward model can become more severe as the LLM generates unseen outputs. These RL challenges create confusion about whether the probability of an action for a given state should be increased or decreased, similar to the noise in labels for classification tasks. In this work, we focus on RL algorithms that share learning difficulties with cross-entropy loss, especially for low-probability predictions. To enhance stability, we adapt reverse cross-entropy (RCE) from supervised learning for noisy data, defining a symmetric RL loss. We demonstrate performance improvements across various tasks and scales. We conduct experiments in discrete action tasks (Atari games) and continuous action space tasks (MuJoCo benchmark and Box2D) using Symmetric A2C (SA2C) and Symmetric PPO (SPPO). Notably, SPPO shows strong performance across different hyperparameters. Furthermore, we validate the symmetric RL loss in the RLHF framework using PPO for natural language processing tasks such as IMDB positive sentiment and TL;DR summarization.
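A minimal sketch of the supervised ingredient, adapted from symmetric cross-entropy for noisy labels (hedged: the `alpha`, `beta`, and log-zero clip `A` values, and the advantage-weighted combination, are illustrative assumptions rather than the paper's exact SA2C/SPPO objectives):

```python
import numpy as np

def symmetric_policy_loss(logits, actions, advantages,
                          alpha=1.0, beta=0.1, A=-4.0):
    """Symmetric CE-style loss: forward CE plus reverse CE (RCE), where the
    one-hot 'label' is the taken action and log(0) is clipped to A < 0.
    RCE = -sum_k pi(k|s) * log(one_hot(a))_k = -A * (1 - pi(a|s))."""
    z = logits - logits.max(axis=1, keepdims=True)        # stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    p_a = probs[np.arange(len(actions)), actions]
    ce = -np.log(p_a + 1e-12)     # standard policy-gradient CE term
    rce = -A * (1.0 - p_a)        # reverse term: bounded, hence noise-robust
    return float((advantages * (alpha * ce + beta * rce)).mean())
```

The reverse term is bounded even when `p_a` is tiny, which is exactly the low-probability regime where plain cross-entropy gradients explode under label noise.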
Paperid:1553
Authors:Junhyuck Kim · Jong Ho Park · Jaewoong Cho · Dimitris Papailiopoulos
Abstract:
We introduce Lexico, a novel KV cache compression method that leverages sparse coding with a universal dictionary. Our key finding is that the key-value cache in modern LLMs can be accurately approximated using sparse linear combinations from a small, input-agnostic dictionary of ~4k atoms, enabling efficient compression across different input prompts, tasks, and models. Using orthogonal matching pursuit for sparse approximation, Lexico achieves flexible compression ratios through direct sparsity control. On GSM8K, across multiple model families (Mistral, Llama 3, Qwen2.5), Lexico maintains 90-95% of the original performance while using only 15-25% of the full KV-cache memory, outperforming both quantization and token eviction methods. Notably, Lexico remains effective in low memory regimes where 2-bit quantization fails, achieving up to 1.7x better compression on LongBench and GSM8K while maintaining high accuracy.
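The core primitive, orthogonal matching pursuit against a fixed dictionary, can be sketched as follows (a generic OMP over a random dictionary for illustration; Lexico's learned universal dictionary and per-layer details are not reproduced here):

```python
import numpy as np

def omp(D, x, s):
    """Greedy s-sparse approximation of x over the columns (atoms) of D."""
    residual = x.copy()
    support = []
    for _ in range(s):
        # pick the atom most correlated with what is still unexplained
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k not in support:
            support.append(k)
        # least-squares refit of all selected atoms (the "orthogonal" step)
        coefs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coefs
    z = np.zeros(D.shape[1])
    z[support] = coefs
    return z
```

Storing only `s` coefficient/index pairs per cached vector instead of the full vector is what yields the compression ratio, with `s` giving direct sparsity control.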
Paperid:1554
Authors:Jiarui Jin · Yuwei Wu · Xiaoting He · Haoxuan Li · Weinan Zhang · Yiming Yang · Yong Yu · Jun Wang · Mengyue Yang
Abstract:
In-context learning with large language models (LLMs) delivers strong few-shot performance by choosing few-shot demonstrations from the entire training dataset. However, previous few-shot in-context learning methods, which calculate similarity scores for choosing demonstrations, incur high computational costs by repeatedly retrieving large-scale datasets for each query. This is due to their failure to recognize that not all demonstrations are equally informative, and many less informative demonstrations can be inferred from a core set of highly informative ones. To this end, we propose FEEDER (FEw yet Essential Demonstration prE-selectoR), a novel \emph{pre-selection} framework that identifies a core subset of demonstrations containing the most informative examples. This subset, referred to as the FEEDER set, consists of demonstrations that capture both the ''sufficiency'' and ''necessity'' information needed to infer the entire dataset. Since FEEDER is selected before few-shot in-context learning, it enables more efficient demonstration selection from a smaller set. To identify FEEDER, we propose a novel and effective tree-based algorithm. Once selected, it can replace the original dataset, leading to improved efficiency and prediction accuracy in few-shot in-context learning. Additionally, FEEDER also benefits fine-tuning LLMs: we propose a bi-level optimization method that enables more efficient training without sacrificing performance as datasets become smaller. Our experiments, on 6 text classification datasets, 1 reasoning dataset, and 1 semantic-parsing dataset, across 6 LLMs (ranging from 335M to 7B parameters), demonstrate that: (i) In few-shot inference, FEEDER achieves superior (or comparable) performance while utilizing only half the input training data. (ii) In fine-tuning, FEEDER significantly boosts the performance of LLMs.
Paperid:1555
Authors:Xingcheng Zhou · Konstantinos Larintzakis · Hao Guo · Walter Zimmer · Mingyu Liu · Hu Cao · Jiajie Zhang · Venkatnarayanan Lakshminarasimhan · Leah Strand · Alois Knoll
Abstract:
We present TraffiX-VideoQA, a novel dataset and benchmark designed for spatio-temporal video understanding in complex roadside traffic scenarios. The dataset comprises 1,000 videos, featuring 85,000 multiple-choice QA pairs, 2,300 object captioning annotations, and 5,700 object grounding annotations, encompassing diverse real-world conditions such as adverse weather and traffic anomalies. By incorporating tuple-based spatio-temporal object expressions, TraffiX-VideoQA unifies three essential tasks—multiple-choice video question answering, referred object captioning, and spatio-temporal object grounding—within a cohesive evaluation framework. We further introduce the TraffiX-Qwen baseline model, enhanced with visual token sampling strategies, providing valuable insights into the challenges of fine-grained spatio-temporal reasoning. Extensive experiments demonstrate the dataset’s complexity, highlight the limitations of existing models, and position TraffiX-VideoQA as a robust foundation for advancing research in intelligent transportation systems. The dataset and benchmark are publicly available to facilitate further exploration.
Paperid:1556
Authors:Haoyang Li · Xin Wang · Xueling Zhu · Weigao Wen · Wenwu Zhu
Abstract:
Graph neural networks have been widely studied in both academia and industry. Most of the popular graph neural networks are developed with the in-distribution hypothesis and fail to generalize to out-of-distribution environments, where their performance can degrade significantly under distribution shifts. To tackle this problem, some graph invariant learning methods are proposed to learn invariant subgraphs against distribution shifts, which heavily rely on predefined or automatically generated environment labels. However, directly annotating or estimating such environment labels from biased graph data is usually impractical or even incorrect for real-world graph datasets. For example, when learning invariant subgraphs on an imbalanced graph dataset, GNNs could be biased towards variant patterns, resulting in poor out-of-distribution (OOD) generalization. In this paper, we propose to learn disentangled invariant subgraphs via self-supervised contrastive variant subgraph estimation for achieving satisfying OOD generalization. Specifically, we first propose a GNN-based invariant subgraph generator to disentangle the invariant and variant subgraphs for each input graph. Then, we propose to estimate the degree of the spurious correlations by conducting self-supervised contrastive learning on variant subgraphs. Thanks to the accurate identification and estimation of the variant subgraphs, we can capture invariant subgraphs accurately and further get rid of spurious correlations by adopting inverse propensity scores to reweight the invariant subgraph predictions. Extensive experiments show the superiority of our proposed method over several state-of-the-art baselines.
Paperid:1557
Authors:Kaifeng Gao · Jiaxin Shi · Hanwang Zhang · Chunping Wang · Jun Xiao · Long Chen
Abstract:
With the advance of diffusion models, today's video generation has achieved impressive quality. To extend the generation length and facilitate real-world applications, a majority of video diffusion models (VDMs) generate videos in an autoregressive manner, i.e., generating subsequent clips conditioned on the last frame(s) of the previous clip. However, existing autoregressive VDMs are highly inefficient and redundant: The model must re-compute all the conditional frames that are overlapped between adjacent clips. This issue is exacerbated when the conditional frames are extended autoregressively to provide the model with long-term context. In such cases, the computational demands increase significantly (i.e., with a quadratic complexity w.r.t. the autoregression step). In this paper, we propose Ca2-VDM, an efficient autoregressive VDM with Causal generation and Cache sharing. For causal generation, it introduces unidirectional feature computation, which ensures that the cache of conditional frames can be precomputed in previous autoregression steps and reused in every subsequent step, eliminating redundant computations. For cache sharing, it shares the cache across all denoising steps to avoid the huge cache storage cost. Extensive experiments demonstrate that our Ca2-VDM achieves state-of-the-art quantitative and qualitative video generation results and significantly improves the generation speed.
Paperid:1558
Authors:Sungwon Kim · Namkyeong Lee · Yunyoung Doh · Seungmin Shin · Guimok Cho · Seung-Won Jeon · Sangkook Kim · Chanyoung Park
Abstract:
Mesh graph-based 3D static analysis methods have recently emerged as efficient alternatives to traditional computational numerical solvers, significantly reducing computational costs and runtime for various physics-based analyses. However, these methods primarily focus on surface topology and geometry, often overlooking the inherent thickness of real-world 3D objects, which exhibits high correlations and similar behavior between opposing surfaces. This limitation arises from the disconnected nature of these surfaces and the absence of internal edge connections within the mesh. In this work, we propose a novel framework, the Thickness-aware E(3)-Equivariant Mesh Neural Network (T-EMNN), that effectively integrates the thickness of 3D objects while maintaining the computational efficiency of surface meshes. Additionally, we introduce data-driven coordinates that encode spatial information while preserving E(3)-equivariance and invariance properties, ensuring consistent and robust analysis. Evaluations on a real-world industrial dataset demonstrate the superior performance of T-EMNN in accurately predicting node-level 3D deformations, effectively capturing thickness effects while maintaining computational efficiency.
Paperid:1559
Authors:Hongsheng Li · Yu Qi · Renrui Zhang · Liuhui Wang · Dongzhi Jiang · Xinyan Chen · Claire Guo · Jianhan Jin · Yanwei Li · Bo Zhang · Chaoyou Fu · Peng Gao · Ziyu Guo · Shen Yan
Abstract:
Answering questions with Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), yet its impact on Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth investigation. In this paper, we introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs, spanning six domains: math, science, OCR, logic, space-time, and general scenes. As the first comprehensive study in this area, we propose a thorough evaluation suite incorporating three novel metrics that assess the reasoning quality, robustness, and efficiency at a fine-grained level. Leveraging curated high-quality data and a unique evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs, uncovering several key insights: 1) The models with reflection mechanism, such as QVQ, demonstrate a superior CoT quality, approaching GPT-4o; 2) CoT prompting often degrades LMM performance on perception-heavy tasks, suggesting a potentially harmful overthinking behavior; and 3) Although the CoT quality is high, LMMs with reflection exhibit significant inefficiency in both normal response and self-correction phases. We hope MME-CoT serves as a foundation for advancing multimodal reasoning in LMMs.
Paperid:1560
Authors:Le-Trung Nguyen · Aël Quélennec · Van-Tam Nguyen · Enzo Tartaglione
Abstract:
On-device learning has emerged as a promising direction for AI development, particularly because of its potential to reduce latency issues and mitigate privacy risks associated with device-server communication, while improving energy efficiency. Despite these advantages, significant memory and computational constraints still represent major challenges for its deployment. Drawing on previous studies on low-rank decomposition methods that address activation memory bottlenecks in backpropagation, we propose a novel shortcut approach as an alternative. Our analysis and experiments demonstrate that our method can reduce activation memory usage, even up to $120.09\times$ compared to vanilla training, while also reducing overall training FLOPs up to $1.86\times$ when evaluated on traditional benchmarks.
Paperid:1561
Authors:Federico Errica · Henrik Christiansen · Viktor Zaverkin · Takashi Maruyama · Mathias Niepert · Francesco Alesiani
Abstract:
Long-range interactions are essential for the correct description of complex systems in many scientific fields. The price to pay for including them in the calculations, however, is a dramatic increase in the overall computational costs. Recently, deep graph networks have been employed as efficient, data-driven models for predicting properties of complex systems represented as graphs. These models rely on a message passing strategy that should, in principle, capture long-range information without explicitly modeling the corresponding interactions. In practice, most deep graph networks cannot really model long-range dependencies due to the intrinsic limitations of (synchronous) message passing, namely oversmoothing, oversquashing, and underreaching. This work proposes a general framework that learns to mitigate these limitations: within a variational inference framework, we endow message passing architectures with the ability to adapt their depth and filter messages along the way. With theoretical and empirical arguments, we show that this strategy better captures long-range interactions, by surpassing the state of the art on five node and graph prediction datasets suited for this problem.
Paperid:1562
Authors:Jacob Mitchell Springer · Sachin Goyal · Kaiyue Wen · Tanishq Kumar · Xiang Yue · Sadhika Malladi · Graham Neubig · Aditi Raghunathan
Abstract:
Scaling up pretraining compute has been a key driver of recent advancements in language model capabilities. While the training-compute-optimal strategy suggests proportionally scaling both model size and pre-training tokens, a recent trend has emerged where models are overtrained on significantly more tokens to reduce inference costs by serving smaller models. We demonstrate in this work an important but previously undocumented consequence of this common design choice. While pre-training loss indeed decreases monotonically with overtraining on more tokens, we provide both empirical and theoretical evidence that extreme overtraining of language models can make them harder to fine-tune. Concretely, this manifests as language models overtrained to extreme token budgets performing worse after fine-tuning---both in-distribution and out-of-distribution---than models trained on moderate token budgets. We provide a theoretical characterization of this finding for linear models. Altogether, our results provide evidence for the surprising fact that larger amounts of pre-training compute may not always result in stronger models after post-training, and that there may be a tradeoff between longer pre-training and downstream adaptability of the resulting model.
Paperid:1563
Authors:HE LI · Haoang Chi · Mingyu Liu · Wanrong Huang · Liyang Xu · Wenjing Yang
Abstract:
The real world naturally has dimensions of time and space. Therefore, estimating the counterfactual outcomes with spatial-temporal attributes is a crucial problem. However, previous methods are based on classical statistical models, which still have limitations in performance and generalization. This paper proposes a novel framework for estimating counterfactual outcomes with spatial-temporal attributes using the Transformer, exhibiting stronger estimation ability. Under mild assumptions, the proposed estimator within this framework is consistent and asymptotically normal. To validate the effectiveness of our approach, we conduct simulation experiments and real data experiments. Simulation experiments show that our estimator has a stronger estimation capability than baseline methods. Real data experiments provide a valuable conclusion to the causal effect of conflicts on forest loss in Colombia.
Paperid:1564
Authors:William Huey · Huaxiaoyue Wang · Anne Wu · Yoav Artzi · Sanjiban Choudhury
Abstract:
We examine the problem of learning sequential tasks from a single visual demonstration. A key challenge arises when demonstrations are temporally misaligned due to variations in timing, differences in embodiment, or inconsistencies in execution. Existing approaches treat imitation as a distribution-matching problem, aligning individual frames between the agent and the demonstration. However, we show that such frame-level matching fails to enforce temporal ordering or ensure consistent progress. Our key insight is that matching should instead be defined at the level of sequences. We propose that perfect matching occurs when one sequence successfully covers all the subgoals in the same order as the other sequence. We present ORCA (ORdered Coverage Alignment), a dense per-timestep reward function that measures the probability of the agent covering demonstration frames in the correct order. On temporally misaligned demonstrations, we show that agents trained with the ORCA reward achieve $4.5$x improvement ($0.11 \rightarrow 0.50$ average normalized returns) for Meta-world tasks and $6.6$x improvement ($6.55 \rightarrow 43.3$ average returns) for Humanoid-v4 tasks compared to the best frame-level matching algorithms. We also provide empirical analysis showing that ORCA is robust to varying levels of temporal misalignment.
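A toy, deterministic version of ordered coverage (hedged: ORCA's actual reward is probabilistic over per-frame matching scores; here matches are binary and hypothetical, to show only the ordering constraint):

```python
def ordered_coverage(agent_matches, n_demo):
    """Per-timestep fraction of demo frames covered *in order*.

    agent_matches[t] is the index of the demo frame the agent's frame t
    resembles, or None if it resembles nothing. A demo frame only counts
    once all earlier demo frames have already been covered, so visiting
    subgoals out of order earns nothing.
    """
    covered = 0
    rewards = []
    for m in agent_matches:
        if m == covered:      # only the next uncovered subgoal advances progress
            covered += 1
        rewards.append(covered / n_demo)
    return rewards
```

Frame-level matching would reward the out-of-order visits too, which is exactly the failure mode the abstract describes.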
Paperid:1565
Authors:Muhammed Göktepe · Amir Hossein Shamseddin · Erencan Uysal · Javier Monteagudo · Lukas Drees · Aysim Toker · Senthold Asseng · Malte von Bloh
Abstract:
Satellite imagery is essential for Earth observation, enabling applications like crop yield prediction, environmental monitoring, and climate-change assessment. However, integrating satellite imagery with climate data remains a challenge, limiting its utility for forecasting and scenario analysis. We introduce a novel dataset of 2.7 million Sentinel-2 images spanning 15 land cover types with corresponding climate records, forming the foundation for two satellite image generation approaches using fine-tuned Stable Diffusion 3 models. The first is a text-to-image generation model that uses textual prompts with climate and land cover details to produce realistic synthetic imagery for specific regions. The second leverages ControlNet for multi-conditional image generation, preserving spatial structures while mapping climate data or generating time-series to simulate landscape evolution. By combining synthetic image generation with climate and land cover data, our work advances generative modeling in remote sensing, offering realistic inputs for environmental forecasting and new possibilities for climate adaptation and geospatial analysis.
Paperid:1566
Authors:Floor Eijkelboom · Heiko Zimmermann · Sharvaree Vadgama · Erik Bekkers · Max Welling · Christian Andersson Naesseth · Jan-Willem van de Meent
Abstract:
We derive a controlled generation objective within the framework of Variational Flow Matching (VFM), which casts flow matching as a variational inference problem. We demonstrate that controlled generation can be implemented in two ways: (1) by way of end-to-end training of conditional generative models, or (2) as a Bayesian inference problem, enabling post hoc control of unconditional models without retraining. Furthermore, we establish the conditions required for equivariant generation and provide an equivariant formulation of VFM tailored for molecular generation, ensuring invariance to rotations, translations, and permutations. We evaluate our approach on both uncontrolled and controlled molecular generation, achieving state-of-the-art performance on uncontrolled generation and outperforming state-of-the-art models in controlled generation, both with end-to-end training and in the Bayesian inference setting. This work strengthens the connection between flow-based generative modeling and Bayesian inference, offering a scalable and principled framework for constraint-driven and symmetry-aware generation.
Paperid:1567
Authors:Christopher Subich · Syed Husain · Leo Separovic · Jing Yang
Abstract:
Recent advancements in data-driven weather forecasting models have delivered deterministic models that outperform the leading operational forecast systems based on traditional, physics-based models. However, these data-driven models are typically trained with a mean squared error loss function, which causes smoothing of fine scales through a ``double penalty'' effect. We develop a simple, parameter-free modification to this loss function that avoids this problem by separating the loss attributable to decorrelation from the loss attributable to spectral amplitude errors. Fine-tuning the GraphCast model with this new loss function results in sharp deterministic weather forecasts, an increase of the model's effective resolution from 1,250km to 160km, improvements to ensemble spread, and improvements to predictions of tropical cyclone strength and surface wind extremes.
Paperid:1568
Authors:Yun Hua · Haosheng Chen · Wenhao Li · Bo Jin · Baoxiang Wang · Hongyuan Zha · Xiangfeng Wang
Abstract:
Addressing reward design complexities in deep reinforcement learning is facilitated by knowledge transfer across different domains. To this end, we define \textit{reward translation} to describe the cross-domain reward transfer problem. However, current methods struggle with non-pairable and non-time-alignable incompatible MDPs. This paper presents an adaptable reward translation framework, \textit{neural reward translation}, featuring \textit{semi-alignable MDPs}, which allows efficient reward translation under relaxed constraints while handling the intricacies of incompatible MDPs. Given the inherent difficulty of directly mapping semi-alignable MDPs and transferring rewards, we introduce an indirect mapping method through reward machines, created using limited human input or LLM-based automated learning. Graph-matching techniques establish links between reward machines from distinct environments, thus enabling cross-domain reward translation within semi-alignable MDP settings. This broadens the applicability of DRL across multiple domains. Experiments substantiate our approach's effectiveness in tasks under environments with semi-alignable MDPs.
Paperid:1569
Authors:Chase Goddard · Lindsay Smith · Wave Ngampruetikorn · David Schwab
Abstract:
In-context learning (ICL) is a remarkable capability of pretrained transformers that allows models to generalize to unseen tasks after seeing only a few examples. We investigate empirically the conditions necessary on the pretraining distribution for ICL to emerge and generalize out-of-distribution. Previous work has focused on the number of distinct tasks necessary in the pretraining dataset -- here, we use a different notion of task diversity to study the emergence of ICL in transformers trained on linear functions. We find that as task diversity increases, transformers undergo a transition from a specialized solution, which exhibits ICL only within the pretraining distribution, to a solution which generalizes out of distribution to the entire task space. We also investigate the nature of the solutions learned by the transformer on both sides of the transition, and observe similar transitions in nonlinear regression problems. We construct a phase diagram to characterize how our concept of task diversity interacts with the number of pretraining tasks. In addition, we explore how factors such as the depth of the model and the dimensionality of the regression problem influence the transition.
Paperid:1570
Authors:Rom N. Parnichkun · Neehal Tumma · Armin Thomas · Alessandro Moro · Qi An · Taiji Suzuki · Atsushi Yamashita · Michael Poli · Stefano Massaroli
Abstract:
As the space of causal sequence modeling architectures continues to grow, the need to develop a general framework for their analysis becomes increasingly important. With this aim, we draw insights from classical signal processing and control theory, to develop a quantitative measure of memory utilization: the internal mechanisms through which a model stores past information to produce future outputs. This metric, which we call effective state-size (ESS), is tailored to the fundamental class of systems with input-invariant and input-varying linear operators, encompassing a variety of computational units such as variants of attention, convolutions, and recurrences. Unlike prior work on memory utilization, which either relies on raw operator visualizations (e.g. attention maps), or simply the total memory capacity (i.e. cache size) of a model, our metric provides highly interpretable and actionable measurements. In particular, we show how ESS can be leveraged to improve initialization strategies, inform novel regularizers and advance the performance-efficiency frontier through model distillation. Furthermore, we demonstrate that the effect of context delimiters (such as end-of-speech tokens) on ESS highlights cross-architectural differences in how large language models utilize their available memory to recall information. Overall, we find that ESS provides valuable insights into the dynamics that dictate memory utilization, enabling the design of more efficient and effective sequence models.
Paperid:1571
Authors:Andrey Kofnov · Daniel Kapla · Ezio Bartocci · Efstathia Bura
Abstract:
We derive exact upper and lower bounds for the cumulative distribution function (cdf) of the output of a neural network over its entire support subject to noisy (stochastic) inputs. The upper and lower bounds converge to the true cdf over its domain as the resolution increases. Our method applies to any feedforward NN using continuous monotonic piecewise differentiable activation functions (e.g., ReLU, tanh and softmax) and convolutional NNs, which were beyond the scope of competing approaches. The novelty and an instrumental tool of our approach is to bound general NNs with ReLU NNs. The ReLU NN based bounds are then used to derive upper and lower bounds of the cdf of the NN output. Experiments demonstrate that our method delivers guaranteed bounds of the predictive output distribution over its support, thus providing exact error guarantees, in contrast to competing approaches.
Paperid:1572
Authors:Stathi Fotiadis · Noah Brenowitz · Tomas Geffner · Yair Cohen · Michael Pritchard · Arash Vahdat · Morteza Mardani
Abstract:
Conditional diffusion and flow models are effective for super-resolving small-scale details in natural images. However, in physical sciences such as weather, three major challenges arise: (i) spatially misaligned input-output distributions (PDEs at different resolutions lead to divergent trajectories), (ii) misaligned and distinct input-output channels (channel synthesis), and (iii) several channels with diverse stochasticity scales (multiscale). To address these, we propose to first encode inputs into a latent base distribution that is closer to the target, then apply Flow Matching to generate small-scale physics. The encoder captures deterministic components, while Flow Matching adds stochastic details. To handle uncertainty in the deterministic part, we inject noise via an adaptive noise scaling mechanism, dynamically adjusted by maximum-likelihood estimates of the encoder’s predictions. Experiments on real-world weather data (including super-resolution from 25 km to 2 km scales in Taiwan) and in synthetic Kolmogorov flow datasets show that our proposed Adaptive Flow Matching (AFM) framework outperforms existing methods and produces better-calibrated ensembles.
Paperid:1573
Authors:Lukas Helff · Felix Friedrich · Manuel Brack · Kristian Kersting · Patrick Schramowski
Abstract:
This paper introduces Llavaguard, a suite of VLM-based vision safeguards that addresses the critical need for reliable tools in the era of large-scale data and models. To this end, we establish a novel open framework, describing a customizable safety taxonomy, data preprocessing, augmentation, and training setup. To teach a VLM safeguard about safety, we further create a multimodal safety dataset with high-quality human expert annotations, where each image is labeled with a safety rating, category and rationale. We also employ advanced augmentations to support context-specific assessments. The resulting Llavaguard models, ranging from 0.5B to 7B, serve as a versatile tool for evaluating the safety compliance of visual content against flexible policies. In comprehensive experiments, Llavaguard outperforms both state-of-the-art safeguards and VLMs in accuracy \textit{and} in flexibly handling different policies. Additionally, we demonstrate Llavaguard's performance in two real-world applications: large-scale dataset annotation and moderation of text-to-image models. We make our entire framework publicly available\footnote{\url{https://anonymous.4open.science/r/LlavaGuard-FE7F}}, including the dataset and model weights.
Paperid:1574
Authors:Mingyuan Li · Jiahao Wang · Wei Ni · Guangsheng Yu · Xu Wang · Haipeng Peng · Lixiang Li · Qianrun Chen
Abstract:
Reinforcement Learning (RL) helps reduce congestion and emissions in Traffic Signal Control (TSC). However, real-world TSC systems often encounter issues like adversarial attacks and missing data, which can lead to incorrect signal decisions and increased congestion. Current methods can only address one of these problems, often by using offline data to predict real-time data, and thus fail to meet the dynamic, real-time online requirements critical for TSC. In this paper, we propose a novel framework called "RobustLight", which includes an innovative, plug-and-play, improved diffusion model that enhances the robustness of existing TSC platforms against data noise, missing values, and complex patterns by restoring data corrupted by attacks. While keeping the TSC platforms intact, RobustLight provides two algorithms, denoise and repaint, which help the TSC platforms recover the original state after a noise attack and restore the missing data in the original state. RobustLight uses a new dynamic state-infilling algorithm to train the improved diffusion model online; the denoise and repaint algorithms then use the trained diffusion model to address adversarial attacks and missing data. Extensive experiments on various real-world datasets validate the recovery performance of RobustLight, with an improvement of up to 50.43\% compared to the case in which RobustLight is not used. RobustLight enables TSC systems to effectively defend against a wide range of adversarial attacks and missing data. The relevant datasets and code are available at anonymous GitHub.
Paperid:1575
Authors:William Laplante · Matias Altamirano · Andrew Duncan · Jeremias Knoblauch · Francois-Xavier Briol
Abstract:
State-space formulations allow for Gaussian process (GP) regression with linear-in-time computational cost in spatio-temporal settings, but performance typically suffers in the presence of outliers. In this paper, we adapt and specialise the robust and conjugate GP (RCGP) framework of Altamirano et al. (2024) to the spatio-temporal setting. In doing so, we obtain an outlier-robust spatio-temporal GP with a computational cost comparable to classical spatio-temporal GPs. We also overcome the three main drawbacks of RCGPs: their unreliable performance when the prior mean is chosen poorly, their lack of reliable uncertainty quantification, and the need to carefully select a hyperparameter by hand. We study our method extensively in finance and weather forecasting applications, demonstrating that it provides a reliable approach to spatio-temporal modelling in the presence of outliers.
Paperid:1576
Authors:Adam Block · Alexander Rakhlin · Ayush Sekhari
Abstract:
Watermarking, the process by which Large Language Model (LLM) servers embed an imperceptible signal at inference time in order to detect text generated by their own models, has grown in importance due to the significant improvements in natural language processing tasks by modern LLMs. Current approaches are often impractical due to generation latency, detection time, degradation in text quality, or lack of robustness; such problems often arise from the focus on token-level watermarking, which ignores the inherent structure of text. In this work, we introduce a new scheme, GaussMark, that is simple and efficient to implement, has formal statistical guarantees, comes at no cost in generation latency, and embeds the watermark into the weights of the model itself, providing a structural watermark. Our approach is based on Gaussian independence testing and is motivated by recent empirical observations that minor additive corruptions to LLM weights can result in models of identical (or even improved) quality. We provide formal statistical bounds on the validity and power of our procedure and, through an extensive suite of experiments, demonstrate that GaussMark is reliable, efficient, relatively robust to corruption, and can be instantiated with essentially no loss in model quality.
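The statistical core of the abstract (a secret Gaussian perturbation of the weights, detected via a Gaussian independence test) can be illustrated with a toy sketch. This is not the paper's exact procedure: `score` is a stand-in for whatever text-derived statistic the detector correlates with the secret key, and the scale of the leak is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000  # size of the watermarked weight tensor (toy scale)

# Secret key: the small Gaussian perturbation added to the model weights.
key = rng.normal(size=d)

def z_statistic(score, key):
    # Under H0 (text unrelated to the watermarked model), score is
    # independent of key, so the normalized inner product is ~ N(0, 1).
    return key @ score / (np.linalg.norm(key) * score.std())

# Unwatermarked text: its score vector is independent of the key.
z_null = z_statistic(rng.normal(size=d), key)

# Text from the watermarked model: its score leaks a trace of the key
# (the 0.2 leak strength is an assumption for the toy).
z_marked = z_statistic(0.2 * key + rng.normal(size=d), key)
```

With these numbers `z_null` stays near zero while `z_marked` lands far in the tail, which is the separation a hypothesis test needs.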
Paperid:1577
Authors:Jayadev Naram · Fredrik Hellström · Ziming Wang · Rebecka Jörnsten · Giuseppe Durisi
Abstract:
In many scenarios of practical interest, labeled data from a target distribution are scarce while labeled data from a related source distribution are abundant. One particular setting of interest arises when the target label space is a subset of the source label space, leading to the framework of partial domain adaptation (PDA). Typical approaches to PDA involve minimizing a domain alignment term and a weighted empirical loss on the source data, with the aim of transferring knowledge between domains. However, a theoretical basis for this procedure is lacking, and in particular, most existing weighting schemes are heuristic. In this work, we derive generalization bounds for the PDA problem based on partial optimal transport. These bounds corroborate the use of the partial Wasserstein distance as a domain alignment term, and lead to theoretically motivated explicit expressions for the empirical source loss weights. Inspired by these bounds, we devise a practical algorithm for PDA, termed WARMPOT. Through extensive numerical experiments, we show that WARMPOT is competitive with recent approaches, and that our proposed weights improve on existing schemes.
Paperid:1578
Authors:Jinzhao Li · Nan Jiang · Yexiang Xue
Abstract:
Satisfiability Modulo Counting (SMC) is a recently proposed general language to reason about problems integrating statistical and symbolic artificial intelligence. An SMC formula is an extended SAT formula in which the truth values of a few Boolean variables are determined by probabilistic inference. Existing approximate solvers optimize surrogate objectives which lack formal guarantees. Current exact solvers directly integrate SAT solvers and probabilistic inference solvers, resulting in slow performance because of many back-and-forth invocations of both solvers. We propose KOCO-SMC, an integrated exact SMC solver that efficiently tracks lower and upper bounds in the probabilistic inference process. It enhances computational efficiency by enabling early estimation of probabilistic inference using only partial variable assignments, whereas existing methods require full variable assignments. In experiments, we compare KOCO-SMC with currently available approximate and exact SMC solvers on large-scale datasets and real-world applications. Our approach delivers high-quality solutions with high efficiency.
Paperid:1579
Authors:Yusheng Zhao · Chi Zhang · Yuxuan Du
Abstract:
Characterizing the ground state properties of quantum systems is fundamental to capturing their behavior but computationally challenging. Recent advances in AI have introduced novel approaches, with diverse machine learning (ML) and deep learning (DL) models proposed for this purpose. However, the necessity and specific role of DL models in these tasks remain unclear, as prior studies often employ varied or impractical quantum resources to construct datasets, resulting in unfair comparisons. To address this, we systematically benchmark DL models against traditional ML approaches across three families of Hamiltonians, scaling up to $127$ qubits in three crucial ground-state learning tasks while enforcing equivalent quantum resource usage. Our results reveal that ML models often achieve performance comparable to or even exceeding that of DL approaches across all tasks. Furthermore, a randomization test demonstrates that measurement input features have minimal impact on DL models' prediction performance. These findings challenge the necessity of current DL models in many quantum system learning scenarios and provide valuable insights into their effective utilization.
Paperid:1580
Authors:Qianqian Wang · Mengping Jiang · Zhengming Ding · Quanxue Gao
Abstract:
K-means clustering is widely recognized as an effective unsupervised learning method due to its simplicity and computational efficiency. However, it faces notable challenges, including sensitivity to initial centroid selection, a limited ability to discern the intrinsic manifold structures within nonlinear datasets, and difficulty in achieving balanced clustering in practical scenarios. To surmount these obstacles, we introduce a novel framework for K-means that leverages manifold learning principles. This approach eliminates the need for centroid initialization and utilizes a label matrix to generate a similarity matrix, thereby aligning manifold structures with class labels and enhancing clustering accuracy and robustness. Beyond the traditional Euclidean distance, our model incorporates Gaussian kernel distance, K-nearest neighbor distance, and low-pass filtering distance to effectively manage data that is not linearly separable. Furthermore, we introduce the $\ell_{2,1}$-norm regularizer to promote balanced clustering outcomes. Comprehensive experimental results substantiate the efficacy of our proposed methodology.
Paperid:1581
Authors:Chenning Yu · Sicun Gao
Abstract:
We introduce a novel resampling criterion using lift scores for improving compositional generation in diffusion models. By leveraging lift scores, we evaluate whether generated samples align with each single condition and then compose the results to determine whether the composed prompt is satisfied. Our key insight is that lift scores can be efficiently approximated using only the original diffusion model, requiring no additional training or external modules. We develop an optimized variant that achieves minimal computational overhead during inference while maintaining effectiveness. Through extensive experiments, we demonstrate that lift scores significantly improve compositional generation across 2D synthetic data, CLEVR position tasks, and text-to-image synthesis.
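The resampling criterion described here, keep a candidate only when its lift score lift(c, x) = p(x|c)/p(x) exceeds 1 for every condition in the composed prompt, can be sketched with explicit toy densities. The paper approximates lifts with the diffusion model itself; the Gaussians, condition locations, and proposal sampler below are assumptions chosen only to make the accept/reject mechanic concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# Toy densities: base p(x) is standard normal per coordinate; condition
# c1 pins x[0] near 2 and condition c2 pins x[1] near -2.
def log_lift(x):
    base = log_normal_pdf(x, 0.0, 1.0)
    cond = log_normal_pdf(x, np.array([2.0, -2.0]), 0.5)
    return cond - base  # per-condition log p(x|c_i) - log p(x)

# Resampling: keep the first candidate whose lift exceeds 1
# (log-lift > 0) for *every* condition in the composed prompt.
def sample():
    while True:
        x = rng.normal(loc=[2.0, -2.0], scale=1.0)  # stand-in composed sampler
        if np.all(log_lift(x) > 0):
            return x

x = sample()
```

Accepted samples necessarily sit in the region where both conditions raise the density above the base model, which is the compositional guarantee the criterion is after.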
Paperid:1582
Authors:Zheng Lian · Haiyang Sun · Licai Sun · Haoyu Chen · Lan Chen · Hao Gu · Zhuofan Wen · Shun Chen · Zhang Siyuan · Hailiang Yao · Bin Liu · Rui Liu · Shan Liang · Ya Li · Jiangyan Yi · Jianhua Tao
Abstract:
Multimodal Emotion Recognition (MER) is a critical research area that seeks to decode human emotions from diverse data modalities. However, existing machine learning methods predominantly rely on predefined emotion taxonomies, which fail to capture the inherent complexity, subtlety, and multi-appraisal nature of human emotional experiences, as demonstrated by studies in psychology and cognitive science. To overcome this limitation, we advocate for introducing the concept of open vocabulary into MER. This paradigm shift aims to enable models to predict emotions beyond a fixed label space, accommodating a flexible set of categories to better reflect the nuanced spectrum of human emotions. To achieve this, we propose a novel paradigm: Open-Vocabulary MER (OV-MER), which enables emotion prediction without being confined to predefined spaces. However, constructing a dataset that encompasses the full range of emotions for OV-MER is practically infeasible; hence, we present a comprehensive solution including a newly curated database, novel evaluation metrics, and a preliminary benchmark. By advancing MER from basic emotions to more nuanced and diverse emotional states, we hope this work can inspire the next generation of MER, enhancing its generalizability and applicability in real-world scenarios.
Paperid:1583
Authors:Niklas Pfister · Václav Volhejn · Manuel Knott · Santiago Arias · Julia Bazinska · Mykhailo Bichurin · Alan Commike · Janet Darling · Peter Dienes · Matthew Fiedler · David Haber · Matthias Kraft · Marco Lancini · Max Mathys · Damian Pascual-Ortiz · Jakub Podolak · Adrià Romero-López · Kyriacos Shiarlis · Andreas Signer · Zsolt Terek · Athanasios Theocharis · Daniel Timbrell · Samuel Trautwein · Samuel Watts · Yun-Han Wu · Mateo Rojas-Carulla
Abstract:
Current evaluations of defenses against prompt attacks in large language model (LLM) applications often overlook two critical factors: the dynamic nature of adversarial behavior and the usability penalties imposed on legitimate users by restrictive defenses. We propose DSEC (Dynamic Security Utility Threat Model), which explicitly separates attackers from legitimate users, models multi-step interactions, and expresses the security-utility trade-off in an optimizable form. We further address the shortcomings in existing evaluations by introducing RedCrowd, a crowd-sourced, gamified red-teaming platform designed to generate realistic, adaptive attacks. Using RedCrowd, we collect and release a dataset of 279k prompt attacks. Complemented by benign user data, our analysis reveals the interplay between security and utility, showing that defenses integrated in the LLM (e.g., system prompts) can degrade usability even without blocking requests. We demonstrate that restricted application domains, defense-in-depth, and adaptive defenses are effective strategies for building secure and useful LLM applications.
Paperid:1584
Authors:Chris Dong · Jannik Peters
Abstract:
Multiwinner voting is the study of electing a fixed-size committee given individual agents' preferences over candidates. Most research in this field has been limited to a static setting, with only one election over a fixed set of candidates. However, this approach overlooks the dynamic nature of applications, where candidate sets are subject to change. We extend the study of proportionality in multiwinner voting to dynamic settings, allowing candidates to join or leave the election and demanding that each chosen committee satisfies proportionality without differing too much from the previously selected committee. We consider approval preferences, ranked preferences, and the proportional clustering setting. In these settings, we either give algorithms making few changes or show that such algorithms cannot exist for various proportionality axioms. In particular, we show that such algorithms cannot exist for ranked preferences and provide amortized and exact algorithms for several proportionality notions in the other two settings.
Paperid:1585
Authors:Karn Tiwari · Niladri Dutta · Prathosh AP · N M Anoop Krishnan
Abstract:
Neural operators have emerged as powerful data-driven frameworks for solving Partial Differential Equations (PDEs), offering significant speedups over numerical methods. However, existing neural operators struggle with scalability in high-dimensional spaces, incur high computational costs, and face challenges in capturing continuous and long-range dependencies in PDE dynamics. To address these limitations, we introduce the Latent Mamba Operator (LaMO), which integrates the efficiency of state-space models (SSMs) in latent space with the expressive power of kernel integral formulations in neural operators. We also establish a theoretical connection between SSMs and the kernel integral of neural operators. In extensive experiments across diverse PDE benchmarks on regular grids, structured meshes, and point clouds covering solid and fluid physics datasets, LaMO achieves consistent state-of-the-art (SOTA) performance, with a 32.3\% improvement over existing baselines in solution operator approximation, highlighting its efficacy in modeling complex PDE solutions.
Paperid:1586
Authors:Aleksandr Gushchin · Khaled Abud · Georgii Bychkov · Ekaterina Shumitskaya · Anna Chistyakova · Sergey Lavrushkin · Bader Rasheed · Kirill Malyshev · Dmitriy Vatolin · Anastasia Antsiferova
Abstract:
Modern neural-network-based Image Quality Assessment (IQA) metrics are vulnerable to adversarial attacks, which can be exploited to manipulate search engine rankings, benchmark results, and content quality assessments, raising concerns about the reliability of IQA metrics in critical applications. This paper presents the first comprehensive study of IQA defense mechanisms in response to adversarial attacks on these metrics, to pave the way for safer use of IQA metrics. We systematically evaluated 30 defense strategies, including purification, training-based, and certified methods, and applied 14 adversarial attacks in adaptive and non-adaptive settings to compare these defenses on 9 no-reference IQA metrics. Our proposed benchmark aims to guide the development of IQA defense methods and is open to submissions; the latest results and code are at (link hidden for blind review).
Paperid:1587
Authors:Christian Fabian · Kai Cui · Heinz Koeppl
Abstract:
Large agent networks are abundant in applications and nature and pose difficult challenges in the field of multi-agent reinforcement learning (MARL) due to their computational and theoretical complexity. While graphon mean field games and their extensions provide efficient learning algorithms for dense and moderately sparse agent networks, the case of realistic sparser graphs remains largely unsolved. Thus, we propose a novel mean field control model inspired by local weak convergence to include sparse graphs such as power law networks with coefficients above two. Besides a theoretical analysis, we design scalable learning algorithms which apply to the challenging class of graph sequences with finite first moment. We compare our model and algorithms for various examples on synthetic and real-world networks with mean field algorithms based on Lp graphons and graphexes. As it turns out, our approach outperforms existing methods in many examples and on various networks, owing to its special design aimed at an important but so far hard-to-solve class of MARL problems.
Paperid:1588
Authors:Leon Götz · Marcel Kollovieh · Stephan Günnemann · Leo Schwinn
Abstract:
Despite recent advances in subquadratic attention mechanisms or state-space models, processing long token sequences still imposes significant computational requirements. Token merging has emerged as a solution to increase computational efficiency in computer vision architectures. In this work, we perform the first investigations of token merging in time series analysis on both transformers and state-space models. We further introduce local merging, a domain-specific token merging algorithm that selectively combines tokens within a local neighborhood, achieving two major benefits: a) Local merging can adjust its computational complexity from quadratic to linear based on the neighborhood size to effectively scale to long sequences; b) Local merging is the first causal merging scheme enabling token merging in transformer decoders. Further, we identify spectral properties of the input data that reliably predict the potential benefits of local merging without requiring evaluation on downstream tasks. Our comprehensive empirical evaluation demonstrates that local merging offers substantial efficiency gains with minimal impact on accuracy, achieving up to 5400% acceleration on the recently proposed Chronos foundation model.
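The local-merging idea as described, combining tokens only within a local neighborhood so that merging stays causal, can be sketched in a few lines. The pair-selection rule (one merge of the most cosine-similar adjacent pair per window, by averaging) is an assumption for illustration; the paper's actual algorithm may select and weight pairs differently.

```python
import numpy as np

def local_merge(tokens, window=4):
    """Merge the most cosine-similar adjacent token pair inside each
    local window (one merge per window), so the sequence shrinks by
    len(tokens) // window tokens. Merging never looks across window
    borders, which is what keeps the scheme causal."""
    out = []
    for start in range(0, len(tokens), window):
        block = tokens[start:start + window]
        if len(block) < 2:
            out.extend(block)
            continue
        unit = block / np.linalg.norm(block, axis=1, keepdims=True)
        sims = (unit[:-1] * unit[1:]).sum(axis=1)   # adjacent cosine sims
        i = int(np.argmax(sims))
        pair_mean = (block[i] + block[i + 1]) / 2   # average the closest pair
        out.extend([*block[:i], pair_mean, *block[i + 2:]])
    return np.stack(out)

tokens = np.random.default_rng(0).normal(size=(8, 16))
merged = local_merge(tokens, window=4)
```

Cost is linear in sequence length for fixed window size, since each window is processed independently, which matches the quadratic-to-linear trade-off the abstract describes.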
Paperid:1589
Authors:Zi-Hao Zhou · Jun-Jie Wang · Tong Wei · Min-Ling Zhang
Abstract:
Contrastive learning has achieved remarkable success in learning effective representations, with supervised contrastive learning often outperforming self-supervised approaches. However, in real-world scenarios, data annotations are often ambiguous or inaccurate, meaning that class labels may not reliably indicate whether two examples belong to the same class. This limitation restricts the applicability of supervised contrastive learning. To address this challenge, we introduce the concept of ``continuous semantic similarity'' to define positive and negative pairs. Instead of directly relying on imprecise class labels, we measure the semantic similarity between example pairs, which quantifies how closely they belong to the same category by iteratively refining weak supervisory signals. Based on this concept, we propose a graph-theoretic framework for weakly-supervised contrastive learning, where semantic similarity serves as the graph weights. Our framework is highly versatile and can be applied to many weakly-supervised learning scenarios. We demonstrate its effectiveness through experiments in two common settings, i.e., noisy label and partial label learning, where existing methods can be easily integrated to significantly improve performance. Theoretically, we establish an error bound for our approach, showing that it can approximate supervised contrastive learning under mild conditions. The supplementary material includes the source code for reproducibility.
Paperid:1590
Authors:Chen-Chia Chang · Wan-Hsuan Lin · Yikang Shen · Yiran Chen · Xin Zhang
Abstract:
Automation of analog topology design is crucial, as the customized requirements of modern applications currently demand heavy manual engineering effort. The state-of-the-art work applies a sequence-to-sequence approach and supervised fine-tuning on language models to generate topologies given user specifications. However, its circuit formulation is inefficient due to $O(|V|^2)$ token length and suffers from low precision sensitivity to numeric inputs. In this work, we introduce LaMAGIC2, a succinct float-input canonical formulation with identifier (SFCI) for language-model-based analog topology generation. SFCI addresses these challenges by improving component-type recognition through identifier-based representations, reducing token-length complexity to $O(|V| + |E|)$, and enhancing numeric precision sensitivity for better performance under tight tolerances. Our experiments demonstrate that LaMAGIC2 achieves 34\% higher success rates under a tight tolerance of 0.01 and 10X lower MSEs compared to a prior method. LaMAGIC2 also exhibits better transferability to circuits with more vertices, with up to 58.5\% improvement. These advancements establish LaMAGIC2 as a robust framework for analog topology generation.
Paperid:1591
Authors:Wonsuk Jang · Thierry Tambe
Abstract:
The rapidly increasing size of large language models (LLMs) presents significant challenges in memory usage and computational costs. Quantizing both weights and activations can address these issues, with hardware-supported fine-grained scaling emerging as a promising solution to mitigate outliers. However, existing methods struggle to capture nuanced block data distributions. We propose BlockDialect, a block-wise fine-grained mixed-format technique that assigns a per-block optimal number format from a formatbook for better data representation. Additionally, we introduce DialectFP4, a formatbook of FP4 variants (akin to dialects) that adapt to diverse data distributions. To leverage this efficiently, we propose a two-stage approach for online DialectFP4 activation quantization. Importantly, DialectFP4 ensures energy efficiency by selecting representable values as scaled integers compatible with low-precision integer arithmetic. BlockDialect achieves 10.78% (7.48%) accuracy gain on the LLaMA3-8B (LLaMA2-7B) model compared to the MXFP4 format with lower bit usage per data, while being only 5.45% (2.69%) below full precision even when quantizing full-path matrix multiplication. Focusing on how to represent rather than how to scale, our work presents a promising path for energy-efficient LLM inference.
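The per-block mechanic described here, scale each block and pick whichever number format in a formatbook reconstructs it best, can be sketched directly. The two value grids below are hypothetical stand-ins (the paper's DialectFP4 variants are not reproduced here); only the format-selection logic is the point.

```python
import numpy as np

# Hypothetical two-entry formatbook: positive magnitude grids of two
# 4-bit "dialects" (sign handled separately). These grids are
# illustrative, not the paper's DialectFP4 definitions.
FORMATBOOK = {
    "fp4_like": np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]),
    "int4":     np.arange(8.0),
}

def quantize_with_grid(x, grid):
    """Scale the block to the grid's range, snap each magnitude to the
    nearest grid value, and rescale."""
    scale = np.abs(x).max() / grid.max() + 1e-12
    idx = np.abs(np.abs(x)[:, None] / scale - grid).argmin(axis=1)
    return np.sign(x) * grid[idx] * scale

def block_dialect(x, block=32):
    """Per block, keep the format whose reconstruction error is lowest."""
    out = np.empty_like(x)
    for s in range(0, x.size, block):
        blk = x[s:s + block]
        cands = [quantize_with_grid(blk, g) for g in FORMATBOOK.values()]
        out[s:s + block] = min(cands, key=lambda q: np.mean((q - blk) ** 2))
    return out

x = np.random.default_rng(0).normal(size=128)
xq = block_dialect(x)
```

Blocks with outliers tend to prefer the non-uniform grid (more levels near zero), while well-spread blocks prefer the uniform one, which is the "dialect" adaptation the abstract describes.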
Paperid:1592
Authors:Sangyeon Park · Isaac Han · Seungwon Oh · KyungJoong Kim
Abstract:
Plasticity loss, a critical challenge in neural network training, limits a model's ability to adapt to new tasks or shifts in data distribution. While widely used techniques like L2 regularization and Layer Normalization have proven effective in mitigating this issue, Dropout remains notably ineffective. This paper introduces AID (Activation by Interval-wise Dropout), a novel method inspired by Dropout, designed to address plasticity loss. Unlike Dropout, AID generates subnetworks by applying Dropout with different probabilities on each pre-activation interval. Theoretical analysis reveals that AID regularizes the network, promoting behavior analogous to that of deep linear networks, which do not suffer from plasticity loss. We validate the effectiveness of AID in maintaining plasticity across various benchmarks, including continual learning tasks on standard image classification datasets such as CIFAR10, CIFAR100, and TinyImageNet. Furthermore, we show that AID enhances reinforcement learning performance in the Arcade Learning Environment benchmark.
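"Dropout with different probabilities on each pre-activation interval" admits a compact sketch. The two-interval split (h > 0 vs. h <= 0) and the specific rates below are assumptions for illustration, not the paper's configuration; note that rates (0, 1) would recover ReLU exactly, and equal rates recover plain Dropout on the identity.

```python
import numpy as np

def aid(h, p=(0.1, 0.8), rng=None):
    """Interval-wise dropout (illustrative two-interval variant):
    drop positive pre-activations with rate p[0] and non-positive
    ones with rate p[1]. Inverted-dropout scaling keeps E[out] = h."""
    rng = rng if rng is not None else np.random.default_rng()
    drop = np.where(h > 0, p[0], p[1])       # per-element drop rate
    keep = rng.random(h.shape) >= drop       # Bernoulli keep mask
    return h * keep / (1.0 - drop)           # rescale per interval

rng = np.random.default_rng(0)
h = rng.normal(size=(100_000,))
out = aid(h, rng=rng)
```

Because each interval is rescaled by its own keep probability, the layer is unbiased in expectation while the random subnetworks it induces behave piecewise-linearly, loosely mirroring the deep-linear-network analogy in the abstract.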
Paperid:1593
Authors:Mara Finkelstein · Daniel Deutsch · Parker Riley · Juraj Juraska · Geza Kovacs · Markus Freitag
Abstract:
As LLMs continue to become more powerful and versatile, human evaluation has become intractable at scale and reliance on automatic metrics has become the norm. Recently, it has been shown that LLMs are themselves state-of-the-art evaluators for many tasks. These Autoraters are typically designed so that they generalize to new systems and test sets. In practice, however, evaluation is performed on a small set of fixed, canonical test sets, which are carefully curated to measure the capabilities of interest and are not changed frequently. In this work, we design a method which specializes a prompted Autorater to a given test set, by leveraging historical ratings on the test set to construct in-context learning (ICL) examples. We evaluate our Specialist method on the task of fine-grained machine translation evaluation, and show that it dramatically outperforms the state-of-the-art XCOMET metric by 54% and 119% on the WMT'23 and WMT'24 test sets, respectively. We perform extensive analyses to understand the representations learned by our Specialist metrics, and how variability in rater behavior affects their performance. We also verify the generalizability and robustness of our Specialist method across different numbers of ICL examples, LLM backbones, systems to evaluate, and evaluation tasks.
Paperid:1594
Authors:Xiaoqian Shen · Yunyang Xiong · Changsheng Zhao · Lemeng Wu · Jun Chen · Chenchen Zhu · Zechun Liu · Fanyi Xiao · Balakrishnan Varadarajan · Florian Bordes · Zhuang Liu · Hu Xu · Hyunwoo Kim · Bilge Soran · Raghuraman Krishnamoorthi · Mohamed Elhoseiny · Vikas Chandra
Abstract:
Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by the limited context length. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism to reduce the number of video tokens while preserving visual details of long videos. Our idea is based on leveraging cross-modal query and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. Then we utilize text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within limited context length. Our LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a light-weight LLM, our LongVU also scales effectively into a smaller size with state-of-the-art video understanding performance. Our code will be made publicly available.
Paperid:1595
Authors:Gouki Minegishi · Hiroki Furuta · Shohei Taniguchi · Yusuke Iwasawa · Yutaka Matsuo
Abstract:
Transformer-based language models exhibit In-Context Learning (ICL), where predictions are made adaptively based on context. While prior work links induction heads to ICL through phase transitions, this can only account for ICL when the answer is included within the context. However, an important property of practical ICL in large language models is the ability to meta-learn how to solve tasks from context, rather than just copying answers from context; how such an ability is obtained during training is largely unexplored. In this paper, we experimentally clarify how such meta-learning ability is acquired by analyzing the dynamics of the model’s circuits during training, extending the copy task from previous research to an In-Context Meta Learning setting, where models must infer tasks from examples to answer queries. Interestingly, in this setting, we find that there are multiple phases in the process of acquiring such abilities, and that a unique circuit emerges in each phase, contrasting with the single-phase transition in induction heads. The emergence of such circuits can be related to several phenomena known in large language models, and our analysis leads to a deeper understanding of the source of the Transformer’s ICL ability.
Paperid:1596
Authors:Shoukai Xu · ZihaoLian · Mingkui Tan · Liu Liu · Zhong Zhang · Peilin Zhao
Abstract:
Offline reinforcement learning is widely applied in multiple fields due to its advantages in efficiency and risk control. However, a major problem it faces is the distribution shift between offline datasets and online environments. This mismatch leads to out-of-distribution (OOD) state-action pairs that fall outside the scope of the training data. Therefore, existing conservative training policies may not provide reliable decisions when the test environment deviates greatly from the offline dataset. In this paper, we propose Test-time Adapted Reinforcement Learning (TARL) to address this problem. TARL constructs unsupervised test-time optimization objectives for discrete and continuous control tasks, using test data without depending on environmental rewards. In discrete control tasks, it minimizes the entropy of predicted action probabilities to decrease uncertainty and avoid OOD state-action pairs. For continuous control tasks, it represents and minimizes action uncertainty based on the normal distribution of policy network outputs. Moreover, to prevent model bias caused by overfitting and error accumulation during the test-time update process, TARL enforces a KL divergence constraint between the fine-tuned policy and the original policy. For efficiency, TARL only updates the layer normalization layer parameters during testing. Extensive experiments on popular Atari game benchmarks and the D4RL dataset demonstrate the superiority of our method, which achieved a significant improvement over CQL, with a relative increase of 7.1 in episode return on the walker2d-medium-v2 task.
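The discrete-control objective described here, minimize the entropy of the predicted action distribution at test time, with no reward signal, can be shown in miniature. A single scale parameter stands in for the layer-norm gain TARL adapts; the logits, step size, and finite-difference gradient are assumptions for the toy, and the paper's KL constraint to the original policy is omitted.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum()

# Toy "policy head": one scale parameter standing in for the layer-norm
# gain that is the only thing updated at test time.
logits = np.array([1.0, 0.5, 0.2, -0.3])
scale = 1.0
loss = lambda s: entropy(softmax(s * logits))

# One unsupervised test-time step: gradient descent on the entropy of
# the action distribution (finite-difference gradient, no reward used).
grad = (loss(scale + 1e-4) - loss(scale - 1e-4)) / 2e-4
scale_new = scale - 0.5 * grad
```

The step sharpens the action distribution (entropy drops), which is exactly the uncertainty reduction used to steer the policy away from OOD state-action pairs.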
Paperid:1597
Authors:Yiming Chen · yuan zhang · Yin Liu · Kun Yuan · Zaiwen Wen
Abstract:
The memory challenges associated with training Large Language Models (LLMs) have become a critical concern, particularly when using the Adam optimizer. To address this issue, numerous memory-efficient techniques have been proposed, with GaLore standing out as a notable example designed to reduce the memory footprint of optimizer states. However, these approaches do not alleviate the memory burden imposed by activations, rendering them unsuitable for scenarios involving long context sequences or large mini-batches. Moreover, their convergence properties are still not well-understood in the literature. In this work, we introduce a Randomized Subspace Optimization framework for pre-training and fine-tuning LLMs. Our approach decomposes the high-dimensional training problem into a series of lower-dimensional subproblems. At each iteration, a random subspace is selected, and the parameters within that subspace are optimized. This structured reduction in dimensionality allows our method to simultaneously reduce memory usage for both activations and optimizer states. We establish comprehensive convergence guarantees and derive rates for various scenarios, accommodating different optimization strategies to solve the subproblems. Extensive experiments validate the superior memory and communication efficiency of our method, achieving performance comparable to GaLore and Adam.
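The iteration described, pick a random subspace, optimize only the parameters within it, repeat, can be sketched on a least-squares stand-in for the training loss. The problem, subspace dimension, and exact subproblem solve below are assumptions for illustration; the paper applies the idea to LLM training with various subproblem optimizers.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 5                      # full dimension vs. subspace dimension
A = rng.normal(size=(n, n))
b = rng.normal(size=n)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)   # stand-in training loss

x = np.zeros(n)
for _ in range(30):
    P = rng.normal(size=(n, k))   # random k-dimensional subspace basis
    # Solve the k-dimensional subproblem min_z f(x + P z) exactly;
    # only k-dimensional quantities need optimizer state.
    z, *_ = np.linalg.lstsq(A @ P, b - A @ x, rcond=None)
    x = x + P @ z
```

Each subproblem solve can only decrease the loss, and cycling through fresh random subspaces gradually covers the full parameter space, which is the intuition behind the framework's convergence guarantees.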
Paperid:1598
Authors:Dongwoo Lee · Dong Bok Lee · Steven Adriaensen · Juho Lee · Sung Ju Hwang · Frank Hutter · Seon Joo Kim · Hae Beom Lee
Abstract:
Scaling has been a major driver of recent advancements in deep learning. Numerous empirical studies have found that scaling laws often follow a power law and have proposed several variants of power-law functions to predict the scaling behavior at larger scales. However, existing methods mostly rely on point estimation and do not quantify uncertainty, which is crucial for real-world applications involving decision-making problems such as determining the expected performance improvements achievable by investing additional computational resources. In this work, we explore a Bayesian framework based on Prior-data Fitted Networks (PFNs) for neural scaling law extrapolation. Specifically, we design a prior distribution that enables the sampling of infinitely many synthetic functions resembling real-world neural scaling laws, allowing our PFN to meta-learn the extrapolation. We validate the effectiveness of our approach on real-world neural scaling laws, comparing it against both existing point estimation methods and Bayesian approaches. Our method demonstrates superior performance, particularly in data-limited scenarios such as Bayesian active learning, underscoring its potential for reliable, uncertainty-aware extrapolation in practical applications.
Paperid:1599
Authors:Jaesik Yoon · Hyeonseo Cho · Doojin Baek · Yoshua Bengio · Sungjin Ahn
Abstract:
Diffusion models have recently emerged as a powerful tool for planning. However, unlike Monte Carlo Tree Search (MCTS)—whose performance naturally improves with additional test-time computation (TTC)—standard diffusion-based planners offer only limited avenues for TTC scalability. In this paper, we introduce Monte Carlo Tree Diffusion (MCTD), a novel framework that integrates the generative strength of diffusion models with the adaptive search capabilities of MCTS. Our method reconceptualizes denoising as a tree-structured process, allowing partially denoised plans to be iteratively evaluated, pruned, and refined. By selectively expanding promising trajectories while retaining the flexibility to revisit and improve suboptimal branches, MCTD achieves the benefits of MCTS, such as controlling exploration-exploitation trade-offs, within the diffusion framework. Empirical results on challenging long-horizon tasks show that MCTD outperforms diffusion baselines, yielding higher-quality solutions as TTC increases.
Paperid:1600
Authors:Da Long · Zhitong Xu · Guang Yang · Akil Narayan · Shandian Zhe
Abstract:
Modern physics simulation often involves multiple functions of interest, and traditional numerical approaches are known to be complex and computationally costly. While machine learning-based surrogate models can offer significant cost reductions, most focus on a single task, such as forward prediction, and typically lack uncertainty quantification --- an essential component in many applications. To overcome these limitations, we propose Arbitrarily-Conditioned Multi-Functional Diffusion (ACM-FD), a versatile probabilistic surrogate model for multi-physics emulation. ACM-FD can perform a wide range of tasks within a single framework, including forward prediction, various inverse problems, and simulating data for entire systems or subsets of quantities conditioned on others. Specifically, we extend the standard Denoising Diffusion Probabilistic Model (DDPM) for multi-functional generation by modeling noise as Gaussian processes (GPs). We propose a random-mask-based, zero-regularized denoising loss to achieve flexible and robust conditional generation. We induce a Kronecker product structure in the GP covariance matrix, substantially reducing the computational cost and enabling efficient training and sampling. We demonstrate the effectiveness of ACM-FD across several fundamental multi-physics systems.
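The computational saving from a Kronecker-structured covariance rests on the standard identity (A ⊗ B) vec(X) = vec(B X Aᵀ), which avoids ever materializing the full matrix. A small numerical check of the identity (illustrative only; the paper's actual GP covariances are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((5, 5))
X = rng.standard_normal((5, 4))

# Kronecker identity: (A kron B) vec(X) = vec(B X A^T),
# where vec() stacks columns (Fortran order).
full = np.kron(A, B) @ X.ravel(order="F")   # forms the 20x20 matrix
fast = (B @ X @ A.T).ravel(order="F")       # never forms it
assert np.allclose(full, fast)
```

The left-hand side costs O((mn)^2) to apply; the right-hand side only two small matrix products, which is what makes Kronecker-structured training and sampling tractable.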
Paperid:1601
Authors:Xinxing Shi · Xiaoyu Jiang · Mauricio Álvarez
Abstract:
Gaussian Process (GP) Variational Autoencoders (VAEs) extend standard VAEs by replacing the fully factorised Gaussian prior with a GP prior, thereby capturing richer correlations among latent variables. However, performing exact GP inference in large-scale GPVAEs is computationally prohibitive, often forcing existing approaches to rely on restrictive kernel assumptions or large sets of inducing points. In this work, we propose a neighbour-driven approximation strategy that exploits local adjacencies in the latent space to achieve scalable GPVAE inference. By confining computations to the nearest neighbours of each data point, our method preserves essential latent dependencies, allowing more flexible kernel choices and mitigating the need for numerous inducing points. Through extensive experiments on tasks including representation learning, data imputation, and conditional generation, we demonstrate that our approach outperforms other GPVAE variants in both predictive performance and computational efficiency.
Paperid:1602
Authors:Wei Liu · Zhongyu Niu · Lang Gao · Zhiying Deng · Jun Wang · Haozhao Wang · Ruixuan Li
Abstract:
This study investigates the self-rationalization framework constructed with a cooperative game, where a generator initially extracts the most informative segment from raw input, and a subsequent predictor utilizes the selected subset as its input. The generator and predictor are trained collaboratively to maximize prediction accuracy. In this paper, we first uncover a potential caveat: such a cooperative game could unintentionally introduce a sampling bias during rationale extraction. Specifically, the generator might inadvertently create an incorrect correlation between the selected rationale candidate and the label, even when they are semantically unrelated in the original dataset. Subsequently, we elucidate the origins of this bias using both detailed theoretical analysis and empirical evidence. Our findings suggest a direction for inspecting these correlations through attacks, based on which we further introduce an instruction to prevent the predictor from learning the correlations. Through experiments on six text classification datasets and two graph classification datasets using three network architectures (GRUs, BERT, and GCN), we show that our method significantly outperforms recent rationalization methods.
Paperid:1603
Authors:Virginia Aglietti · Ira Ktena · Jessica Schrouff · Eleni Sgouritsa · Francisco Ruiz · Alan Malek · Alexis Bellot · Silvia Chiappa
Abstract:
The sample efficiency of Bayesian optimization algorithms depends on carefully crafted acquisition functions (AFs) guiding the sequential collection of function evaluations. The best-performing AF can vary significantly across optimization problems, often requiring ad-hoc and problem-specific choices. This work tackles the challenge of designing novel AFs that perform well across a variety of experimental settings. Based on FunSearch, a recent work using Large Language Models (LLMs) for discovery in the mathematical sciences, we propose FunBO, an LLM-based method that can be used to learn new AFs written in computer code by leveraging access to a limited number of evaluations for a set of objective functions. We provide the analytic expression of all discovered AFs and evaluate them on various global optimization benchmarks and hyperparameter optimization tasks. We show how FunBO identifies AFs that generalize well in and out of the training distribution of functions, thus outperforming established general-purpose AFs and achieving competitive performance against AFs that are customized to specific function types and are learned via transfer-learning algorithms.
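For context, the kind of hand-designed AF that FunBO competes with can itself be written in a few lines of code. Below is closed-form Expected Improvement for maximization, a standard general-purpose AF; this is background illustration, not one of FunBO's discovered functions.

```python
import math

def expected_improvement(mu, sigma, best):
    """Closed-form EI for maximization: (mu - best) * Phi(z) + sigma * phi(z),
    with z = (mu - best) / sigma, given a Gaussian posterior N(mu, sigma^2)."""
    if sigma <= 0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))
    return (mu - best) * cdf + sigma * pdf

# Higher posterior mean (same uncertainty) gives higher EI.
assert expected_improvement(1.0, 0.5, 0.0) > expected_improvement(0.2, 0.5, 0.0)
# EI stays nonnegative even when the mean is below the incumbent.
assert expected_improvement(-1.0, 0.5, 0.0) >= 0.0
```

FunBO searches over programs of exactly this shape: short functions mapping posterior statistics to an acquisition score.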
Paperid:1604
Authors:Yitian Zhang · Liheng Ma · Antonios Valkanas · Boris Oreshkin · Mark Coates
Abstract:
Koopman operator theory provides a framework for nonlinear dynamical system analysis and time series forecasting by mapping dynamics to a space of real-valued measurement functions, enabling a linear operator representation. Despite the advantage of linearity, the operator is generally infinite-dimensional. Therefore, the objective is to learn measurement functions that yield a tractable finite-dimensional Koopman operator approximation. In this work, we establish a connection between Koopman operator approximation and linear Recurrent Neural Networks (RNNs), which have recently demonstrated remarkable success in sequence modeling. We show that by considering an extended state consisting of lagged observations, we can establish an equivalence between a structured Koopman operator and linear RNN updates. Building on this connection, we present SKOLR, which integrates a learnable spectral decomposition of the input signal with a multilayer perceptron (MLP) as the measurement functions and implements a structured Koopman operator via a highly parallel linear RNN stack. Numerical experiments on various forecasting benchmarks and dynamical systems show that this streamlined, Koopman-theory-based design delivers exceptional performance.
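The Koopman/linear-RNN equivalence can be seen on a classic toy system: with hand-picked measurement functions spanning a Koopman-invariant subspace, advancing the features is exactly a linear recurrence z_{t+1} = K z_t. This sketch uses our own toy dynamics and hand-chosen features, not SKOLR's learned measurement functions.

```python
import numpy as np

mu, lam, c = 0.9, 0.5, 0.3

def step(x1, x2):
    # Nonlinear dynamics: the x1**2 term makes the system nonlinear in (x1, x2).
    return mu * x1, lam * x2 + c * x1 ** 2

def phi(x1, x2):
    # Measurement functions spanning a Koopman-invariant subspace.
    return np.array([x1, x2, x1 ** 2])

# Finite, structured Koopman matrix: advancing phi is a linear RNN update.
K = np.array([[mu, 0.0, 0.0],
              [0.0, lam, c],
              [0.0, 0.0, mu ** 2]])

x1, x2 = 1.2, -0.7
z = phi(x1, x2)
for _ in range(4):
    x1, x2 = step(x1, x2)   # nonlinear evolution in state space
    z = K @ z               # linear evolution in measurement space
assert np.allclose(z, phi(x1, x2))
```

The linear recurrence over `z` reproduces the nonlinear trajectory exactly because the chosen features close under the dynamics; SKOLR learns such features rather than hand-picking them.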
Paperid:1605
Authors:Jiahua Rao · Dahao Xu · Wentao Wei · Yicong Chen · Mingjun Yang · Yuedong Yang
Abstract:
While Graph Neural Networks (GNNs) and Transformers have shown promise in predicting molecular properties, they struggle with directly modeling complex many-body interactions. Current methods often implicitly approximate these interactions in message passing, such as three- and four-body terms, whereas attention-based models, though promising for enabling direct communication between atoms, typically handle at most triplets of nodes, making higher-order interactions computationally challenging. To address these limitations, we introduce MABNet, a geometric attention framework designed to model four-body interactions by facilitating direct communication among atomic quartets. This approach bypasses the computational bottlenecks associated with traditional triplet-based attention mechanisms, allowing for the efficient handling of higher-order interactions. MABNet achieves state-of-the-art performance on benchmarks like MD22 and SPICE. These improvements underscore its capability to accurately capture intricate many-body interactions in large molecules. By unifying rigorous many-body physics with computational efficiency, MABNet advances molecular simulations for applications in drug design and materials discovery, while its extensible framework paves the way for modeling higher-order quantum effects.
Paperid:1606
Authors:Geonhui Yoo · Minhak Song · Chulhee Yun
Abstract:
When training deep neural networks with gradient descent, sharpness often increases---a phenomenon known as progressive sharpening---before saturating at the edge of stability. Although commonly observed in practice, the underlying mechanisms behind progressive sharpening remain poorly understood. In this work, we study this phenomenon using a minimalist model: a deep linear network with a single neuron per layer. We show that this simple model effectively captures the sharpness dynamics observed in recent empirical studies, offering a simple testbed to better understand neural network training. Moreover, we theoretically analyze how dataset properties, network depth, stochasticity of optimizers, and step size affect the degree of progressive sharpening in the minimalist model. We then empirically demonstrate how these theoretical insights extend to practical scenarios. This study offers a deeper understanding of sharpness dynamics in neural network training, highlighting the interplay between depth, training data, and optimizers.
Paperid:1607
Authors:Vivek Farias · Joren Gijsbrechts · Aryan Khojandi · Tianyi Peng · Andrew Zheng
Abstract:
Simulating a single trajectory of a dynamical system under some state-dependent policy is a core bottleneck in policy optimization (PO) algorithms. The many inherently serial policy evaluations that must be performed in a single simulation constitute the bulk of this bottleneck. In applying PO to supply chain optimization (SCO) problems, simulating a single sample path corresponding to one month of a supply chain can take several hours. We present an iterative algorithm to accelerate policy simulation, dubbed Picard Iteration. This scheme carefully assigns policy evaluation tasks to independent processes. Within an iteration, any given process evaluates the policy only on its assigned tasks while assuming a certain ‘cached’ evaluation for other tasks; the cache is updated at the end of the iteration. Implemented on GPUs, this scheme admits batched evaluation of the policy across a single trajectory. We prove that the structure afforded by many SCO problems allows convergence in a small number of iterations independent of the horizon. We demonstrate practical speedups of 400x on large-scale SCO problems even with a single GPU, and also demonstrate practical efficacy in other RL environments.
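The caching scheme can be illustrated with a toy fixed-point rollout: each iteration makes one batched policy call on the cached states, then cheaply re-rolls the trajectory with the cached actions; the cache converges to the serial rollout. This is a minimal sketch with invented toy dynamics, not the paper's supply-chain simulator.

```python
import numpy as np

def step(state, action):
    # Toy deterministic dynamics; stands in for an expensive simulator step.
    return 0.5 * state + action

def policy(states):
    # Batched policy evaluation (the expensive call the scheme parallelizes).
    return np.tanh(states)

def picard_rollout(s0, horizon, iters):
    """Fixed-point iteration: evaluate the policy on cached states in one
    batch, re-roll the trajectory, and repeat until the cache stabilizes."""
    states = np.zeros(horizon + 1)
    states[0] = s0
    for _ in range(iters):
        actions = policy(states[:-1])      # one batched policy call
        new = states.copy()
        for t in range(horizon):           # cheap re-roll with cached actions
            new[t + 1] = step(new[t], actions[t])
        if np.allclose(new, states):
            break
        states = new
    return states

def serial_rollout(s0, horizon):
    states = [s0]
    for _ in range(horizon):
        a = float(policy(np.array([states[-1]]))[0])
        states.append(step(states[-1], a))
    return np.array(states)

picard = picard_rollout(1.0, horizon=10, iters=20)
serial = serial_rollout(1.0, horizon=10)
assert np.allclose(picard, serial, atol=1e-6)
```

Each iteration propagates correct state information at least one step further along the horizon, so the fixed point is reached in at most `horizon` iterations here; the paper's point is that SCO structure makes the required number of iterations far smaller and horizon-independent.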
Paperid:1608
Authors:Hongbin Pei · Jingxin Hai · Yu Li · Huiqi Deng · Denghao Ma · Jie Ma · Pinghui Wang · Jing Tao · Xiaohong Guan
Abstract:
Pseudo-labeling is a widely used strategy in semi-supervised learning. Existing methods typically select predicted labels with high confidence scores and high training stationarity as pseudo-labels to augment training sets. In contrast, this paper explores the pseudo-labeling potential of predicted labels that do not exhibit these characteristics. We discover a new type of predicted labels suitable for pseudo-labeling, termed two-phase labels, which exhibit a two-phase pattern during training: they are initially predicted as one category in early training stages and switch to another category in subsequent epochs. Case studies show the two-phase labels are informative for decision boundaries. To effectively identify the two-phase labels, we design a 2-phasic metric that mathematically characterizes their spatial and temporal patterns. Furthermore, we propose a loss function tailored for two-phase pseudo-labeling learning, allowing models not only to learn correct correlations but also to eliminate false ones. Extensive experiments on eight datasets show that our proposed 2-phasic metric acts as a powerful booster for existing pseudo-labeling methods by additionally incorporating the two-phase labels, achieving an average classification accuracy gain of 1.73% on image datasets and 1.92% on graph datasets.
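A simplified way to see what a two-phase label is: a per-epoch prediction history with exactly one category switch, with both phases lasting a while. The boolean check below is our illustrative stand-in for the paper's continuous 2-phasic metric, which scores spatial as well as temporal patterns.

```python
def is_two_phase(pred_labels, min_phase=2):
    """Detect a two-phase pattern: one category in early epochs, a single
    switch, then another category for the rest of training."""
    switches = [t for t in range(1, len(pred_labels))
                if pred_labels[t] != pred_labels[t - 1]]
    if len(switches) != 1:
        return False
    t = switches[0]
    # Require both phases to persist for at least min_phase epochs.
    return t >= min_phase and len(pred_labels) - t >= min_phase

assert is_two_phase([0, 0, 0, 1, 1, 1])
assert not is_two_phase([0, 1, 0, 1, 0, 1])   # unstable, not two-phase
assert not is_two_phase([2, 2, 2, 2, 2, 2])   # stationary (classic pseudo-label)
```

The contrast in the last two cases is the point: two-phase labels are neither stationary (the usual pseudo-label criterion) nor erratic.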
Paperid:1609
Authors:Corinna Coupette · Jeremy Wayland · Emily Simons · Bastian Rieck
Abstract:
Benchmark datasets have proved pivotal to the success of graph learning, and good benchmark datasets are crucial to guide the development of the field. Recent research has highlighted problems with graph-learning datasets and benchmarking practices—revealing, for example, that methods which ignore the graph structure can outperform graph-based approaches on popular benchmark datasets. Such findings raise two questions: (1) What makes a good graph-learning dataset, and (2) how can we evaluate dataset quality in graph learning? Our work addresses these questions. As the classic evaluation setup uses datasets to evaluate models, it does not apply to dataset evaluation. Hence, we start from first principles. Observing that graph-learning datasets uniquely combine two modes—the graph structure and the node features—we introduce RINGS, a flexible and extensible mode-perturbation framework to assess the quality of graph-learning datasets based on dataset ablations—i.e., by quantifying differences between the original dataset and its perturbed representations. Within this framework, we propose two measures—performance separability and mode complementarity—as evaluation tools, each assessing, from a distinct angle, the capacity of a graph dataset to benchmark the power and efficacy of graph-learning methods. We demonstrate the utility of our framework for graph-learning dataset evaluation in an extensive set of experiments and derive actionable recommendations for improving the evaluation of graph-learning methods. Our work opens new research directions in data-centric graph learning, and it constitutes a first step toward the systematic evaluation of evaluations.
Paperid:1610
Authors:Tianlang Chen · Charilaos Kanatsoulis · Jure Leskovec
Abstract:
Predictive tasks on relational databases are critical in real-world applications spanning e-commerce, healthcare, and social media. To address these tasks effectively, Relational Deep Learning (RDL) encodes relational data as graphs, enabling Graph Neural Networks (GNNs) to exploit relational structures for improved predictions. However, existing heterogeneous GNNs often overlook the intrinsic structural properties of relational databases, leading to modeling inefficiencies. Here we introduce RelGNN, a novel GNN framework specifically designed to capture the unique characteristics of relational databases. At the core of our approach is the introduction of atomic routes, which are sequences of nodes forming high-order tripartite structures. Building upon these atomic routes, RelGNN designs new composite message passing mechanisms between heterogeneous nodes, allowing direct single-hop interactions between them. This approach avoids redundant aggregations and mitigates information entanglement, ultimately leading to more efficient and accurate predictive modeling. RelGNN is evaluated on 30 diverse real-world tasks from RelBench (Fey et al., 2024), and consistently achieves state-of-the-art accuracy with up to 25% improvement.
Paperid:1611
Authors:Connor Schenck · Isaac Reid · Mithun Jacob · Alex Bewley · Joshua Ainslie · David Rendleman · Deepali Jain · Mohit Sharma · Kumar Avinava Dubey · Ayzaan Wahid · Sumeet Singh · René Wagner · Tianli Ding · Chuyuan Fu · Arunkumar Byravan · Jacob J Varley · Alexey Gritsenko · Matthias Minderer · Dmitry Kalashnikov · Jonathan Tompson · Vikas Sindhwani · Krzysztof Choromanski
Abstract:
We introduce $\textbf{STRING}$: Separable Translationally Invariant Position Encodings. STRING extends Rotary Position Encodings, a recently proposed and widely used algorithm in large language models, via a unifying theoretical framework. Importantly, STRING still provides $\textbf{exact}$ translation invariance, including for token coordinates of arbitrary dimensionality, whilst maintaining a low computational footprint. These properties are especially important in robotics, where efficient 3D token representation is key. We integrate STRING into Vision Transformers with RGB(D) inputs (color plus optional depth), showing substantial gains, e.g. in open-vocabulary object detection and for robotics controllers. We complement our experiments with a rigorous mathematical analysis, proving the universality of our methods. Videos of STRING-based robotics controllers can be found here: https://sites.google.com/view/string-robotics.
Paperid:1612
Authors:Qingtian Zhu · Yumin Zheng · Yuling Sang · Yifan Zhan · Ziyan Zhu · Jun Ding · Yinqiang Zheng
Abstract:
Spatial Transcriptomics (ST) is a method that captures gene expression profiles aligned with spatial coordinates. The discrete spatial distribution and the super-high dimensional sequencing results make ST data challenging to model effectively. In this paper, we model ST in a continuous and compact manner with the proposed tool, SUICA, empowered by the great approximation capability of Implicit Neural Representations (INRs), which can enhance both the spatial density and the gene expression. Concretely, within the proposed SUICA, we incorporate a graph-augmented Autoencoder to effectively model the context information of the unstructured spots and provide informative, structure-aware embeddings for spatial mapping. We also tackle the extremely skewed distribution in a regression-by-classification fashion and enforce classification-based loss functions for the optimization of SUICA. In extensive experiments across a wide range of common ST platforms under varying degradations, SUICA outperforms both conventional INR variants and SOTA methods regarding numerical fidelity, statistical correlation, and bio-conservation. The predictions of SUICA also showcase amplified gene signatures that enrich the bio-conservation of the raw data and benefit subsequent analysis. The code will be made available.
Paperid:1613
Authors:Mujin Cheon · Dong-Yeun Koh · Jay Lee · Calvin Tsay
Abstract:
Conventional methods for Bayesian optimization (BO) primarily involve one-step optimal decisions (e.g., maximizing expected improvement of the next step). To avoid myopic behavior, multi-step lookahead BO algorithms such as rollout strategies consider the sequential decision-making nature of BO, i.e., as a stochastic dynamic programming (SDP) problem, demonstrating promising results in recent years. However, owing to the curse of dimensionality, most of these methods make significant approximations or suffer scalability issues, e.g., being limited to two-step lookahead. This paper presents a novel reinforcement learning (RL)-based framework for multi-step lookahead BO in high-dimensional black-box optimization problems. The proposed method enhances the scalability and decision-making quality of multi-step lookahead BO by efficiently solving the SDP of the BO process in a near-optimal manner using RL. We first introduce an Attention-DeepSets encoder to represent the state of knowledge to the RL agent and employ off-policy learning to accelerate its initial training. We then propose a multi-task, fine-tuning procedure based on end-to-end (encoder-RL) on-policy learning. We evaluate the proposed method, EARL-BO (Encoder Augmented RL for Bayesian Optimization), on both synthetic benchmark functions and real-world hyperparameter optimization problems, demonstrating significantly improved performance compared to existing multi-step lookahead and high-dimensional BO methods.
Paperid:1614
Authors:Suorong Yang · Peng Ye · Furao Shen · Dongzhan Zhou
Abstract:
Dynamic data selection aims to accelerate training without performance loss. However, reducing training data inherently limits data diversity, potentially hindering generalization. While data augmentation is widely used to enhance diversity, it is typically not optimized in conjunction with selection. As a result, directly combining these techniques fails to fully exploit their synergies. To tackle the challenge, we propose a novel online data training framework that, for the first time, unifies dynamic data selection and augmentation, achieving both training efficiency and enhanced performance. Our method estimates each sample's joint distribution of local density and multimodal semantic consistency, allowing for the targeted selection of augmentation-suitable samples while suppressing the inclusion of noisy or ambiguous data. This enables a more significant reduction in dataset size without sacrificing model generalization. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches on various benchmark datasets and architectures, e.g., reducing training costs by 50% on ImageNet-1k with lossless performance. Furthermore, our approach enhances noise resistance and improves model robustness, reinforcing its practical utility in real-world scenarios.
Paperid:1615
Authors:Jingyang Qiao · zhizhong zhang · Xin Tan · Yanyun Qu · Shouhong Ding · Yuan Xie
Abstract:
Continual Instruction Tuning (CIT) is adopted to continually instruct Large Models to follow human intent, dataset by dataset. Existing gradient updates are observed to heavily degrade performance on previous datasets during the CIT process. In contrast, the Exponential Moving Average (EMA) can track previous parameters, which helps decrease forgetting. Nonetheless, its fixed balance weight cannot cope with ever-changing datasets, leading to an imbalance between plasticity and stability. In this paper, we propose a general continual instruction tuning framework to address the challenge. Starting from the trade-off prerequisite and the EMA update, we propose an ideal condition for plasticity and stability. Based on a Taylor expansion of the loss function, we find that the optimal balance weight can be automatically determined by the gradients and learned parameters. Therefore, we propose a stable-plasticity balanced coefficient to avoid knowledge confusion. Based on the semantic similarity of the instructions, we can determine whether to retrain or expand the training parameters and allocate the most suitable parameters for the testing instances. Extensive experiments across multiple continual instruction tuning benchmarks demonstrate that our approach not only enhances anti-forgetting capabilities but also significantly improves overall continual tuning performance. For example, based on LLaVA-7B, the forgetting is reduced from 5.42 to 1.93. Our code will be made publicly available soon.
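For reference, the EMA update this framework builds on is a one-line recurrence. The sketch below uses a fixed balance weight, whereas the paper derives an adaptive stable-plasticity coefficient from gradients and learned parameters (not reproduced here).

```python
import numpy as np

def ema_update(theta_ema, theta, beta):
    """Exponential moving average of parameters: the EMA traces past
    solutions, which is what reduces forgetting during continual tuning."""
    return beta * theta_ema + (1.0 - beta) * theta

# With a fixed beta, the EMA lags the live parameters (plasticity limited)
# but never discards them entirely (stability preserved).
ema = np.zeros(3)
live = np.ones(3)
for _ in range(50):
    ema = ema_update(ema, live, beta=0.9)
assert np.all(ema > 0.99) and np.all(ema < 1.0)
```

The lag is exactly `beta**steps` of the gap to the live parameters; the paper's contribution is choosing `beta` adaptively per update rather than fixing it.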
Paperid:1616
Authors:Alvaro Rodriguez Abella · João Pedro Silvestre · Paulo Tabuada
Abstract:
A key component of transformers is the attention mechanism orchestrating how each token influences the propagation of every other token along the layers of a transformer. In this paper we provide a rigorous, mathematical analysis of the asymptotic properties of attention in transformers. Although we present several results based on different assumptions, all of them point to the same conclusion: all tokens asymptotically converge to each other, a phenomenon that has been empirically reported in the literature. Our findings are carefully compared with existing theoretical results and illustrated by simulations and experimental studies using the GPT-2 model.
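The collapse phenomenon can be observed even in a stripped-down simulation: iterating plain softmax self-attention (identity query/key/value maps, a toy stand-in for a trained transformer) shrinks the spread of the token cloud, consistent with the asymptotic convergence result.

```python
import numpy as np

def attention_step(X):
    # Softmax self-attention with identity query/key/value maps: each token
    # is replaced by a positively weighted average of all tokens.
    S = X @ X.T
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)
    return A @ X

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))           # 6 tokens in 4 dimensions
spread0 = np.ptp(X, axis=0).max()
for _ in range(100):
    X = attention_step(X)
spread = np.ptp(X, axis=0).max()

# Averaging with strictly positive weights contracts the token cloud at
# every step, so the spread strictly decreases.
assert spread < spread0
```

Since each row of `A` is a probability vector with strictly positive entries, every layer maps tokens strictly inside the convex hull of the previous tokens, which is the mechanism behind the convergence.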
Paperid:1617
Authors:Malte Mosbach · Jan Niklas Ewertz · Angel Villar-Corrales · Sven Behnke
Abstract:
Learning a latent dynamics model provides a task-agnostic representation of an agent's understanding of its environment. Leveraging this knowledge for model-based reinforcement learning (RL) holds the potential to improve sample efficiency over model-free methods by learning from imagined rollouts. Furthermore, because the latent space serves as input to behavior models, the informative representations learned by the world model facilitate efficient learning of desired skills. Most existing methods rely on holistic representations of the environment’s state. In contrast, humans reason about objects and their interactions, predicting how actions will affect specific parts of their surroundings. Inspired by this, we propose Slot-Attention for Object-centric Latent Dynamics (SOLD), a novel model-based RL algorithm that learns object-centric dynamics models in an unsupervised manner from pixel inputs. We demonstrate that the structured latent space not only improves model interpretability but also provides a valuable input space for behavior models to reason over. Our results show that SOLD outperforms DreamerV3 and TD-MPC2, state-of-the-art model-based RL algorithms, across a range of benchmark robotic environments that require both relational reasoning and low-level manipulation capabilities. Videos and code are available at https://sold-anonymized.github.io.
Paperid:1618
Authors:Hideaki Kim · Tomoharu Iwata · Akinori Fujino
Abstract:
Kernel method-based intensity estimators, formulated within reproducing kernel Hilbert spaces (RKHSs), and classical kernel intensity estimators (KIEs) have been among the most easy-to-implement and feasible methods for estimating the intensity functions of inhomogeneous Poisson processes. While both approaches share the term *"kernel"*, they are founded on distinct theoretical principles, each with its own strengths and limitations. In this paper, we propose a novel regularized kernel method for Poisson processes based on the least squares loss and show that the resulting intensity estimator involves a specialized variant of the representer theorem: it has the dual coefficient of unity and coincides with classical KIEs. This result provides new theoretical insights into the connection between classical KIEs and kernel method-based intensity estimators, while enabling us to develop an efficient KIE by leveraging advanced techniques from RKHS theory. We refer to the proposed model as the *kernel method-based kernel intensity estimator* (K$^2$IE). Through experiments on synthetic datasets, we show that K$^2$IE achieves comparable predictive performance while significantly surpassing state-of-the-art kernel method-based estimators in computational efficiency.
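For concreteness, a classical KIE is just a kernel sum over observed event locations with all coefficients equal to one, which is the form the representer-theorem result recovers. A minimal sketch with a Gaussian kernel and invented event locations:

```python
import math

def gaussian_kernel(x, xi, h):
    # Gaussian smoothing kernel with bandwidth h.
    return math.exp(-0.5 * ((x - xi) / h) ** 2) / (h * math.sqrt(2 * math.pi))

def kie(x, events, h):
    """Classical kernel intensity estimator: a kernel sum over observed
    event locations, i.e., dual coefficients all equal to one."""
    return sum(gaussian_kernel(x, xi, h) for xi in events)

events = [0.0, 0.1, 0.2, 5.0]
# The estimated intensity is higher near the event cluster than far from it.
assert kie(0.1, events, h=0.5) > kie(2.5, events, h=0.5)
```

A general RKHS estimator would instead fit one coefficient per event; the paper's result is that the least-squares formulation forces those coefficients to unity, collapsing to this classical form.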
Paperid:1619
Authors:Mehrdad Ghadiri · Matthew Fahrbach · Yunbum Kook · Ali Jadbabaie
Abstract:
We study tensor completion (TC) through the lens of low-rank tensor decomposition (TD). Many TD algorithms use fast alternating minimization methods, which solve highly structured linear regression problems at each step (e.g., for CP, Tucker, and tensor-train decompositions). However, such algebraic structure is lost in TC regression problems, making direct extensions unclear. To address this, we propose a lifting approach that approximately solves TC regression problems using structured TD regression algorithms as black-box subroutines, enabling sublinear-time methods. We theoretically analyze the convergence rate of our approximate-Richardson-iteration-based algorithm, and we demonstrate on real-world tensors that its running time can be 100x faster than direct methods for CP completion.
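As background for the convergence analysis, plain Richardson iteration solves Ax = b via the update x ← x + ω(b − Ax); the paper's algorithm replaces the exact residual step with approximate structured TD regression solves. A minimal sketch of the exact version:

```python
import numpy as np

def richardson(A, b, omega, iters):
    """Richardson iteration x <- x + omega * (b - A x); converges when the
    spectral radius of (I - omega * A) is below one."""
    x = np.zeros_like(b)
    for _ in range(iters):
        x = x + omega * (b - A @ x)
    return x

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = richardson(A, b, omega=0.2, iters=200)
assert np.allclose(A @ x, b, atol=1e-6)
```

Here A is symmetric positive definite with eigenvalues in roughly [2.4, 4.6], so ω = 0.2 makes the iteration a contraction.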
Paperid:1620
Authors:Shih-Min Yang · Martin Magnusson · Johannes Stork · Todor Stoyanov
Abstract:
Soft Actor-Critic (SAC) has achieved notable success in continuous control tasks but struggles in sparse reward settings, where infrequent rewards make efficient exploration challenging. While novelty-based exploration methods address this issue by encouraging the agent to explore novel states, they are not trivial to apply to SAC. In particular, managing the interaction between novelty-based exploration and SAC’s inherently stochastic policy can lead to inefficient exploration and redundant sample collection. In this paper, we propose KEA (Keeping Exploration Alive), which tackles the inefficiencies in balancing exploration strategies when combining SAC with novelty-based exploration. KEA introduces an additional co-behavior agent that works alongside SAC and a switching mechanism to facilitate proactive coordination between exploration strategies from novelty-based exploration and the stochastic policy. This coordination allows the agent to maintain stochasticity in high-novelty regions, enhancing exploration efficiency and reducing repeated sample collection. We first analyze this potential issue in a 2D navigation task and then evaluate KEA on sparse reward control tasks from the DeepMind Control Suite. Compared to state-of-the-art novelty-based exploration baselines, our experiments show that KEA significantly improves learning efficiency and robustness in sparse reward setups.
Paperid:1621
Authors:Haofei Lu · Yifei Shen · Dongsheng Li · Junliang Xing · Dongqi Han
Abstract:
Diffusion models have shown great promise in decision-making, also known as diffusion planning. However, the slow inference speeds limit their potential for broader real-world applications. Here, we introduce Habi, a general framework that transforms powerful but slow diffusion planning models into fast decision-making models, mimicking the cognitive process in the brain whereby costly goal-directed behavior gradually transitions to efficient habitual behavior with repetitive practice. Even using a laptop CPU, the habitized model can achieve an average 800+ Hz decision-making frequency (faster than previous diffusion planners by orders of magnitude) on the standard offline reinforcement learning benchmark D4RL, while maintaining comparable or even higher performance compared to its corresponding diffusion planner. Our work proposes a fresh perspective on leveraging powerful diffusion models for real-world decision-making tasks. We also provide robust evaluations and analysis, offering insights from both biological and engineering perspectives for efficient and effective decision-making.
Paperid:1622
Authors:Zihang Liu · Tianyu Pang · Oleg Balabanov · Chaoqun Yang · Tianjin Huang · Lu Yin · Yaoqing Yang · Shiwei Liu
Abstract:
Existing Large Language Model (LLM) fine-tuning methods are not optimal. Low-rank-based parameter-efficient fine-tuning (PEFT) methods have superior training efficiency but inferior performance. On the other hand, Full Fine-tuning (Full FT) is capable but lacks efficiency. Sparse fine-tuning, which achieved notable success prior to the advent of LLMs by fine-tuning only a small fraction of parameters, has the potential to balance effectiveness and efficiency. Yet it has fallen behind in the era of modern LLMs, as it struggles to identify parameters truly critical to fine-tuning. In this work, we show that weights with the largest magnitude after low-rank approximation are critical weights for fine-tuning, which we call Principal Weights. Surprisingly, while magnitude-based sparse fine-tuning performs poorly as a baseline on LLMs, our empirical results show it becomes the most effective approach for PEFT after applying rank reduction. Furthermore, by storing only the optimizer states of Principal Weights, we can significantly reduce the memory overhead, resulting in notable memory savings. These findings lead us to Low-rank Informed Sparse Fine-Tuning (LIFT), a sparse fine-tuning approach that achieves superior fine-tuning performance over Full FT while maintaining memory efficiency comparable to common PEFT methods. In addition, compared to full fine-tuning and LoRA, LIFT not only demonstrates better learning on target domains such as arithmetic reasoning but also performs up to 20\% better at retaining source-domain knowledge in areas like commonsense reasoning. Code is submitted.
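As an illustrative sketch only (not the paper's implementation; the rank, density, and function name are assumptions), the Principal-Weight idea, selecting the largest-magnitude entries of a low-rank approximation of a weight matrix as the sparse fine-tuning mask, could look like:

```python
import numpy as np

def principal_weight_mask(W, rank=4, density=0.05):
    """Hypothetical sketch: boolean mask of the largest-magnitude entries
    of a rank-r approximation of W, used to pick weights for sparse FT."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W_lr = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]  # rank-r approximation
    k = max(1, int(density * W.size))               # number of weights to keep
    # threshold at the k-th largest magnitude of the low-rank approximation
    thresh = np.partition(np.abs(W_lr), W.size - k, axis=None)[W.size - k]
    return np.abs(W_lr) >= thresh

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
mask = principal_weight_mask(W, rank=4, density=0.05)
print(mask.sum())  # roughly 5% of the 4096 entries selected
```

Only the masked weights (and their optimizer states) would then be updated during fine-tuning, which is where the memory saving comes from.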
Paperid:1623
Authors:Junwei Su · Chuan Wu
Abstract:
This paper investigates the convergence behavior of score-based graph generative models (SGGMs). Unlike common score-based generative models (SGMs) that are governed by a single stochastic differential equation (SDE), SGGMs utilize a system of dependent SDEs, where the graph structure and node features are modeled separately, while accounting for their inherent dependencies. This distinction makes existing convergence analyses from SGMs inapplicable for SGGMs. In this work, we present the first convergence analysis for SGGMs, focusing on the convergence bound (the risk of generative error) across three key graph generation paradigms: (1) feature generation with a fixed graph structure, (2) graph structure generation with fixed node features, and (3) joint generation of both graph structure and node features. Our analysis reveals several unique factors specific to SGGMs (e.g., the topological properties of the graph structure) which significantly affect the convergence bound. Additionally, we offer theoretical insights into the selection of hyperparameters (e.g., sampling steps and diffusion length) and advocate for techniques like normalization to improve convergence. To validate our theoretical findings, we conduct a controlled empirical study using a synthetic graph model. The results in this paper contribute to a deeper theoretical understanding of SGGMs and offer practical guidance for designing more efficient and effective SGGMs.
Paperid:1624
Authors:Peiyan Hu · Xiaowei Qian · Wenhao Deng · Rui Wang · Haodong Feng · Ruiqi Feng · Tao Zhang · Long Wei · Yue Wang · Zhi-Ming Ma · Tailin Wu
Abstract:
The application of deep learning for partial differential equation (PDE)-constrained control is gaining increasing attention. However, existing methods rarely consider safety requirements crucial in real-world applications. To address this limitation, we propose Safe Diffusion Models for PDE Control (SafeDiffCon), which introduce the uncertainty quantile as a model uncertainty quantification to achieve optimal control under safety constraints through both post-training and inference phases. Firstly, our approach post-trains a pre-trained diffusion model to generate control sequences that better satisfy safety constraints while achieving improved control objectives via a reweighted diffusion loss, which incorporates the uncertainty quantile estimated using conformal prediction. Secondly, during inference, the diffusion model dynamically adjusts both its generation process and parameters through iterative guidance and fine-tuning, conditioned on control targets while simultaneously integrating the estimated uncertainty quantile. We evaluate SafeDiffCon on three control tasks: the 1D Burgers' equation, 2D incompressible fluid, and a controlled nuclear fusion problem. Results demonstrate that SafeDiffCon is the only method that satisfies all safety constraints, whereas other classical and deep learning baselines fail. Furthermore, while adhering to safety constraints, SafeDiffCon achieves the best control performance.
Paperid:1625
Authors:Junbo Li · Zhangyang “Atlas” Wang · qiang liu
Abstract:
Offline preference alignment for language models such as Direct Preference Optimization (DPO) is favored for its effectiveness and simplicity, eliminating the need for costly reinforcement learning. Various offline algorithms have been developed for different data settings, yet they lack a unified understanding. In this study, we introduce Prior-Informed Preference Alignment (PIPA), a unified, RL-free probabilistic framework that formulates language model preference alignment as a Maximum Likelihood Estimation (MLE) problem with prior constraints. This method effectively accommodates both paired and unpaired data, as well as answer- and step-level annotations. We illustrate that DPO and KTO are special cases with different prior constraints within our framework. By integrating different types of prior information, we develop two variations of PIPA: PIPA-M and PIPA-N. Both algorithms demonstrate a $3\sim10\%$ performance enhancement on the GSM8K and MATH benchmarks across all configurations, achieving these gains without additional training or computational costs compared to existing algorithms.
Paperid:1626
Authors:Xingyu Zhou · Yulian Wu · Francesco Orabona
Abstract:
In this paper, we theoretically investigate the effects of noisy labels in offline alignment, with a focus on the interplay between privacy and robustness against adversarial corruption. Specifically, under linear modeling assumptions, we present a unified analysis covering both reinforcement learning from human feedback and direct preference optimization under different privacy-corruption scenarios, such as Local differential privacy-then-Corruption (LTC), where human preference labels are privatized before being corrupted by an adversary, and Corruption-then-Local differential privacy (CTL), where labels are corrupted before privacy protection. Our analysis leverages a reduction framework that reduces the offline alignment problem under linear modeling assumptions to parameter estimation in logistic regression. This framework allows us to establish an interesting separation result between LTC and CTL, demonstrating that LTC presents a greater challenge than CTL in offline alignment, even under linear models. As important by-products, our findings also advance the state-of-the-art theoretical results in offline alignment under privacy-only or corruption-only scenarios.
Paperid:1627
Authors:Alessandro Montenegro · Marco Mussi · Matteo Papini · Alberto Maria Metelli
Abstract:
*Policy gradient* (PG) methods are effective *reinforcement learning* (RL) approaches, particularly for continuous problems. While they optimize stochastic (hyper)policies via action- or parameter-space exploration, real-world applications often require deterministic policies. Existing PG convergence guarantees to deterministic policies assume a fixed stochasticity in the (hyper)policy, tuned according to the desired final suboptimality, whereas practitioners commonly use a dynamic stochasticity level. This work provides the theoretical foundations for this practice. We introduce PES, a phase-based method that reduces stochasticity via a deterministic schedule while running PG subroutines with fixed stochasticity in each phase. Under gradient domination assumptions, PES achieves last-iterate convergence to the optimal deterministic policy with a sample complexity of $\widetilde{\mathcal{O}}(\varepsilon^{-5})$. Additionally, we analyze the common practice, termed SL-PG, of jointly learning stochasticity (via an appropriate parameterization) and (hyper)policy parameters. We show that SL-PG also ensures last-iterate convergence with a rate $\widetilde{\mathcal{O}}(\varepsilon^{-3})$, but only to the optimal stochastic (hyper)policy, requiring stronger assumptions compared to PES.
Paperid:1628
Authors:Xinting Liao · Weiming Liu · Jiaming Qian · Pengyang Zhou · Jiahe Xu · Wenjie Wang · Chaochao Chen · Xiaolin Zheng · Tat-Seng Chua
Abstract:
Federated prompt learning (FPL) for vision-language models is a powerful approach to collaboratively adapt models across distributed clients while preserving data privacy. However, existing FPL approaches suffer from a trade-off between performance and robustness, particularly under out-of-distribution (OOD) shifts, limiting their reliability in real-world scenarios. The inherent in-distribution (ID) data heterogeneity among different clients makes it more challenging to maintain this trade-off. To fill this gap, we introduce a Federated OOD-aware Context Optimization (FOCoOp) framework, which captures diverse distributions among clients using ID global prompts, local prompts, and OOD prompts. Specifically, FOCoOp leverages the three sets of prompts to create both class-level and distribution-level separations, which adapt to OOD shifts through bi-level distribution-robustness optimization. Additionally, FOCoOp improves the discrimination between global prompts and OOD prompts using semi-unbalanced optimal transport, while effectively calibrating seemingly-OOD prompts that are aligned with data unseen in individual clients but present in other clients. Extensive experiments on real-world datasets demonstrate that FOCoOp effectively captures decentralized heterogeneous distributions and enhances robustness to different OOD shifts.
Paperid:1629
Authors:Ziyao Wang · Muneeza Azmat · Ang Li · Raya Horesh · Mikhail Yurochkin
Abstract:
Large Language Models (LLMs) often excel in specific domains but fall short in others due to the limitations of their training. Thus, enabling LLMs to solve problems collaboratively by integrating their complementary knowledge promises to improve their performance across domains. To realize this potential, we introduce a novel Collaborative Speculative Decoding (CoSD) algorithm that enables efficient LLM knowledge fusion at test time without requiring additional model training. CoSD employs a draft model to generate initial sequences and an easy-to-learn rule or decision tree to decide when to invoke an assistant model to improve these drafts. CoSD not only enhances knowledge fusion but also improves inference efficiency, is transferable across domains, and offers greater explainability. Experimental results demonstrate that CoSD improves accuracy by up to 10% across benchmarks compared to existing methods, providing a scalable and effective solution for LLM-based applications.
Paperid:1630
Authors:Yuxiang Zhao · zhuomin chai · Xun Jiang · Qiang Xu · Runsheng Wang · Yibo Lin
Abstract:
Recent advancements have integrated various deep-learning methodologies into physical design, aiming to accelerate workflows and surpass human-devised solutions. However, prior research has primarily concentrated on developing task-specific networks, which necessitates a significant investment of time to construct large, specialized datasets and results in the unintended isolation of models across different tasks. In this paper, we introduce DeepLayout, the first general representation learning framework specifically designed for backend circuit design. To address the distinct characteristics of post-placement circuits, including topological connectivity and geometric distribution, we propose a hybrid encoding architecture that integrates GNNs with spatial transformers. Additionally, the framework includes a flexible decoder module that accommodates a variety of task types, supporting multiple hierarchical outputs such as nets and layouts. To mitigate the high annotation costs associated with layout data, we introduce a mask-based self-supervised learning approach designed explicitly for layout representation. This strategy involves a carefully devised masking approach tailored to layout features, precise reconstruction guidance, and, most critically, two key supervised learning tasks. We conduct extensive experiments on large-scale industrial datasets, demonstrating that DeepLayout surpasses state-of-the-art (SOTA) methods specialized for individual tasks on two crucial layout quality assessment benchmarks. The experimental results underscore the framework's robust capability to learn the intrinsic properties of circuits.
Paperid:1631
Authors:Michael Kirchhof · James Thornton · Pierre Ablin · Louis Béthune · Eugene Ndiaye · Marco Cuturi
Abstract:
The adoption of text-to-image diffusion models raises concerns over reliability, drawing scrutiny under the lens of various metrics like calibration, fairness, or compute efficiency. We focus in this work on two issues that arise when deploying these models: a lack of diversity when prompting images, and a tendency to recreate images from the training set. To solve both problems, we propose a method that coaxes the sampled trajectories of pretrained diffusion models to land on images that fall outside of a reference set. We achieve this by adding repellency terms to the diffusion SDE throughout the generation trajectory, which are triggered whenever the path is expected to land too closely to an image in the shielded reference set. Our method is sparse in the sense that these repellency terms are zero and inactive most of the time, and even more so towards the end of the generation trajectory. Our method, named SPELL for sparse repellency, can be used either with a static reference set that contains protected images, or dynamically, by updating the set at each timestep with the expected images concurrently generated within a batch, and with the images of previously generated batches. We show that adding SPELL to popular diffusion models improves their diversity while impacting their FID only marginally, and performs comparatively better than other recent training-free diversity methods. We also demonstrate how SPELL can ensure a shielded generation away from a very large set of protected images by considering all 1.2M images from ImageNet as the protected set.
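A minimal sketch of the repellency idea (illustrative only; the trigger radius, strength, and function name are assumptions, and the actual method operates inside the diffusion SDE rather than on raw vectors): a term that is zero far from the shielded set and pushes the predicted endpoint away when it lands too close.

```python
import numpy as np

def spell_repellency(x_pred, protected, radius=1.0, strength=5.0):
    """Hypothetical sketch of a sparse repellency term: zero unless the
    predicted denoised sample x_pred falls within `radius` of a protected
    example, in which case it points away from that example."""
    term = np.zeros_like(x_pred)
    for p in protected:
        diff = x_pred - p
        d = np.linalg.norm(diff)
        if d < radius:  # triggered only near the shielded set (hence "sparse")
            term += strength * (radius - d) * diff / (d + 1e-8)
    return term  # would be added to the diffusion drift

protected = [np.zeros(8)]
far = np.full(8, 10.0)   # well outside the repellency radius
near = np.full(8, 0.1)   # inside the repellency radius
print(np.allclose(spell_repellency(far, protected), 0))          # True
print(np.linalg.norm(spell_repellency(near, protected)) > 0)     # True
```

The sparsity is what keeps the guidance cheap: for most of the trajectory the term is identically zero and the sampler behaves as the unmodified model.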
Paperid:1632
Authors:Chenyi yang · Wenjie Nie · Yuxin Zhang · Yuhang Wu · Xiawu Zheng · GUANNAN JIANG · Rongrong Ji
Abstract:
N:M sparsity stands as a progressively important tool for DNN compression, achieving practical speedups by stipulating at most N nonzero components within M sequential weights. Unfortunately, most existing works identify the N:M sparse mask through dense backward propagation to update all weights, which incurs exorbitant training costs. In this paper, we introduce BAME, a method that maintains consistent sparsity throughout the N:M sparse training process. BAME perpetually keeps both forward and backward propagation sparse, while iteratively performing weight pruning-and-regrowing within designated weight blocks to tailor the N:M mask. These blocks are selected through a joint assessment based on accumulated mask oscillation frequency and the expected loss reduction of mask adaptation, thereby ensuring stable and efficient identification of the optimal N:M mask. Our empirical results substantiate the effectiveness of BAME, illustrating that it performs comparably to or better than previous works that fully maintain dense backward propagation during training. For instance, BAME attains a 72.0% top-1 accuracy while training a 1:16 sparse ResNet-50 on ImageNet, eclipsing SR-STE by 0.5%, despite a 2.37× reduction in training FLOPs. Code will be released.
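For intuition, the N:M constraint itself (at most N nonzero components within every M sequential weights) can be sketched as a simple magnitude-based projection. This is purely illustrative: BAME's actual mask selection uses oscillation frequency and expected loss reduction, not this one-shot projection, and the function name is an assumption.

```python
import numpy as np

def project_nm(w, n=1, m=16):
    """Sketch: project a flat weight vector onto N:M sparsity by keeping the
    n largest-magnitude entries in each block of m consecutive weights."""
    w = w.reshape(-1, m).copy()
    # indices of the (m - n) smallest-magnitude entries per block, to be zeroed
    idx = np.argsort(np.abs(w), axis=1)[:, :-n]
    np.put_along_axis(w, idx, 0.0, axis=1)
    return w.reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(64)
sparse = project_nm(w, n=1, m=16)
print((sparse != 0).sum())  # 4 blocks of 16 weights, 1 nonzero each
```

Hardware that supports structured sparsity can skip the zeroed positions, which is where the practical speedup in the abstract comes from.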
Paperid:1633
Authors:Jindong Tong · Hongcheng Liu · Johannes Royset
Abstract:
This paper focuses on non-asymptotic confidence bounds (CBs) for the optimal values of stochastic optimization (SO) problems. Existing approaches often rely on two conditions that may be restrictive: the need for a global Lipschitz constant and the assumption of light-tailed distributions. Beyond either of these conditions, it remains largely unknown whether computable CBs can be constructed. In view of this literature gap, we provide three key findings: (i) Based on the conventional formulation of sample average approximation (SAA), we derive non-Lipschitzian CBs for convex SO problems under heavy tails. (ii) We explore diametrical risk minimization (DRM), a recently introduced modification of SAA, and attain non-Lipschitzian CBs for nonconvex SO problems in light-tailed settings. (iii) We extend our analysis of DRM to handle heavy-tailed randomness by utilizing properties of formulations for training over-parameterized classification models.
Paperid:1634
Authors:Wenbin Wang · Yongcheng Jing · Liang Ding · Yingjie Wang · Li Shen · Yong Luo · Bo Du · Dacheng Tao
Abstract:
High-resolution (HR) image perception remains a key challenge in multimodal large language models (MLLMs). To overcome the limitations of existing methods, this paper shifts away from prior dedicated heuristic approaches and revisits the most fundamental route to HR perception: enhancing the long-context capability of MLLMs, driven by recent advances in long-context techniques like retrieval-augmented generation (RAG) for general LLMs. Towards this end, this paper presents the first study exploring the use of RAG to address HR perception challenges. Specifically, we propose Retrieval-Augmented Perception (RAP), a training-free framework that retrieves and fuses relevant image crops while preserving spatial context using the proposed Spatial-Awareness Layout. To accommodate different tasks, the proposed Retrieved-Exploration Search (RE-Search) dynamically selects the optimal number of crops based on model confidence and retrieval scores. Experimental results on HR benchmarks demonstrate the significant effectiveness of RAP, with LLaVA-v1.5-13B achieving a 43\% improvement on $V^*$ Bench and 19\% on HR-Bench. Code can be found in the supplementary material.
Paperid:1635
Authors:Wangzhi Zhan · Chen Jianpeng · Dongqi Fu · Dawei Zhou
Abstract:
Metamaterials are artificial materials designed to exhibit properties unseen in nature, such as ultra-stiffness and negative material indices. In mechanical metamaterial design, three key modalities are typically involved, i.e., 3D topology, density condition, and mechanical property. Real-world complex application scenarios place demanding requirements on machine learning models to consider all three modalities together. However, a comprehensive literature review indicates that most existing works consider only two modalities, e.g., predicting mechanical properties given the 3D topology or generating 3D topology given the required properties. There thus remains a significant gap in state-of-the-art machine learning models' ability to capture all three modalities. Hence, we propose a unified model named UniMate, which consists of a modality alignment module and a synergetic diffusion generation module. Experiments indicate that UniMate outperforms baseline models on the topology generation, property prediction, and condition confirmation tasks by 80.2%, 5.1%, and 50.2%, respectively. We open-source our proposed UniMate model and corresponding results at https://anonymous.4open.science/status/UniMate-47F2.
Paperid:1636
Authors:Florence Regol · Leo Schwinn · Kyle Sprague · Mark Coates · Thomas L Markovich
Abstract:
A significant challenge in maintaining real-world machine learning models is responding to the continuous and unpredictable evolution of data. Most practitioners are faced with the difficult question: when should I retrain or update my machine learning model? This seemingly straightforward problem is particularly challenging for three reasons: 1) decisions must be made based on very limited information - we usually have access to only a few examples, 2) the nature, extent, and impact of the distribution shift are unknown, and 3) it involves specifying a cost ratio between retraining and poor performance, which can be hard to characterize. Existing works address certain aspects of this problem, but none offer a comprehensive solution. Distribution shift detection falls short as it cannot account for the cost trade-off; the scarcity of the data, paired with its unusual structure, makes it a poor fit for existing offline reinforcement learning methods, and the online learning formulation overlooks key practical considerations. To address this, we present a principled formulation of the retraining problem and propose an uncertainty-based method that makes decisions by continually forecasting the evolution of model performance evaluated with a bounded metric. Our experiments, addressing classification tasks, show that the method consistently outperforms existing baselines on 7 datasets. We thoroughly assess its robustness to varying cost trade-off values and mis-specified cost trade-offs.
Paperid:1637
Authors:Alexandre Capone · Ryan Cosner · Aaron Ames · Sandra Hirche
Abstract:
Control tasks with safety requirements under high levels of model uncertainty are increasingly common. Machine learning techniques are frequently used to address such tasks, typically by leveraging model error bounds to specify robust constraint-based safety filters. However, if the learned model uncertainty is very high, the corresponding filters are potentially invalid, meaning no control input satisfies the constraints imposed by the safety filter. While most works address this issue by assuming some form of safe backup controller, ours tackles it by collecting additional data on the fly using a Gaussian process bandit-type algorithm. We combine a control barrier function with a learned model to specify a robust certificate that ensures safety if feasible. Whenever infeasibility occurs, we leverage the control barrier function to guide exploration, ensuring the collected data contributes toward the closed-loop system safety. By combining a safety filter with exploration in this manner, our method provably achieves safety in a general setting that does not require any prior model or backup controller, provided that the true system lies in a reproducing kernel Hilbert space. To the best of our knowledge, it is the first safe learning-based control method that achieves this.
Paperid:1638
Authors:Míriam Barrabés · Daniel Mas Montserrat · Kapal Dev · Alexander Ioannidis
Abstract:
Feature shifts between data sources are present in healthcare, biomedical, socioeconomic, financial, and multi-sensor data, among others, as a consequence of heterogeneous data files, noisy measurements, or inconsistent processing and standardization pipelines. Localizing shifted features is important to address the underlying cause of the shift and correct or filter the data to avoid degrading downstream analysis. While many techniques can detect distribution shifts, localizing the features causing them is still challenging, with current solutions being either inaccurate or not scalable to large and high-dimensional datasets. In this work, we introduce the Feature Shift Localization Network (FSL-Net), a neural network that can localize feature shifts in large and high-dimensional datasets in a fast and accurate manner. The network, trained on a large number of datasets, learns to extract the statistical properties of the datasets and can localize feature shifts from previously unseen datasets and shifts without the need for re-training. The code and ready-to-use trained model are available at \url{hidden-for-submission}.
Paperid:1639
Authors:Wei Liu · Junlong Li · Xiwen Zhang · Fan Zhou · Yu Cheng · Junxian He
Abstract:
Self-evolving training, where models iteratively learn from their own outputs, has emerged as a key approach for complex reasoning tasks, addressing the scarcity of high-quality chain-of-thought data. However, its effectiveness in multimodal reasoning, a domain more intricate than text-only reasoning, remains underexplored, and the understanding of critical factors in this training paradigm remains limited. Furthermore, a central challenge for this training method is performance saturation, which impedes further improvements and scalability. Inspired by reinforcement learning (RL), in this paper, we reframe self-evolving training for multimodal reasoning through the lens of RL, identifying three pivotal factors: $\textit{Training Method}$, $\textit{Reward Model}$, and $\textit{Prompt Variation}$. Through systematic analysis, we establish relatively optimal design principles that significantly enhance multimodal reasoning capabilities. Moreover, delving deeper into the training dynamics, we uncover the roots of saturation and propose a new automatic balancing mechanism to mitigate this limitation. Building on these insights, we propose M-STaR (**M**ultimodal **S**elf-evolving **T**r**a**ining for **R**easoning), a framework that achieves consistent performance gains across models of varying sizes and diverse benchmarks. All resources will be made publicly available.
Paperid:1640
Authors:Vincent Herrmann · Róbert Csordás · Jürgen Schmidhuber
Abstract:
Detecting when a neural sequence model does "interesting" computation is an open problem. The next token prediction loss is a poor indicator: low loss can stem from trivially predictable sequences that are uninteresting, while high loss may reflect unpredictable but also irrelevant information that can be ignored by the model. We propose a better metric: measuring the model's ability to predict its own future hidden states. We show empirically that this metric, in contrast to the next token prediction loss, correlates with the intuitive interestingness of the task. To measure predictability, we introduce the architecture-agnostic "prediction of hidden states" (PHi) layer that serves as an information bottleneck on the main pathway of the network (e.g., the residual stream in Transformers). We propose a novel learned predictive prior that enables us to measure the novel information gained in each computation step, which serves as our metric. We show empirically that our metric predicts the description length of formal languages learned in-context, the complexity of mathematical reasoning problems, and the correctness of self-generated reasoning chains.
Paperid:1641
Authors:Zecheng Hao · Qichao Ma · Kang Chen · Yi Zhang · Zhaofei Yu · Tiejun Huang
Abstract:
The Spiking Neural Network (SNN), a brain-inspired and energy-efficient network, currently faces the pivotal challenge of finding a suitable and efficient learning framework. The predominant training methodologies, namely Spatial-Temporal Back-propagation (STBP) and ANN-SNN Conversion, are encumbered by substantial training overhead or pronounced inference latency, which impedes the advancement of SNNs in scaling to larger networks and navigating intricate application domains. In this work, we propose a novel parallel conversion learning framework, which establishes a mathematical mapping between each time-step of the parallel spiking neurons and the cumulative spike firing rate. We theoretically validate the lossless and sorting properties of the conversion process, and point out the optimal shifting distance for each step. Furthermore, by integrating the above framework with a distribution-aware error calibration technique, we can achieve efficient conversion towards more general activation functions or training-free circumstances. Extensive experiments have confirmed the significant performance advantages of our method for various conversion cases under ultra-low time latency. To the best of our knowledge, this is the first work that jointly utilizes parallel spiking calculation and ANN-SNN Conversion, providing a highly promising approach for SNN supervised training.
Paperid:1642
Authors:Abdulkadir Gokce · Martin Schrimpf
Abstract:
When trained on large-scale object classification datasets, certain artificial neural network models begin to approximate core object recognition (COR) behaviors and neural response patterns in the primate visual ventral stream (VVS). While recent machine learning advances suggest that scaling model size, dataset size, and compute resources improve task performance, the impact of scaling on brain alignment remains unclear. In this study, we explore scaling laws for modeling the primate VVS by systematically evaluating over 600 models trained under controlled conditions on benchmarks spanning V1, V2, V4, IT and COR behaviors. We observe that while behavioral alignment continues to scale with larger models, neural alignment saturates. This observation remains true across model architectures and training datasets, even though models with stronger inductive bias and datasets with higher-quality images are more compute-efficient. Increased scaling is especially beneficial for higher-level visual areas, where small models trained on few samples exhibit only poor alignment. Our results suggest that while scaling alone might suffice for alignment with human core object recognition behavior, it will not yield improved models of the brain's visual ventral stream with current architectures and datasets, highlighting the need for novel strategies in building brain-like models.
Paperid:1643
Authors:Jongha (Jon) Ryu · Abhin Shah · Gregory Wornell
Abstract:
This paper studies a family of estimators based on noise-contrastive estimation (NCE) for learning unnormalized distributions. The main contribution of this work is to provide a unified perspective on various methods for learning unnormalized distributions, which have been independently proposed and studied in separate research communities, through the lens of NCE. This unified view offers new insights into existing estimators. Specifically, for exponential families, we establish the finite-sample convergence rates of the proposed estimators under a set of regularity assumptions, most of which are new.
Paperid:1644
Authors:Gholamali Aminian · Amir R. Asadi · Tian Li · Ahmad Beirami · Gesine Reinert · Samuel Cohen
Abstract:
The generalization error (risk) of a supervised statistical learning algorithm quantifies its prediction ability on previously unseen data. Inspired by exponential tilting, \citet{li2020tilted} proposed the {\it tilted empirical risk} (TER) as a nonlinear risk metric for machine learning applications such as classification and regression problems. In this work, we examine the generalization error of the tilted empirical risk in the robustness regime under \textit{negative tilt}. Our first contribution is to provide uniform and information-theoretic bounds on the {\it tilted generalization error}, defined as the difference between the population risk and the tilted empirical risk, under negative tilt for unbounded loss functions with bounded $(1+\epsilon)$-th moment for some $\epsilon\in(0,1]$, with a convergence rate of $O(n^{-\epsilon/(1+\epsilon)})$, where $n$ is the number of training samples, revealing a novel application of TER under no distribution shift. Secondly, we study the robustness of the tilted empirical risk with respect to noisy outliers at training time and provide theoretical guarantees under distribution shift for the tilted empirical risk. We empirically corroborate our findings in simple experimental setups where we evaluate our bounds to select the value of the tilt in a data-driven manner.
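The tilted empirical risk of \citet{li2020tilted} has a closed form that makes the effect of negative tilt easy to see; the sketch below uses made-up loss values for illustration.

```python
import math

def tilted_empirical_risk(losses, t):
    """Tilted empirical risk: (1/t) * log(mean(exp(t * loss_i))).
    Negative t down-weights large (outlier) losses; positive t emphasizes them."""
    n = len(losses)
    return (1.0 / t) * math.log(sum(math.exp(t * l) for l in losses) / n)

losses = [0.1, 0.2, 0.15, 5.0]  # one noisy outlier among small losses
erm = sum(losses) / len(losses)           # ordinary empirical risk
ter_neg = tilted_empirical_risk(losses, t=-2.0)
print(erm)       # dominated by the outlier
print(ter_neg)   # closer to the bulk of the losses
```

As $t \to 0$ the tilted risk recovers the ordinary empirical risk, which is why the tilt value can be selected in a data-driven manner as the abstract describes.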
Paperid:1645
Authors:Ke Zhu · Shu Yang · Xiaofei Wang
Abstract:
External controls from historical trials or observational data can augment randomized controlled trials when large-scale randomization is impractical or unethical, such as in drug evaluation for rare diseases. However, non-randomized external controls can introduce biases, and existing Bayesian and frequentist methods may inflate the type I error rate, particularly in small-sample trials where external data borrowing is most critical. To address these challenges, we propose a randomization inference framework that ensures finite-sample exact and model-free type I error rate control, adhering to the “analyze as you randomize” principle to safeguard against hidden biases. Recognizing that biased external controls reduce the power of randomization tests, we leverage conformal inference to develop an individualized test-then-pool procedure that selectively borrows comparable external controls to improve power. Our approach incorporates selection uncertainty into randomization tests, providing valid post-selection inference. Additionally, we propose an adaptive procedure to optimize the selection threshold by minimizing the mean squared error across a class of estimators encompassing both no-borrowing and full-borrowing approaches. The proposed methods are supported by non-asymptotic theoretical analysis, validated through simulations, and applied to a randomized lung cancer trial that integrates external controls from the National Cancer Database.
Paperid:1646
Authors:Guanglong Sun · Hongwei Yan · Liyuan Wang · Qian Li · Bo Lei · Yi Zhong
Abstract:
Knowledge distillation (KD) is a powerful strategy for training deep neural networks (DNNs). While it was originally proposed to train a more compact “student” model from a large “teacher” model, many recent efforts have focused on adapting it as an effective way to promote generalization of the model itself, such as online KD and self KD. Here, we propose an easy-to-use and compatible strategy named Spaced KD to improve the effectiveness of both online KD and self KD, in which the student model distills knowledge from a teacher model trained with a space interval ahead. This strategy is inspired by a prominent theory named the spacing effect in the field of biological learning and memory, positing that appropriate intervals between learning trials can significantly enhance learning performance. We provide an in-depth theoretical and empirical analysis showing that the benefits of the proposed spacing effect in KD stem from seeking flat minima during stochastic gradient descent (SGD). We perform extensive experiments to demonstrate the effectiveness of our Spaced KD in improving the learning performance of DNNs (e.g., the additional performance gain is up to 2.31% and 3.34% on Tiny-ImageNet over online KD and self KD, respectively).
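The spacing determines which teacher snapshot supplies the soft targets; the distillation loss itself is the standard temperature-softened KL divergence, sketched below (temperature value illustrative, not from the paper):

```python
import math

def softmax(logits, temperature=1.0):
    # temperature-scaled softmax over a list of logits
    exps = [math.exp(l / temperature) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Standard distillation loss: T^2 * KL(teacher || student)
    computed on temperature-softened distributions. In Spaced KD,
    teacher_logits would come from a checkpoint trained a space
    interval ahead of the student."""
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```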
Paperid:1647
Authors:Anqi Lu · Junchi Yan
Abstract:
For the fundamental linear programming (LP) problems, the simplex method remains popular, which usually requires an appropriate initial basis as a warm start to accelerate the solving process. Predicting an initial basis close to an optimal one can often accelerate the solver, but a closer initial basis does not always result in greater acceleration. To achieve better acceleration, we propose a GNN model based on a tripartite graph representation inspired by LP duality. This approach enables more effective feature extraction for general LP problems and enhances the expressiveness of GNNs. Additionally, we introduce novel loss functions targeting basic variable selection and basis feasibility, along with data preprocessing schemes, to further improve learning capability. In addition to achieving high prediction accuracy, we enhance the quality of the initial basis for practical use. Experimental results show that our approach greatly surpasses the state-of-the-art method in predicting the initial basis with greater accuracy and in reducing the number of iterations and solving time of the LP solver.
Paperid:1648
Authors:Teng Huang · Bin-Bin Jia · Min-Ling Zhang
Abstract:
In multidimensional classification (MDC), the semantics of objects are characterized by multiple class variables from different dimensions. Existing MDC approaches focus on designing effective class dependency modeling strategies to enhance classification performance. However, the intercoupling of multiple class variables poses a significant challenge to the precise modeling of class dependencies. In this paper, we make the first attempt towards escaping from class dependency modeling for addressing MDC problems. Accordingly, a novel MDC approach named DCOM is proposed by decoupling the interactions of different dimensions in MDC. Specifically, DCOM endeavors to identify a latent factor that encapsulates the most salient and critical feature information. This factor will facilitate partial conditional independence among class variables conditioned on both the original feature vector and the learned latent embedding. Once the conditional independence is established, classification models can be readily induced by employing simple neural networks on each dimension. Extensive experiments conducted on benchmark data sets demonstrate that DCOM outperforms other state-of-the-art MDC approaches.
Paperid:1649
Authors:Haoye Lu · Qifan Wu · Yaoliang Yu
Abstract:
Recent diffusion-based generative models achieve remarkable results by training on massive datasets, yet this practice raises concerns about memorization and copyright infringement. A proposed remedy is to train exclusively on noisy data with potential copyright issues, ensuring the model never observes original content. However, through the lens of deconvolution theory, we show that although it is theoretically feasible to learn the data distribution from noisy samples, the practical challenge of collecting sufficient samples makes successful learning nearly unattainable. To overcome this limitation, we propose to pretrain the model with a small fraction of clean data to guide the deconvolution process. Combined with our Stochastic Forward--Backward Deconvolution (SFBD) method, we attain an FID of $6.31$ on CIFAR-10 with just $4\%$ clean images (and $3.58$ with $10\%$). Theoretically, we prove that SFBD guides the model to learn the true data distribution. The result also highlights the importance of pretraining on limited but clean data or the alternative from similar datasets. Empirical studies further support these findings and offer additional insights.
Paperid:1650
Authors:Rajat Rasal · Avinash Kori · Fabio De Sousa Ribeiro · Tian Xia · Ben Glocker
Abstract:
Counterfactual image generation presents significant challenges, including preserving identity, maintaining perceptual quality, and ensuring faithfulness to an underlying causal model. While existing autoencoding frameworks admit semantic latent spaces which can be manipulated for causal control, they struggle with scalability and fidelity. Advancements in diffusion models present opportunities for improving counterfactual image editing, having demonstrated state-of-the-art visual quality, human-aligned perception and representation learning capabilities. Here, we present a suite of diffusion-based causal mechanisms, introducing the notions of spatial, semantic and dynamic abduction. We propose a general framework that integrates semantic representations into diffusion models through the lens of Pearlian causality to edit images via a counterfactual reasoning process. To the best of our knowledge, ours is the first work to consider high-level semantic identity preservation for diffusion counterfactuals and to demonstrate how semantic control enables principled trade-offs between faithful causal control and identity preservation.
Paperid:1651
Authors:Xun Wang · Jing Xu · Franziska Boenisch · Michael Backes · Christopher A. Choquette Choo · Adam Dziedzic
Abstract:
Prompting has become a dominant paradigm for adapting large language models (LLMs). While discrete (textual) prompts are widely used for their interpretability, soft (parameter) prompts have recently gained traction in APIs. This is because they can encode information from more training samples while minimizing the user's token usage, leaving more space in the context window for task-specific input. However, soft prompts are tightly coupled to the LLM they are tuned on, limiting their generalization to other LLMs. This constraint is particularly problematic for efficiency and privacy: (1) tuning prompts on each LLM incurs high computational costs, especially as LLMs continue to grow in size. Additionally, (2) when the LLM is hosted externally, soft prompt tuning often requires sharing private data with the LLM provider. For instance, this is the case with the NVIDIA NeMo API. To address these issues, we propose POST (Privacy Of Soft prompt Transfer), a framework that enables private tuning of soft prompts on a small model and subsequently transfers these prompts to a larger LLM. POST uses knowledge distillation to derive a small model directly from the large LLM to improve prompt transferability, tunes the soft prompt locally---optionally with differential privacy guarantees---and transfers it back to the larger LLM using a small public dataset. Our experiments show that POST reduces computational costs, preserves privacy, and effectively transfers high-utility soft prompts.
Paperid:1652
Authors:Boqi Li · Youjun Wang · Weiwei Liu
Abstract:
Continual learning (CL) focuses on the ability of models to learn sequentially from a stream of tasks. A major challenge in CL is catastrophic forgetting (CF). CF is a phenomenon where the model experiences significant performance degradation on previously learned tasks after training on new tasks. Although CF is commonly observed in convolutional neural networks (CNNs), theoretical understanding of CF in CNNs remains limited. To fill the gap, we present a theoretical analysis of CF in a two-layer CNN. By employing a multi-view data model, we analyze the learning dynamics of different features throughout CL and derive theoretical insights. The findings are supported by empirical results from both simulated and real-world datasets.
Paperid:1653
Authors:Seungyul Han · Jeongmo Kim · Yisak Park · Minung Kim
Abstract:
Meta reinforcement learning aims to develop policies that generalize to unseen tasks sampled from a task distribution. While context-based meta-RL methods improve task representation using task latents, they often struggle with out-of-distribution (OOD) tasks. To address this, we propose Task-Aware Virtual Training (TAVT), a novel algorithm that accurately captures task characteristics for both training and OOD scenarios using metric-based representation learning. Our method successfully preserves task characteristics in virtual tasks and employs a state regularization technique to mitigate overestimation errors in state-varying environments. Numerical results demonstrate that TAVT significantly enhances generalization to OOD tasks across various MuJoCo and MetaWorld environments.
Paperid:1654
Authors:Hilal Asi · Vinod Raman · Aadirupa Saha
Abstract:
We design differentially private algorithms for the problem of prediction with expert advice under dynamic regret, also known as tracking the best expert. Our work addresses three natural types of adversaries, stochastic with shifting distributions, oblivious, and adaptive, and designs algorithms with sublinear regret for all three cases. In particular, under a shifting stochastic adversary where the distribution may shift $S$ times, we provide an $\epsilon$-differentially private algorithm whose expected dynamic regret is at most $O\left( \sqrt{S T \log (NT)} + \frac{S \log (NT)}{\epsilon}\right)$, where $T$ and $N$ are the time horizon and number of experts, respectively. For oblivious adversaries, we give a reduction from dynamic regret minimization to static regret minimization, resulting in an upper bound of $O\left(\sqrt{S T \log(NT)} + \frac{S T^{1/3}\log(T/\delta) \log(NT)}{\epsilon ^{2/3}}\right)$ on the expected dynamic regret, where $S$ now denotes the allowable number of switches of the best expert. Finally, similar to static regret, we establish a fundamental separation between oblivious and adaptive adversaries for the dynamic setting: while our algorithms show that sub-linear regret is achievable for oblivious adversaries in the high-privacy regime $\epsilon \le \sqrt{S/T}$, we show that any $(\epsilon, \delta)$-differentially private algorithm must suffer linear dynamic regret under adaptive adversaries for $\epsilon \le \sqrt{S/T}$. To complement this lower bound, we give an $\epsilon$-differentially private algorithm that attains sub-linear dynamic regret under adaptive adversaries whenever $\epsilon \gg \sqrt{S/T}$.
Paperid:1655
Authors:Guiping Cao · Tao Wang · Wenjian Huang · Xiangyuan Lan · Jianguo Zhang · Dongmei Jiang
Abstract:
Open-Ended object Detection (OED) is a novel and challenging task that detects objects and generates their category names in a free-form manner, eliminating the need for additional vocabularies during inference compared to Open Vocabulary Detection. However, the existing OED models, such as GenerateU, require large-scale datasets for training, suffer from slow convergence, and exhibit limited performance. To address these issues, we present a novel and efficient Open-Det framework, consisting of four collaborative parts. Specifically, Open-Det accelerates model training in both the bounding box and object name generation process by reconstructing the Object Detector and the Object Name Generator. To bridge the semantic gap between Vision and Language modalities, we propose a Vision-Language Aligner with V-to-L and L-to-V alignment mechanisms, together with the Prompts Distiller to transfer knowledge from the VLM into VL-prompts, enabling accurate object name generation for the LLM. In addition, we design a Masked Alignment Loss to eliminate contradictory supervision and introduce a Joint Loss to enhance classification, resulting in more efficient training. Compared to GenerateU, Open-Det, using only 1.5% of the training data (0.077M vs. 5.077M), 20.8% of the training epochs (31 vs. 149), and fewer GPU resources (4 V100 vs. 16 A100), achieves even higher performance (+1.0% in APr). The source codes will be released upon acceptance.
Paperid:1656
Authors:Hongshen Xu · Zichen Zhu · Lei Pan · Zihan Wang · Su Zhu · Da Ma · Ruisheng Cao · Lu Chen · Kai Yu
Abstract:
Large Language Models (LLMs) have expanded their capabilities beyond language generation to interact with external tools, enabling automation and real-world applications. However, tool hallucinations—where models either select inappropriate tools or misuse them—pose significant challenges, leading to erroneous task execution, increased computational costs, and reduced system reliability. To systematically address this issue, we define and categorize tool hallucinations into two main types: tool selection hallucination and tool usage hallucination. To evaluate and mitigate these issues, we introduce RelyToolBench, which integrates specialized test cases and novel metrics to assess hallucination-aware task success and efficiency. Finally, we propose Relign, a reliability alignment framework that expands the tool-use action space to include indecisive actions, allowing LLMs to defer tool use, seek clarification, or adjust tool selection dynamically. Through extensive experiments, we demonstrate that Relign significantly reduces tool hallucinations, improves task reliability, and enhances the efficiency of LLM tool interactions. The code and data will be publicly available.
Paperid:1657
Authors:Mehrdad Moghimi · Hyejin Ku
Abstract:
In domains such as finance, healthcare, and robotics, managing worst-case scenarios is critical, as failure to do so can lead to catastrophic outcomes. Distributional Reinforcement Learning (DRL) provides a natural framework to incorporate risk sensitivity into decision-making processes. However, existing approaches face two key limitations: (1) the use of fixed risk measures at each decision step often results in overly conservative policies, and (2) the interpretation and theoretical properties of the learned policies remain unclear. While optimizing a static risk measure addresses these issues, its use in the DRL framework has been limited to the simple static CVaR risk measure. In this paper, we present a novel DRL algorithm with convergence guarantees that optimizes for a broader class of static Spectral Risk Measures (SRM). Additionally, we provide a clear interpretation of the learned policy by leveraging the distribution of returns in DRL and the decomposition of static coherent risk measures. Extensive experiments demonstrate that our model learns policies aligned with the SRM objective, and outperforms existing risk-neutral and risk-sensitive DRL models in various settings.
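For intuition, the empirical form of a spectral risk measure is a weighted average of sorted outcomes, with CVaR recovered by a particular weight profile. A schematic sketch of these static risk objects only (not the paper's DRL algorithm; the loss-based sign convention here is one common choice):

```python
import math

def spectral_risk(losses, weights):
    """Empirical spectral risk: a weighted average of sorted losses,
    with non-increasing weights placed on the worst (largest) losses.
    weights must be non-negative and sum to 1."""
    ordered = sorted(losses, reverse=True)  # worst outcomes first
    return sum(w * l for w, l in zip(weights, ordered))

def cvar_weights(n, alpha):
    """CVaR_alpha as a special case of the spectrum: uniform weight
    on the worst ceil(alpha*n) outcomes, zero on the rest."""
    k = max(1, math.ceil(alpha * n))
    return [1.0 / k] * k + [0.0] * (n - k)
```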
Paperid:1658
Authors:Jaeheun Jung · Jaehyuk Lee · ChangHae Jung · Hanyoung Kim · Bosung Jung · Donghun Lee
Abstract:
Shock waves caused by earthquakes can be devastating. Generating realistic earthquake-caused ground motion waveforms helps reduce losses in lives and properties, yet generative models for the task tend to generate subpar waveforms. We present the High-fidelity Earthquake Ground-motion Generation System (HEGGS) and demonstrate its superior performance using earthquakes from North American, East Asian, and European regions. HEGGS exploits the intrinsic characteristics of earthquake datasets and learns the waveforms using an end-to-end differentiable generator containing a conditional latent diffusion model and a high-fidelity waveform construction model. We show the learning efficiency of HEGGS by training it on a single GPU machine and validate its performance using earthquake databases from North America, East Asia, and Europe, using diverse criteria from waveform generation tasks and seismology. Once trained, HEGGS can generate three-dimensional E-N-Z seismic waveforms with accurate P/S phase arrivals, envelope correlation, signal-to-noise ratio, GMPE analysis, frequency content analysis, and section plot analysis.
Paperid:1659
Authors:Shuqi Lu · Xiaohong Ji · Lin Yao · Siyuan Liu · Bohang Zhang · Zhifeng Gao · Linfeng Zhang · Guolin Ke
Abstract:
Molecular pretrained representations (MPR) have emerged as a powerful approach for addressing the challenge of limited supervised data in applications such as drug discovery and material design. While early MPR methods relied on 1D sequences and 2D graphs, recent advancements have incorporated 3D conformational information to capture rich atomic interactions. However, these prior models treat molecules merely as discrete atom sets, overlooking the space surrounding them. We argue from a physical perspective that modeling only these discrete points is insufficient. We first present a simple yet insightful observation: naively adding randomly sampled virtual points beyond atoms can surprisingly enhance MPR performance. In light of this, we propose a principled framework that incorporates the entire 3D space spanned by molecules. We implement the framework via a novel Transformer-based architecture, dubbed SpaceFormer, with three key components: (1) grid-based space discretization; (2) grid sampling/merging; and (3) efficient 3D positional encoding. Extensive experiments show that SpaceFormer significantly outperforms previous 3D MPR models across various downstream tasks with limited data, validating the benefit of leveraging the additional 3D space beyond atoms in MPR models.
Paperid:1660
Authors:Sanghoon Yu · Min-hwan Oh
Abstract:
We study the linear bandit problem under limited adaptivity, known as the batched linear bandit. While existing approaches can achieve near-optimal regret in theory, they are often computationally prohibitive or underperform in practice. We propose \texttt{BLAE}, a novel batched algorithm that integrates arm elimination with regularized G-optimal design, achieving (up to logarithmic factors in $T$) the minimax optimal regret in both large-$K$ and small-$K$ regimes for the first time, while using only $O(\log\log T)$ batches. Our analysis introduces new techniques for batch-wise optimal design and refined concentration bounds. Crucially, \texttt{BLAE} demonstrates low computational overhead and strong empirical performance, outperforming state-of-the-art methods in extensive numerical evaluations. Thus, \texttt{BLAE} is the first algorithm to combine provable near-optimality in all regimes and practical superiority in batched linear bandits.
Paperid:1661
Authors:Hengrui Zhang · Liancheng Fang · Qitian Wu · Philip Yu
Abstract:
While autoregressive models dominate natural language generation, their application to tabular data remains limited due to two challenges: 1) tabular data contains heterogeneous types, whereas autoregressive next-token (distribution) prediction is designed for discrete data, and 2) tabular data is column permutation-invariant, requiring flexible generation orders. Traditional autoregressive models, with their fixed generation order, struggle with tasks like missing data imputation, where the target and conditioning columns vary. To address these issues, we propose Diffusion-nested Non-autoregressive Transformer (TabNAT), a hybrid model combining diffusion processes and masked generative modeling. For continuous columns, TabNAT uses a diffusion model to parameterize their conditional distributions, while for discrete columns, it employs next-token prediction with KL divergence minimization. A masked Transformer with bi-directional attention enables order-agnostic generation, allowing it to learn the distribution of target columns conditioned on arbitrary observed columns. Extensive experiments on ten datasets with diverse properties demonstrate TabNAT's superiority in both unconditional tabular data generation and conditional missing data imputation tasks.
Paperid:1662
Authors:Tianwei Lin · Wenqiao Zhang · Sijing Li · Yuqian Yuan · Binhe Yu · Haoyuan Li · Wanggui He · Hao Jiang · Mengze Li · Song xiaohui · Siliang Tang · Jun Xiao · Hui Lin · Yueting Zhuang · Beng Chin Ooi
Abstract:
We present HealthGPT, a powerful Medical Large Vision-Language Model (Med-LVLM) that integrates medical visual comprehension and generation capabilities within a unified autoregressive paradigm. Our bootstrapping philosophy is to progressively adapt heterogeneous comprehension and generation knowledge to pre-trained large language models (LLMs). This is achieved through a novel heterogeneous low-rank adaptation (H-LoRA) technique, which is complemented by a tailored hierarchical visual perception approach and a three-stage learning strategy. To effectively learn the HealthGPT, we devise a comprehensive medical domain-specific comprehension and generation dataset called VL-Health. Experimental results demonstrate exceptional performance and scalability of HealthGPT in medical visual unified tasks. Our project can be accessed at https://anonymous.4open.science/r/HealthGPT-9533. We will open source the key techniques and dataset to support further research in this direction.
Paperid:1663
Authors:Xiang Zhang · Jiaqi Wei · Sheng Xu · Zijie Qiu · Nanqing Dong · ZhiQiang Gao · Siqi Sun
Abstract:
Peptide sequencing—the process of identifying amino acid sequences from mass spectrometry data—is a fundamental task in proteomics. Non-Autoregressive Transformers (NATs) have proven highly effective for this task, outperforming traditional methods. Unlike autoregressive models, which generate tokens sequentially, NATs predict all positions simultaneously, leveraging bidirectional context through unmasked self-attention. However, existing NAT approaches often rely on Connectionist Temporal Classification (CTC) loss, which presents significant optimization challenges due to CTC's complexity and increases the risk of training failures. To address these issues, we propose an improved non-autoregressive peptide sequencing model that incorporates a structured protein sequence curriculum learning strategy. This approach adjusts the learning difficulty based on the model's estimated generation capability, assessed through a sampling process, progressively learning peptide generation from simple to complex sequences. Additionally, we introduce a self-refining inference-time module that iteratively enhances predictions using learned NAT token embeddings, improving sequence accuracy at a fine-grained level. Our curriculum learning strategy reduces the frequency of NAT training failures by more than 90% across various data distributions. Evaluations on nine benchmark species demonstrate that our approach outperforms all previous methods across multiple metrics and species, from the amino acid to the peptide level. All results are fully reproducible, with code and trained model weights available in an anonymous repository.
Paperid:1664
Authors:Zachary Brown · David Carlson
Abstract:
The field of hypothesis generation promises to reduce costs in neuroscience by narrowing the range of interventional studies needed to study various phenomena. Existing machine learning methods can generate scientific hypotheses from complex datasets, but many approaches assume causal relationships are static over time, limiting their applicability to systems with dynamic, statedependent behavior, such as the brain. While some techniques attempt dynamic causal discovery through factor models, they often restrict relationships to linear patterns or impose other simplifying assumptions. We propose a novel method that models dynamic graphs as a conditionally weighted superposition of static graphs, where each static graph can capture nonlinear relationships. This approach enables the detection of complex, time-varying interactions between variables beyond linear limitations. Our method improves f1-scores of predicted dynamic causal patterns by roughly 22-28% on average over baselines in some of our experiments, with some improvements reaching well over 60%. A case study on real brain data demonstrates our method's ability to uncover relationships linked to specific behavioral states, offering valuable insights into neural dynamics.
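The superposition step itself can be sketched directly; the nonlinearity enters through how the state-conditional weights are produced, which is omitted here. Function and variable names are hypothetical, not from the paper:

```python
def dynamic_adjacency(static_graphs, weights):
    """Dynamic graph as a weighted superposition of static graphs:
    A(t) = sum_k w_k(t) * A_k, where the w_k(t) are state-conditional
    mixing weights and each A_k is an adjacency matrix given as a
    list of lists."""
    n = len(static_graphs[0])
    out = [[0.0] * n for _ in range(n)]
    for w, A in zip(weights, static_graphs):
        for i in range(n):
            for j in range(n):
                out[i][j] += w * A[i][j]
    return out

# Two static causal patterns mixed by (hypothetical) state-dependent weights
A1 = [[0.0, 1.0], [0.0, 0.0]]  # variable 0 drives variable 1
A2 = [[0.0, 0.0], [1.0, 0.0]]  # variable 1 drives variable 0
A_t = dynamic_adjacency([A1, A2], [0.25, 0.75])
```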
Paperid:1665
Authors:Thomas Chen · Zhiyuan Li · Tengyu Ma
Abstract:
Length generalization is the ability of a learner to learn a hypothesis that generalizes to inputs that are longer than the inputs in the training set. Length generalization is important for training systems to perform well on mathematical and algorithmic tasks where instances of large length are scarce in the training data, but where interesting instances have long length. In this paper, we propose a theoretical explanation for when length generalization is possible and when we can get non-asymptotic guarantees on the required length of training data to length generalize. We study a learning setup where the learner is a minimum-complexity interpolator, for a natural complexity measure of functions in the hypothesis class. Within this setup, we first prove a non-asymptotic upper bound of $2n - 2$ for the input-length required to perfectly length generalize on the class of $n$-state Deterministic Finite Automata (DFA). Next, we prove that there is no computable upper bound for the input-length required to perfectly length generalize on the class of Context-Free Grammars (CFGs). The sudden jump in the difficulty of length generalization for CFGs motivated us to look for other realistic function classes with more amenable length generalization guarantees, such as C-RASP: a class of functions $:\{ 0,1\}^* \to \{ 0,1\}$ that is a superset of functions expressible by finite-precision transformers. As our main result, we prove a non-asymptotic upper bound of $O((KT)^{O(K^2)})$ on the input-length required to perfectly length generalize on the class of $2$-layer, $K$-head, $T$-precision C-RASP functions. Our length generalization results apply to the Dyck-$1$ language and its higher-precision variants, showing that inputs of length at most $\mathrm{poly}(T)$ are sufficient to perfectly length generalize.
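For reference, Dyck-1 membership reduces to a single running counter, the kind of bounded-counting computation at issue in these results. An illustrative checker, not taken from the paper:

```python
def is_dyck1(s):
    """Membership test for Dyck-1 (balanced strings over '(' and ')'):
    a single running depth counter suffices; the string is in the
    language iff the depth never goes negative and ends at zero."""
    depth = 0
    for c in s:
        depth += 1 if c == '(' else -1
        if depth < 0:  # a ')' with no matching '(' so far
            return False
    return depth == 0
```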
Paperid:1666
Authors:Qi Xu · Jie Deng · Jiangrong Shen · Biwu Chen · Huajin Tang · Gang Pan
Abstract:
Event-based object detection has gradually drawn attention for its high time resolution, high dynamic range, and asynchronous address event representation. Spiking Neural Networks (SNNs) offer distinct advantages, including low energy consumption and rich spatiotemporal dynamics. Therefore, in this study, a novel hybrid spike vision Transformer (HsVT) model is proposed, which combines spatial and temporal feature extraction components. The spatial feature extraction module extracts the local and global features in the input data, and the temporal feature extraction module captures the time dependencies and long-term patterns in the input sequence. With this combination, HsVT models can better capture spatial and temporal features when processing sequence data, thus improving the modeling ability for complex event-based object detection tasks. In addition, we collect a fall detection dataset and release it publicly as a benchmark for event-based object detection. Recorded with an event-based camera, the fall detection dataset preserves facial privacy and saves memory storage thanks to the event representation. We evaluated the performance of HsVT at different model sizes and compared performance on the GEN1 and fall detection datasets. The experimental results show that the HsVT method achieves remarkable performance improvements on the event detection task with fewer parameters, and provides an effective solution for event-based object detection.
Paperid:1667
Authors:Nico Pelleriti · Max Zimmer · Elias Wirth · Sebastian Pokutta
Abstract:
Deep neural networks have reshaped modern machine learning by learning powerful latent representations that often align with the manifold hypothesis: high-dimensional data lie on lower-dimensional manifolds. In this paper, we establish a connection between manifold learning and computational algebra by demonstrating how vanishing ideals can characterize the latent manifolds of deep networks. To that end, we propose a new neural architecture that (i) truncates a pretrained network at an intermediate layer, (ii) approximates each class manifold via polynomial generators of the vanishing ideal, and (iii) transforms the resulting latent space into linearly separable features through a single polynomial layer. The resulting models have significantly fewer layers than their pretrained baselines, while maintaining comparable accuracy, achieving higher throughput, and utilizing fewer parameters. Furthermore, drawing on spectral complexity analysis, we derive sharper theoretical guarantees for generalization, showing that our approach can in principle offer tighter bounds than standard deep networks. Numerical experiments confirm the effectiveness and efficiency of the proposed approach.
Paperid:1668
Authors:Itay Yona · Jamie Hayes · Ilia Shumailov · Yossi Gandelsman
Abstract:
Large Language Models (LLMs), despite their impressive capabilities, often fail to accurately repeat a single word when instructed to do so (Barbero et al., 2024). This study links this unexpected behavior to ``attention sinks'', a mechanism previously identified as crucial for fluency in LLMs, where the initial token in a sequence receives disproportionately high attention. We pinpoint the specific neural circuit responsible for attention sinks: the first attention layer marks the initial token, and a later MLP neuron adds high-magnitude values to its hidden state, creating the sink. This circuit, consistently observed across various LLMs, is disrupted when processing long repetitions, inducing undesired sinks for each repetition, eventually causing the model to drift from its given instructions. In some cases, this divergence leads to the unintended extraction of training data (Nasr et al., 2023). Based on this understanding, we identify a broader set of non-repeating sequences that trigger the same divergent behavior. Furthermore, we develop a targeted patch that effectively rectifies the issue without impacting overall model performance on other tasks. Our findings highlight the importance of understanding the mechanisms behind well-known issues with LLMs. By leveraging mechanistic interpretability techniques, we are able to track down the root cause of the repeated token phenomenon, and ultimately design a fix for this issue. As far as we are aware, this is the first use of mechanistic interpretability to understand and fix a well-known privacy issue in LLMs.
Paperid:1669
Authors:Zijing Hu · Fengda Zhang · Kun Kuang
Abstract:
The practical applications of diffusion models have been limited by the misalignment between generated images and corresponding text prompts. Recent studies have introduced direct preference optimization (DPO) to enhance the alignment of these models. However, the effectiveness of DPO is constrained by the issue of visual inconsistency, where the significant visual disparity between well-aligned and poorly-aligned images prevents diffusion models from identifying which factors contribute positively to alignment during fine-tuning. To address this issue, this paper introduces D-Fusion, a method to construct DPO-trainable visually consistent samples. On the one hand, by performing mask-guided self-attention fusion, the resulting images are not only well-aligned, but also visually consistent with given poorly-aligned images. On the other hand, D-Fusion can retain the denoising trajectories of the resulting images, which are essential for DPO training. Extensive experiments demonstrate the effectiveness of D-Fusion in improving prompt-image alignment when applied to different reinforcement learning algorithms.
Paperid:1670
Authors:Jing-Jing Li · Valentina Pyatkin · Max Kleiman-Weiner · Liwei Jiang · Nouha Dziri · Anne Collins · Jana Schaich Borg · Maarten Sap · Yejin Choi · Sydney Levine
Abstract:
The ideal AI safety moderation system would be both structurally interpretable (so its decisions can be reliably explained) and steerable (to align to safety standards and reflect a community's values), which current systems fall short on. To address this gap, we present SafetyAnalyst, a novel AI safety moderation framework. Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences by creating a structured "harm-benefit tree," which enumerates harmful and beneficial actions and effects the AI behavior may lead to, along with likelihood, severity, and immediacy labels that describe potential impact on any stakeholders. SafetyAnalyst then aggregates all harmful and beneficial effects into a harmfulness score using fully interpretable weight parameters, which can be aligned to particular safety preferences. We applied this conceptual framework to develop, test, and release an open-source LLM prompt safety classification system, distilled from 18.5 million harm-benefit features generated by frontier LLMs on 19k prompts. On a comprehensive set of prompt safety benchmarks, we show that SafetyAnalyst (average F1=0.81) outperforms existing LLM safety moderation systems (average F1$<$0.72) on prompt safety classification, while offering the additional advantages of interpretability, transparency, and steerability.
Paperid:1671
Authors:ning wang · Zekun Li · Man Yao · Zhen Qin · Tongxin Bai · Guoqi Li
Abstract:
Linear attention offers the advantages of linear inference time and fixed memory usage compared to Softmax attention. However, training large-scale language models with linear attention from scratch remains prohibitively expensive and exhibits significant performance gaps compared to Softmax-based models. To address these challenges, we focus on transforming pre-trained Softmax-based language models into linear attention models. We unify mainstream linear attention methods using a \textbf{high-order QK integration theory} and a \textbf{multi-level vocabulary decomposition}. Specifically, the QK integration theory explains the efficacy of combining linear and sparse attention from the perspective of information collection across different frequency bands. The multi-level vocabulary decomposition exponentially expands memory capacity by recursively exploiting compression loss from compressed states. Through detailed error analysis, we demonstrate the superior approximation of Softmax attention achieved by our approach. To further improve performance and reduce training costs, we adopt a \textbf{soft integration strategy} with attention scores, effectively combining a sliding window mechanism. With less than 100M tokens, our method fine-tunes models to achieve linear complexity while retaining 99\% of their original performance. Compared to state-of-the-art linear attention models and methods, our approach improves MMLU scores by 1.2 percentage points with minimal fine-tuning. Furthermore, even without the sliding window mechanism, our method achieves state-of-the-art performance on all test sets with 10B tokens.
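The complexity claim in the first sentence comes from the standard kernelized form of linear attention: with a nonnegative feature map, the key-value summary is built once and reused by every query. A hedged numpy sketch (non-causal, with an arbitrary ReLU feature map; this is the generic formulation, not the paper's QK integration scheme):

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Kernelized attention: O(T * d * d_v) time, fixed-size state,
    versus the O(T^2) pairwise scores of Softmax attention."""
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                 # (d, d_v) summary, built once for all queries
    Z = Kf.sum(axis=0)            # normalizer, also query-independent
    return (Qf @ KV) / (Qf @ Z)[:, None]

rng = np.random.default_rng(0)
T, d, dv = 12, 4, 3
Q, K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=(T, dv))
out = linear_attention(Q, K, V)
```

Because the feature map is strictly positive, each output row is a convex combination of value rows, mirroring what the softmax normalization guarantees.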
Paperid:1672
Authors:Yalan Qin · Guorui Feng · Xinpeng Zhang
Abstract:
Multi-view clustering aims to improve the final performance by taking advantage of the complementary and consistent information of all views. In the real world, data samples with partially available information are common, and the issue of clustering incomplete multi-view data inevitably arises. To deal with partial data at large scales, some fast clustering approaches for incomplete multi-view data have been presented. Despite their significant success, few of these methods pay attention to learning high-quality anchors in a unified framework for incomplete multi-view clustering while ensuring scalability to large-scale incomplete datasets. In addition, most existing approaches for incomplete multi-view clustering neglect to build the relation between the anchor graph and the similarity matrix in symmetric nonnegative matrix factorization, and instead directly conduct graph partition based on the anchor graph to reduce space and time consumption. In this paper, we propose a novel fast incomplete multi-view clustering method for large-scale data, termed Fast Incomplete Multi-view clustering by flexible anchor Learning (FIML), where graph construction, anchor learning and graph partition are simultaneously integrated into a unified framework for fast incomplete multi-view clustering. To be specific, we learn a shared anchor graph to guarantee consistency among multiple views and employ an adaptive weight coefficient to balance the impact of each view. The relation between the anchor graph and the similarity matrix in symmetric nonnegative matrix factorization can also be built, i.e., each entry in the anchor graph can characterize the similarity between the anchor and the original data sample. We then adopt an alternating algorithm for solving the formulated problem. Experiments conducted on different datasets confirm the superiority of FIML compared with other clustering methods for incomplete multi-view data.
Paperid:1673
Authors:Junhao Dong · Piotr Koniusz · Yifei Zhang · Hao Zhu · Weiming Liu · Xinghua Qu · Yew Soon ONG
Abstract:
Vision-Language Models (VLMs), e.g., CLIP, excel at zero-shot classification but are vulnerable to adversarial examples, posing security risks. Adversarial fine-tuning robustifies zero-shot models by aligning prediction scores $g\_\theta(\cdot)$ of individual adversaries $x+\delta\_x$ with their clean counterparts $x$ by minimizing $\min\limits\_{\theta}\lVert g\_\theta(x+\delta_x)-g\_\theta(x)\rVert_2$ w.r.t. parameters $\theta$, which overlooks intermediate adversarial samples along the adversarial trajectory, the path crossing the decision boundary. Wang and Lu showed that intermediate adversaries and their vicinity offer quality adversaries. Gao et al. suggested sampling adversarial candidates from 2D simplices formed by joining vertex $x$ with consecutive vertices $x+\delta\_{x,i}$ and $x+\delta\_{x,{i+1}}$ on the adversarial trajectory. However, sampling 2D simplices is prohibitively costly, and each such sample has to be fed into $\min\limits\_{\theta}\underbrace{{\scriptstyle\frac{1}{\kappa}}\sum\nolimits\_{\delta\_x\in\Delta\_X}\lVert g\_\theta(x+\delta\_x)-g\_\theta(x)\rVert\_2}\_{\Omega}$ where cardinality $\kappa=\lvert\Delta_X\rvert$ means the number of samples grows $\kappa$-fold. To train a robust VLM, we overcome these limitations by a Taylor expansion of $g\_\theta(x+\delta\_x)$ around $x$ and formulating an upper bound of the approximated alignment loss $\Omega$ that depends on the closed-form $\Sigma\_x$ instead of a costly second-order matrix $\hat{\Sigma}\_x={\scriptstyle\frac{1}{\kappa}}\sum\nolimits\_{\delta\_x\in\Delta_X}\delta\_x\delta\_x^T$, and the Jacobian/Hessian $\big(J\_{g}(x), (H\_{g}(x))\_c\big)$ of $g\_\theta(\cdot)$. As regions between clean and intermediate adversarial samples capture a larger decision landscape, we robustify the VLM by plausible adversaries on simplices via our closed-form formulations, equivalent to infinite uniform sampling. We obtain state-of-the-art results with a $10\times$ speed-up on 15 datasets.
Paperid:1674
Authors:Kun Wang · Sumanth Varambally · Duncan Watson-Parris · Yian Ma · Rose Yu
Abstract:
Many important phenomena in scientific fields such as climate, neuroscience, and epidemiology are naturally represented as spatiotemporal gridded data with complex interactions. Inferring causal relationships from these data is a challenging problem compounded by the high dimensionality of such data and the correlations between spatially proximate points. We present SPACY (SPAtiotemporal Causal discoverY), a novel framework based on variational inference, designed to model latent time series and their causal relationships from spatiotemporal data. SPACY alleviates the high-dimensional challenge by discovering causal structures in the latent space. To aggregate spatially proximate, correlated grid points, we use spatial kernels to map observational time series to latent representations. Theoretically, we generalize the problem to a continuous spatial domain, and establish the identifiability of our model under minimal assumptions about the latent generation process. Notably, we do not assume the absence of instantaneous effects or require conditions of sufficient variability, which are hard to verify in practice. Empirically, SPACY outperforms state-of-the-art baselines on synthetic data, even in challenging settings where existing methods struggle, while remaining scalable for large grids. SPACY also identifies key known phenomena from real-world climate data.
Paperid:1675
Authors:Shuowei Jin · Xueshen Liu · Qingzhao Zhang · Zhuoqing Morley Mao
Abstract:
Large Language Models (LLMs) are increasingly deployed in large-scale online services, enabling sophisticated applications. However, the computational overhead of generating key-value (KV) caches in the prefill stage presents a major bottleneck, particularly for long-context inputs. Prefix caching mitigates this issue by storing KV caches for reuse, reducing redundant computation. Despite its advantages, prefix caching suffers from high latency due to the limited I/O bandwidth of storage devices, constraining inference efficiency. To address this challenge, we introduce Cake, a novel KV cache loading system that optimally utilizes both computational and I/O resources in parallel. Cake employs a bidirectional scheduling strategy that dynamically balances KV cache computation and loading, ensuring efficient resource utilization. Additionally, Cake incorporates an adaptive scheduling mechanism that seamlessly integrates with non-prefix caching requests, improving system throughput and adapting to fluctuating resource availability. Through extensive evaluations across various hardware configurations, datasets, and storage conditions, Cake achieves an average 2.6× reduction in Time to First Token (TTFT) compared to compute-only and I/O-only methods. Our findings highlight Cake as an effective and practical solution for optimizing long-context LLM inference, bridging the gap between computation and I/O efficiency in large-scale AI deployments.
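The bidirectional idea, storage streaming cached KV chunks while the GPU recomputes others in parallel, can be caricatured as choosing the split that balances the two streams. A toy planner under the simplifying assumption of uniform per-chunk costs (the function and its cost model are hypothetical illustrations, not Cake's actual scheduler):

```python
def plan_chunks(n_chunks, t_compute, t_load):
    """Load chunks [0, split) from storage while recomputing the rest on GPU;
    the two streams run in parallel, so finish time is the max of the two."""
    best_split, best_finish = 0, n_chunks * t_compute
    for split in range(n_chunks + 1):
        finish = max(split * t_load, (n_chunks - split) * t_compute)
        if finish < best_finish:
            best_split, best_finish = split, finish
    return best_split, best_finish
```

With equal costs the optimum splits the work in half; when loading is twice as slow, more chunks go to the compute stream, which is the kind of dynamic balance the abstract describes.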
Paperid:1676
Authors:Gabriele Visentin · Patrick Cheridito
Abstract:
We present a novel method for efficiently computing optimal transport maps and Wasserstein barycenters in high-dimensional spaces. Our approach uses conditional normalizing flows to approximate the input distributions as invertible pushforward transformations from a common latent space. This makes it possible to directly solve the primal problem using gradient-based minimization of the transport cost, unlike previous methods that rely on dual formulations and complex adversarial optimization. We show how this approach can be extended to compute Wasserstein barycenters by solving a conditional variance minimization problem. A key advantage of our conditional architecture is that it enables the computation of barycenters for hundreds of input distributions, which was computationally infeasible with previous methods. Our numerical experiments illustrate that our approach yields accurate results across various high-dimensional tasks and compares favorably to previous state-of-the-art methods.
Paperid:1677
Authors:Aadirupa Saha · Tomer Koren · Yishay Mansour
Abstract:
We address the problem of convex optimization with dueling feedback, where the goal is to minimize a convex function given a weaker form of \emph{dueling} feedback. Each query consists of two points and the dueling feedback returns a (noisy) single-bit binary comparison of the function values of the two queried points. The translation of the function values to the single comparison bit is through a \emph{transfer function}. This problem has been addressed previously for some restricted classes of transfer functions, but here we consider a very general transfer function class which includes all functions that admit a series expansion about the origin. Our main contribution is an efficient algorithm with a convergence rate of $\smash{O}(\epsilon^{-4p})$ for smooth convex functions, and an optimal rate of $\smash{\widetilde O}(\epsilon^{-2p})$ when the objective is both smooth and strongly convex, where $p$ is the minimal degree (with a non-zero coefficient) in the transfer's series expansion about the origin.
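The feedback model is easy to state concretely: each query reveals only a noisy bit whose bias is the transfer function applied to the value gap. A small simulation with a logistic transfer (one illustrative choice; the abstract covers any transfer admitting a series expansion about the origin, and the toy objective is an arbitrary assumption):

```python
import math
import random

def dueling_oracle(f, transfer, x, y, rng):
    """One noisy bit: returns 1 with probability transfer(f(x) - f(y))."""
    return 1 if rng.random() < transfer(f(x) - f(y)) else 0

logistic = lambda t: 1.0 / (1.0 + math.exp(-t))
f = lambda x: (x - 2.0) ** 2          # toy smooth, strongly convex objective

rng = random.Random(0)
bits = [dueling_oracle(f, logistic, 0.0, 2.0, rng) for _ in range(2000)]
freq = sum(bits) / len(bits)          # estimates logistic(f(0) - f(2)) = logistic(4)
```

Repeating the same query lets a learner estimate the comparison bias, which is how a single bit per query can still drive a convergent optimization procedure.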
Paperid:1678
Authors:Xingxing Yang · Jie Chen · Zaifeng Yang
Abstract:
This paper presents a novel approach to hyperspectral image (HSI) reconstruction from RGB images, addressing fundamental limitations in existing learning-based methods from a physical perspective. We discuss and aim to address the ``colorimetric dilemma": failure to consistently reproduce ground-truth RGB from predicted HSI, thereby compromising physical integrity and reliability in practical applications. To tackle this issue, we propose PhySpec, a physically consistent framework for robust HSI reconstruction. Our approach fundamentally exploits the intrinsic physical relationship between HSIs and corresponding RGBs by employing orthogonal subspace decomposition, which enables explicit estimation of camera spectral sensitivity (CSS). This ensures that our reconstructed spectra align with well-established physical principles, enhancing their reliability and fidelity. Moreover, to efficiently use internal information from test samples, we propose a self-supervised meta-auxiliary learning (MAXL) strategy that rapidly adapts the trained parameters to unseen samples using only a few gradient descent steps at test time, while simultaneously constraining the generated HSIs to accurately recover ground-truth RGB values. Thus, MAXL reinforces the physical integrity of the reconstruction process. Extensive qualitative and quantitative evaluations validate the efficacy of our proposed framework, showing superior performance compared to SOTA methods.
Paperid:1679
Authors:Zhenglin Zhou · Xiaobo Xia · Fan Ma · Hehe Fan · Yi Yang · Tat-Seng Chua
Abstract:
Text-to-3D generation automates 3D content creation from textual descriptions, which offers transformative potential across various fields. However, existing methods often struggle to align generated content with human preferences, limiting their applicability and flexibility. To address these limitations, in this paper, we propose DreamDPO, an optimization-based framework that integrates human preferences into the 3D generation process, through direct preference optimization. Practically, DreamDPO first constructs pairwise examples, then validates their alignment with human preferences using reward or large multimodal models, and lastly optimizes the 3D representation with a preference-driven loss function. By leveraging relative preferences, DreamDPO reduces reliance on precise quality evaluations while enabling fine-grained controllability through preference-guided optimization. Experiments demonstrate that DreamDPO achieves state-of-the-art results, and provides higher-quality and more controllable 3D content compared to existing methods. The code and models will be open-sourced.
Paperid:1680
Authors:Martino Bernasconi · Andrea Celli · Riccardo Colini Baldeschi · Federico Fusco · Stefano Leonardi · Matteo Russo
Abstract:
In the random-order model for online learning, the sequence of losses is chosen upfront by an adversary and presented to the learner after a random permutation. Any random-order input is asymptotically equivalent to a stochastic i.i.d.~one, but, for finite times, it may exhibit significant non-stationarity, which can hinder the performance of stochastic learning algorithms. While algorithms for adversarial inputs naturally maintain their regret guarantees in random order, simple no-regret algorithms exist for the stochastic model that fail against random-order instances. In this paper, we propose a general procedure to adapt stochastic learning algorithms to the random-order model without substantially affecting their regret guarantees. This allows us to recover improved regret bounds for prediction with delays, bandits with switching costs, and online learning with constraints. Finally, we investigate online classification and prove that, in random order, learnability is characterized by the VC dimension rather than by the Littlestone dimension, thus providing a further separation from the general adversarial model.
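The model itself is simple to simulate: the adversary commits to a loss multiset upfront, and the learner receives a uniformly random permutation of it. A sketch of why this behaves i.i.d.-like on average (the specific loss values are arbitrary assumptions for illustration):

```python
import random

def random_order_stream(losses, seed):
    """Adversary fixes the multiset upfront; learner sees a random order."""
    order = list(losses)
    random.Random(seed).shuffle(order)
    return order

losses = [0.0] * 50 + [1.0] * 50        # highly non-stationary if presented sorted
overall_mean = sum(losses) / len(losses)

# Each position of a random-order stream is a uniform draw from the multiset,
# so e.g. the first loss, averaged over many permutations, matches the overall mean.
first_avg = sum(random_order_stream(losses, s)[0] for s in range(400)) / 400
```

Any single permutation, however, can still place all the 1.0 losses early, which is the finite-time non-stationarity the abstract warns can break purely stochastic algorithms.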
Paperid:1681
Authors:Zhengming Chen · Yewei Xia · Feng Xie · Jie Qiao · Zhifeng Hao · Ruichu Cai · Kun Zhang
Abstract:
We study the problem of learning discrete latent variable causal structures from mixed-type observational data. Traditional methods, such as those based on the tensor rank condition, are designed to identify discrete latent structure models and provide robust identification bounds for discrete causal models. However, when observed variables—specifically, those representing the children of latent variables—are collected at various levels with continuous data types, the tensor rank condition is not applicable, limiting further causal structure learning for latent variables. In this paper, we consider a more general case where observed variables can be either continuous or discrete, and further allow for scenarios where multiple latent parents cause the same set of observed variables. We show that, under the completeness condition, it is possible to discretize the data in a way that satisfies the full-rank assumption required by the tensor rank condition. This enables the identifiability of discrete latent structure models within mixed-type observational data. Moreover, we introduce the two-sufficient measurement condition, a more general structural assumption under which the tensor rank condition holds and the underlying latent causal structure is identifiable by a proposed two-stage identification algorithm. Extensive experiments on both simulated and real-world data validate the effectiveness of our method.
Paperid:1682
Authors:Chen Hao Xia · Manasa Kaniselvan · Alexandros Nikolaos Ziogas · Marko Mladenovic · Rayen Mahjoub · Alexander Maeder · Mathieu Luisier
Abstract:
Graph neural networks (GNNs) have shown promise in learning the ground-state electronic properties of materials, subverting *ab initio* density functional theory (DFT) calculations when the underlying lattices can be represented as small and/or repeatable unit cells (i.e., molecules and periodic crystals). Realistic systems are, however, non-ideal and generally characterized by higher structural complexity. As such, they require large (10+ Å) unit cells and thousands of atoms to be accurately described. At these scales, DFT becomes computationally prohibitive, making GNNs especially attractive. In this work, we present a strictly local equivariant GNN capable of learning the electronic Hamiltonian (**H**) of realistically extended materials. It incorporates an *augmented partitioning* approach that enables training on arbitrarily large structures while preserving local atomic environments beyond boundaries. We demonstrate its capabilities by predicting the electronic Hamiltonian of various systems with up to 3,000 nodes (atoms), 500,000+ edges, ~ 28 million orbital interactions (nonzero entries of **H**), and $\leq$0.55\% error in the eigenvalue spectra. Our work expands the applicability of current electronic property prediction methods to some of the most challenging cases encountered in computational materials science, namely systems with disorder, interfaces, and defects.
Paperid:1683
Authors:Oscar Skean · Md Rifat Arefin · Dan Zhao · Niket Patel · Jalal Naghiyev · Yann LeCun · Ravid Shwartz-Ziv
Abstract:
From extracting features to generating text, the outputs of large language models (LLMs) typically rely on their final layers, following the conventional wisdom that earlier layers capture only low-level cues. However, our analysis shows that intermediate layers can encode even richer representations, often improving performance on a wide range of downstream tasks. To explain and quantify these hidden-layer properties, we propose a unified framework of representation quality metrics based on information theory, geometry, and invariance to input perturbations. Our framework highlights how each model layer balances information compression and signal preservation, revealing why mid-depth embeddings can exceed the last layer’s performance. Through extensive experiments on 32 text-embedding tasks and comparisons across model architectures (transformers, state-space models) and domains (language, vision), we demonstrate that intermediate layers consistently provide stronger features. These findings challenge the standard focus on final-layer embeddings and open new directions for model analysis and optimization, including strategic use of mid-layer representations for more robust and accurate AI systems.
Paperid:1684
Authors:Yujun Shi · Jun Hao Liew · Hanshu YAN · Vincent Tan · Jiashi Feng
Abstract:
Accuracy and speed are critical in image editing tasks. Pan et al. introduced a drag-based framework using Generative Adversarial Networks, and subsequent studies have leveraged large-scale diffusion models. However, these methods often require over a minute per edit and exhibit low success rates. We present LightningDrag, which achieves high-quality drag-based editing in about one second on general images. By redefining drag-based editing as a conditional generation task, we eliminate the need for time-consuming latent optimization or gradient-based guidance. Our model is trained on large-scale paired video frames, capturing diverse motion (object translations, pose shifts, zooming, etc.) to significantly improve accuracy and consistency. Despite being trained only on videos, our model generalizes to local deformations beyond the training data (e.g., lengthening hair, twisting rainbows). Extensive evaluations confirm the superiority of our approach, and we will release both code and model.
Paperid:1685
Authors:Mohamad Chehade · Wenting Li · Brian Bell · Russell Bent · Saif Kazi · Hao Zhu
Abstract:
The robustness of neural networks is crucial in safety-critical applications, where identifying a reliable input space is essential for effective model selection, robustness evaluation, and the development of reliable control strategies. Most existing robustness verification methods assess the worst-case output under the assumption that the input space is known. However, precisely identifying a verifiable input space $ \mathcal{C} $, where no adversarial examples exist, is challenging due to the possible high dimensionality, discontinuity, and non-convex nature of the input space. To address this challenge, we propose a novel framework, **LEVIS**, comprising **LEVIS-$\alpha$** and **LEVIS-$\beta$**. **LEVIS-$\alpha$** identifies a single, large verifiable ball that intersects at least two boundaries of a bounded region $ \mathcal{C} $, while **LEVIS-$\beta$** systematically captures the entirety of the verifiable space by integrating multiple verifiable balls. Our contributions are fourfold: we introduce a verification framework, **LEVIS**, incorporating two optimization techniques for computing nearest and directional adversarial points based on mixed-integer programming (MIP); to enhance scalability, we integrate complementary constrained (CC) optimization with a reduced MIP formulation, achieving up to a 17-fold reduction in runtime by approximating the verifiable region in a principled way; we provide a theoretical analysis characterizing the properties of the verifiable balls obtained through **LEVIS-$\alpha$**; and we validate our approach across diverse applications, including electrical power flow regression and image classification, demonstrating performance improvements and visualizing the geometric properties of the verifiable region.
Paperid:1686
Authors:Kevin Xia · Elias Bareinboim
Abstract:
The study of causal abstractions bridges two integral components of human intelligence: the ability to determine cause and effect, and the ability to interpret complex patterns into abstract concepts. Formally, causal abstraction frameworks define connections between complicated low-level causal models and simple high-level ones. One major limitation of most existing definitions is that they are not well-defined when considering lossy abstraction functions in which multiple low-level interventions can have different effects while mapping to the same high-level intervention (an assumption called the abstract invariance condition). In this paper, we introduce a new type of abstractions called projected abstractions that generalize existing definitions to accommodate lossy representations. We show how to construct a projected abstraction from the low-level model and how it translates equivalent observational, interventional, and counterfactual causal queries from low- to high-level. Given that the true model is rarely available in practice, we prove a new graphical criterion for identifying and estimating high-level causal queries from limited low-level data. Finally, we experimentally show the effectiveness of projected abstraction models in high-dimensional image settings.
Paperid:1687
Authors:Chen Zeno · Hila Manor · Gregory Ongie · Nir Weinberger · Tomer Michaeli · Daniel Soudry
Abstract:
While diffusion models generate high-quality images via probability flow, the theoretical understanding of this process remains incomplete. A key question is when probability flow converges to training samples or more general points on the data manifold. We analyze this by studying the probability flow of shallow ReLU neural network denoisers trained with minimal $\ell^2$ norm. For intuition, we introduce a simpler score flow and show that for orthogonal datasets, both flows follow similar trajectories, converging to a training point or a sum of training points. However, early stopping by the scheduler allows probability flow to reach more general manifold points. This aligns with observations that diffusion models memorize and reproduce training samples while also combining elements to form a “semantic sum.” We extend these results to obtuse simplex data and, through simulations in the orthogonal case, confirm that probability flow converges to a training point, a sum of training points, or a manifold point. Moreover, memorization decreases when the number of training samples grows, as fewer samples accumulate near training points.
Paperid:1688
Authors:Chenrui Wang · Yixuan Qiu
Abstract:
The entropic-regularized optimal transport (OT) has gained massive attention in machine learning due to its ability to provide scalable solutions for OT-based tasks. However, most of the existing algorithms, including the Sinkhorn algorithm and its extensions, behave like first-order optimization methods, thus suffering from relatively slow convergence. More recently, some second-order methods have been proposed based on the idea of Hessian sparsification. Despite their promising results, they have two major issues: first, there is limited theoretical understanding on the effect of sparsification; second, in cases where the transport plan is dense, Hessian sparsification does not perform well. In this paper, we propose a new quasi-Newton method to address these problems. First, we develop new theoretical analyses to understand the benefits of Hessian sparsification, which lays the foundation for highly flexible sparsification schemes. Then we introduce an additional low-rank term in the approximate Hessian to better handle the dense case. Finally, convergence properties of the proposed algorithm are rigorously analyzed, and various numerical experiments are conducted to demonstrate its improved performance in solving large-scale OT problems.
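For context, the first-order baseline this abstract improves on is the Sinkhorn iteration, which alternately rescales the rows and columns of the Gibbs kernel until both marginals match. A minimal textbook implementation (a standard version, not the proposed quasi-Newton method; the problem instance is arbitrary):

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.5, iters=500):
    """Entropic OT via alternating scalings of K = exp(-C / eps).
    Each sweep only does matrix-vector products, hence first-order behavior."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]       # approximate transport plan

rng = np.random.default_rng(0)
n = 5
C = rng.random((n, n))                       # toy cost matrix
a = b = np.full(n, 1.0 / n)                  # uniform marginals
P = sinkhorn(C, a, b)
```

The returned plan is nonnegative with (approximately) the prescribed row and column sums, and the slow, first-order convergence of these alternating scalings is exactly what motivates the second-order methods the paper studies.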
Paperid:1689
Authors:Benson Chen · Tomasz Danel · Gabriel Dreiman · Patrick McEnaney · Nikhil Jain · Kirill Novikov · Spurti Akki · Joshua L. Turnbull · Virja Pandya · Boris Belotserkovskii · Jared Weaver · Ankita Biswas · Dat Nguyen · Kent Gorday · Mohammad M Sultan · Nathaniel Stanley · Daniel Whalen · Divya Kanichar · Christoph Klein · Emily Fox · R. Watts
Abstract:
DNA-Encoded Libraries (DELs) represent a transformative technology in drug discovery, facilitating the high-throughput exploration of vast chemical spaces. Despite their potential, the scarcity of publicly available DEL datasets presents a bottleneck for the advancement of machine learning methodologies in this domain. To address this gap, we introduce KinDEL, one of the largest publicly accessible DEL datasets and the first one that includes binding poses from molecular docking experiments. Focused on two kinases, Mitogen-Activated Protein Kinase 14 (MAPK14) and Discoidin Domain Receptor Tyrosine Kinase 1 (DDR1), KinDEL includes 81 million compounds, offering a rich resource for computational exploration. Additionally, we provide comprehensive biophysical assay validation data, encompassing both on-DNA and off-DNA measurements, which we use to evaluate a suite of machine learning techniques, including novel structure-based probabilistic models. We hope that our benchmark, encompassing both 2D and 3D structures, will help advance the development of machine learning models for data-driven hit identification using DELs.
Paperid:1690
Authors:Rogerio Bonatti · Dan Zhao · Francesco Bonacci · Dillon Dupont · Sara Abdali · Yinheng Li · Yadong Lu · Justin Wagle · Kazuhito Koishida · Arthur Bucker · Lawrence Jang · Zheng Hui
Abstract:
Capable large language models (LLMs) show potential as computer agents, enhancing human productivity and software accessibility in multimodal tasks. However, measuring agent performance in sufficiently realistic and complex environments becomes increasingly challenging as: (i) most benchmarks are limited to specific modalities/domains (e.g., text-only, web navigation, Q\&A) and (ii) full benchmark evaluations are slow (on the order of magnitude of multiple hours or days) given the multi-step sequential nature of tasks. To address these challenges, we introduce Windows Agent Arena: a reproducible, general environment focusing exclusively on the Windows operating system (OS) where agents can operate freely within a real Windows OS and use the same wide range of applications, tools, and web browsers available to human users when solving tasks. We create 150+ diverse tasks across representative domains that require agent abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized for a full benchmark evaluation in as little as $20$ minutes. Our work not only speeds up the development and evaluation of multi-modal agents, but also highlights and analyzes existing shortfalls in the agentic abilities of several multimodal LLMs as agents within the Windows computing environment---with the best achieving only a 19.5\% success rate compared to a human success rate of 74.5\%.
Paperid:1691
Authors:Haoyue Dai · Yiwen Qiu · Ignavier Ng · Xinshuai Dong · Peter Spirtes · Kun Zhang
Abstract:
Addressing selection bias in latent variable causal discovery is important yet underexplored, largely due to a lack of suitable statistical tools: While various tools beyond basic conditional independencies have been developed to handle latent variables, none have been adapted for selection bias. We make an attempt by studying rank constraints, which, as a generalization of conditional independence constraints, exploit the ranks of covariance submatrices in linear Gaussian models. We show that although selection can significantly complicate the joint distribution, interestingly, the ranks in the biased covariance matrices still preserve meaningful information about both causal structures and selection mechanisms. We provide a graph-theoretic characterization of such rank constraints. Using this tool, we demonstrate that the one-factor model, a classical latent variable model, can be identified under selection bias. Simulations and real-world experiments confirm the effectiveness of using our rank constraints.
Paperid:1692
Authors:Hanglei Hu · Yingying Guo · Zhikang Chen · Sen Cui · Fei Wu · Kun Kuang · Min Zhang · Bo Jiang
Abstract:
Personalized learning, especially data-based methods, has garnered widespread attention in recent years, aiming to meet individual student needs. However, many works rely on the implicit assumption that benchmarks are high-quality and well-annotated, which limits their practical applicability. In real-world scenarios, these benchmarks often exhibit long-tail distributions, significantly impacting model performance. To address this challenge, we propose a novel method called Neural-Collapse-Advanced personalized Learning (NCAL), designed to learn features that conform to the same simplex equiangular tight frame (ETF) structure. NCAL introduces Text-modality Collapse (TC) regularization to optimize the distribution of text embeddings within the large language model (LLM) representation space. Notably, NCAL is model-agnostic, making it compatible with various architectures and approaches, thereby ensuring broad applicability. Extensive experiments demonstrate that NCAL effectively enhances existing works, achieving new state-of-the-art performance. Additionally, NCAL mitigates class imbalance, significantly improving the model’s generalization ability.
Paperid:1693
Authors:Liyao Lyu · Jingrun Chen
Abstract:
We propose a gradient-free deep reinforcement learning algorithm to solve high-dimensional, finite-horizon stochastic control problems. Although the recently developed deep reinforcement learning framework has achieved great success in solving these problems, direct estimation of policy gradients from Monte Carlo sampling often suffers from high variance. To address this, we introduce the Momentum Consensus-Based Optimization (M-CBO) and Adaptive Momentum Consensus-Based Optimization (Adam-CBO) frameworks. These methods optimize policies using Monte Carlo estimates of the value function, rather than its gradients. Adjustable Gaussian noise supports efficient exploration, helping the algorithm converge to optimal policies in complex, nonconvex environments. Numerical results confirm the accuracy and scalability of our approach across various problem dimensions and show the potential for extension to mean-field control problems. Theoretically, we prove that M-CBO can converge to the optimal policy under some assumptions.
Paperid:1694
Authors:Vignav Ramesh · Kenneth Li
Abstract:
Communication between multiple language model (LM) agents has been shown to scale up the reasoning ability of LMs. While natural language has been the dominant medium for inter-LM communication, it is not obvious this should be the standard: not only does natural language communication incur high inference costs that scale quickly with the number of both agents and messages, but also the decoding process abstracts away too much rich information that could otherwise be accessed from the internal activations. In this work, we propose a simple technique whereby LMs communicate via *activations*; concretely, we pause an LM $B$'s computation at an intermediate layer, combine its current activation with another LM $A$'s intermediate activation via some function $f$, then pass $f$'s output into the next layer of $B$ and continue the forward pass until decoding is complete. This approach scales up LMs on new tasks with *zero* additional parameters and data, and saves a *substantial amount of compute* over natural language communication. We test our method with various functional forms $f$ on two experimental setups—multi-player coordination games and reasoning benchmarks—and find that it achieves up to $27$% improvement over natural language communication across datasets with $<$$1/4$ the compute, illustrating the superiority and robustness of activations as an alternative "language" for communication between LMs.
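The activation-combination step described above can be sketched with toy models. The snippet below is a minimal illustration under assumed details (random linear "layers" stand in for the two LMs, and the element-wise mean is one possible choice of $f$); it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for two LMs: each "model" is a stack of linear layers.
# All names (layers_a, layers_b, combine) are illustrative, not from the paper.
DIM = 8
layers_a = [rng.standard_normal((DIM, DIM)) * 0.1 for _ in range(4)]
layers_b = [rng.standard_normal((DIM, DIM)) * 0.1 for _ in range(4)]

def run_layers(layers, h):
    for w in layers:
        h = np.tanh(h @ w)
    return h

def combine(h_b, h_a):
    # One possible choice of f: the element-wise mean of the two activations.
    return 0.5 * (h_b + h_a)

x = rng.standard_normal(DIM)
k = 2  # intermediate layer at which B's forward pass is paused

h_a = run_layers(layers_a[:k], x)      # A's intermediate activation
h_b = run_layers(layers_b[:k], x)      # B's computation, paused at layer k
h_b = combine(h_b, h_a)                # inject A's activation via f
out = run_layers(layers_b[k:], h_b)    # resume B's forward pass

print(out.shape)  # (8,)
```

In a real transformer setting, $f$ would act on hidden states captured at matching layers, e.g., via forward hooks.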
Paperid:1695
Authors:Ziyang Yu · Wenbing Huang · Yang Liu
Abstract:
Molecular Dynamics (MD) simulations are essential for understanding the atomic-level behavior of molecular systems, giving insights into their transitions and interactions. However, classical MD techniques are limited by the trade-off between accuracy and efficiency, while recent deep learning-based improvements have mostly focused on single-domain molecules, lacking transferability to unfamiliar molecular systems. Therefore, we propose \textbf{Uni}fied \textbf{Sim}ulator (UniSim), which leverages cross-domain knowledge to enhance the understanding of atomic interactions. First, we employ a multi-task pretraining approach to learn a unified atomic representation model from a large and diverse set of molecular data. Then, based on the stochastic interpolant framework, we learn the state transition patterns over long timesteps from MD trajectories, and introduce a force guidance module for rapidly adapting to different chemical environments. Our experiments demonstrate that UniSim achieves highly competitive performance across small molecules, peptides, and proteins.
Paperid:1696
Authors:Max Hopkins · Shay Moran
Abstract:
Stability is a central property in learning and statistics, promising that the output of an algorithm $\mathcal{A}$ does not change substantially when applied to similar datasets $S$ and $S'$. It is an elementary fact that any sufficiently stable algorithm (e.g. one returning the same result with high probability, satisfying privacy guarantees, etc.) must be randomized. This raises a natural question: can we quantify *how much* randomness is needed for algorithmic stability? We study the randomness complexity of two influential notions of stability in learning: *replicability*, which promises $\mathcal{A}$ usually outputs the same result when run over samples from the same distribution (and shared random coins), and *differential privacy*, which promises the output distribution of $\mathcal{A}$ remains similar under neighboring datasets. The randomness complexity of these notions was studied recently in (Dixon, Pavan, Vander Woude, and Vinodchandran ICML 2024) and (Cannone, Su, and Vadhan ITCS 2024) for basic $d$-dimensional tasks (e.g. estimating the bias of $d$ coins), but little is known about the measures more generally or in complex settings like classification. Toward this end, we prove a "weak-to-strong" boosting theorem for stability: the randomness complexity of a task $\mathcal{M}$ (either under replicability or DP) is tightly controlled by the best replication probability of any *deterministic* algorithm solving $\mathcal{M}$, a weak measure called $\mathcal{M}$'s "global stability" that is universally capped at $\frac{1}{2}$ (Chase, Moran, Yehudayoff FOCS 2023). Using this connection, we characterize the randomness complexity of PAC Learning: a class has bounded randomness complexity iff it has finite Littlestone dimension, and moreover it scales at worst logarithmically in the excess error of the learner.
This resolves a question of (Chase, Chornomaz, Moran, and Yehudayoff STOC 2024) who asked for such a characterization in the equivalent language of `list-replicability'.
Paperid:1697
Authors:Jiahai Feng · Stuart Russell · Jacob Steinhardt
Abstract:
Pretrained language models (LMs) can generalize to implications of facts that they are finetuned on. For example, if finetuned on "John Doe lives in Tokyo," LMs correctly answer "What language do the people in John Doe's city speak?'' with "Japanese''. However, little is known about the mechanisms that enable this generalization or how they are learned during pretraining. We introduce extractive structures as a framework for describing how components in LMs (e.g., MLPs or attention heads) coordinate to enable this generalization. The structures consist of informative components that store training facts as weight changes, and upstream and downstream extractive components that query and process the stored information to produce the correct implication. We hypothesize that extractive structures are learned during pretraining when encountering implications of previously known facts. This yields two predictions: a data ordering effect, where extractive structures can be learned only if facts precede their implications, and a weight grafting effect, where extractive structures can be grafted to predict counterfactual implications. We empirically show these effects in the OLMo-7b, Llama 3-8b, Gemma 2-9b, and Qwen 2-7b models. Of independent interest, our results also indicate that fact learning can occur at both early and late layers, leading to different forms of generalization.
Paperid:1698
Authors:Haike Xu · Zongyu Lin · Kai-Wei Chang · Yizhou Sun · Piotr Indyk
Abstract:
Contradiction retrieval refers to identifying and extracting documents that explicitly disagree with or refute the content of a query, which is important to many downstream applications like fact checking and data cleaning. To retrieve arguments that contradict a query from large document corpora, existing methods such as similarity search and cross-encoder models exhibit different limitations. To address these challenges, we introduce a novel approach: SparseCL, which leverages specially trained sentence embeddings designed to preserve subtle, contradictory nuances between sentences. Our method utilizes a combined metric of cosine similarity and a sparsity function to efficiently identify and retrieve documents that contradict a given query. This approach dramatically enhances the speed of contradiction detection by reducing the need for exhaustive document comparisons to simple vector calculations. We conduct contradiction retrieval experiments on Arguana, MSMARCO, and HotpotQA, where our method produces an average improvement of $11.0\%$ across different models. We also validate our method on downstream tasks like natural language inference and cleaning corrupted corpora. This paper outlines a promising direction for non-similarity-based information retrieval, which is currently underexplored.
Paperid:1699
Authors:Jiashun Liu · Johan Obando Ceron · Pablo Samuel Castro · Aaron Courville · Ling Pan
Abstract:
Off-policy deep reinforcement learning (RL) agents typically leverage replay buffers for reusing past experiences during learning. This can help sample efficiency when the collected data is informative and aligned with the learning objectives; when that is not the case, it has the effect of ``polluting'' the replay buffer with data that can exacerbate optimization challenges, in addition to wasting environment interactions due to redundant sampling. We argue that sampling these uninformative and wasteful transitions can be avoided by addressing the **sunk cost fallacy**, which, in the context of deep RL, is the tendency towards continuing an episode until termination. To address this, we propose the *learn to stop* (**LEAST**) mechanism, which uses statistics based on $Q$-values and gradients to help agents recognize when to terminate unproductive episodes early. We demonstrate that our method improves learning efficiency on a variety of RL algorithms, evaluated on both the MuJoCo and DeepMind Control Suite benchmarks.
Paperid:1700
Authors:Ilias Diakonikolas · Daniel Kane · Sushrut Karmalkar · Sihan Liu · Thanasis Pittas
Abstract:
We study the task of list-decodable linear regression using batches, recently introduced by Das et al. (2023). In this setting, we are given $m$ batches with each batch containing $n$ points in $\mathbb R^d$. A batch is called clean if the points it contains are i.i.d. samples from an unknown linear regression distribution. For a parameter $\alpha \in (0, 1/2)$, an unknown $\alpha$-fraction of the batches are clean and no assumptions are made on the remaining batches. The goal is to output a small list of vectors at least one of which is close to the true regressor vector in $\ell_2$-norm. Das et al. (2023) gave an efficient algorithm for this task, under natural distributional assumptions, with the following guarantee. Under the assumption that the batch size satisfies $n \geq \tilde{\Omega}(\alpha^{-1})$ and the total number of batches is $m = \text{poly}(d, n, 1/\alpha)$, their algorithm runs in polynomial time and outputs a list of $O(1/\alpha^2)$ vectors at least one of which is $\tilde{O}(\alpha^{-1/2}/\sqrt{n})$ close to the target regressor. Here we design a new polynomial-time algorithm for this task with significantly stronger guarantees under the assumption that the low-degree moments of the covariates distribution are Sum-of-Squares (SoS) certifiably bounded. Specifically, for any constant $\delta>0$, as long as the batch size is $n \geq \Omega_{\delta}(\alpha^{-\delta})$ and the degree-$\Theta(1/\delta)$ moments of the covariates are SoS certifiably bounded, our algorithm uses $m = \text{poly}((dn)^{1/\delta}, 1/\alpha)$ batches, runs in polynomial time, and outputs an $O(1/\alpha)$-sized list of vectors one of which is $O(\alpha^{-\delta/2}/\sqrt{n})$ close to the target. That is, our algorithm substantially improves both the minimum batch size and the final error guarantee, while achieving the optimal list size.
Our approach leverages higher-order moment information by carefully combining the SoS paradigm interleaved with an iterative method and a novel list pruning procedure for this setting. In the process, we give an SoS proof of the Marcinkiewicz-Zygmund inequality that may be of broader applicability.
Paperid:1701
Authors:Nolan Koblischke · Hyunseok Jang · Kristen Menou · Mohamad Ali-Dib
Abstract:
Modern science emerged from reasoning over repeatedly observed planetary motions. We present Gravity-Bench-v1, an environment-based benchmark that challenges AI agents on tasks that parallel this historical development. Gravity-Bench-v1 evaluates agents on the discovery of physics concealed within a dynamic environment, using rigorous gravitational dynamics simulations. Gravity-Bench includes out-of-distribution cases, i.e. with physics that deviates from the real world, to evaluate true scientific generalization capabilities. Agents must plan to collect data within an experimental budget and must perform a dynamic form of data analysis and reasoning to solve tasks efficiently. Our benchmark admits an open-ended space of solutions. PhD-level solutions for each task are provided, to calibrate AI performance against human expertise. Technically at an upper-undergraduate level, our benchmark proves challenging to baseline AI agents. Gravity-Bench-v1 and planned extensions should help map out AI progress towards scientific discovery capabilities.
Paperid:1702
Authors:Hao Fei · Yuan Zhou · Juncheng Li · Xiangtai Li · Qingshan Xu · Bobo Li · Shengqiong Wu · Yaoting Wang · Junbao Zhou · Jiahao Meng · Qingyu Shi · Zhiyuan Zhou · Liangtao Shi · Minghe Gao · Daoan Zhang · Zhiqi Ge · Siliang Tang · Kaihang Pan · Yaobo Ye · Haobo Yuan · Tao Zhang · Weiming Wu · Tianjie Ju · Zixiang Meng · Shilin Xu · Liyu Jia · Wentao Hu · Meng Luo · Jiebo Luo · Tat-Seng Chua · Hanwang Zhang · Shuicheng YAN
Abstract:
The current community evaluates large multimodal foundation models straightforwardly by following the belief that higher performance across tasks indicates a stronger capability. In this work, we argue that this evaluation approach might hinder the development of more general and powerful multimodal generalists in Multimodal Language Models (MLLMs). To address this, we propose a General-Level framework, inspired by the five-level capability grading mechanisms in the autonomous driving industry, to assess the performance and generality of MLLMs across five levels. Central to this framework is the concept of Synergy, which categorizes capabilities based on whether MLLMs preserve synergy across comprehension, generation, and multimodal interactions. To evaluate the comprehensive abilities of various generalists, we present General-Bench, a massive multimodal benchmark encompassing a broader spectrum of skills, modalities, formats, and capabilities, with over 700 tasks and 325,800 instances. The evaluation results, involving over 100 state-of-the-art MLLMs, reveal the capability rankings of generalists, highlighting the challenges in achieving genuine AI.
Paperid:1703
Authors:Yuanhong Zhang · Muyao Yuan · Weizhan Zhang · Tieliang Gong · Wen Wen · Jiangyong Ying · Weijie Shi
Abstract:
The Segment Anything Model (SAM), a vision foundation model, exhibits impressive zero-shot capabilities in general tasks but struggles in specialized domains. Parameter-efficient fine-tuning (PEFT) is a promising approach to unleash the potential of SAM in novel scenarios. However, existing PEFT methods for SAM neglect the domain-invariant relations encoded in the pre-trained model. To bridge this gap, we propose InfoSAM, an information-theoretic approach that enhances SAM fine-tuning by distilling and preserving its pre-trained segmentation knowledge. Specifically, we formulate the knowledge transfer process as two novel mutual information-based objectives: (i) to compress the domain-invariant relation extracted from pre-trained SAM, excluding pseudo-invariant information as much as possible, and (ii) to maximize mutual information between the relational knowledge learned by the teacher (pre-trained SAM) and the student (fine-tuned model). The proposed InfoSAM establishes a robust distillation framework for PEFT of SAM. Extensive experiments across diverse benchmarks validate InfoSAM’s effectiveness in improving the SAM family's performance on real-world tasks, demonstrating its adaptability and superiority in handling specialized scenarios.
Paperid:1704
Authors:Zhepei Wei · Wei-Lin Chen · Xinyu Zhu · Yu Meng
Abstract:
Autoregressive decoding is inherently limited by its sequential token generation process, where each token must be generated before the next can be processed. This sequential dependency restricts the ability to fully leverage modern hardware's parallel processing capabilities. Existing methods like speculative decoding and layer skipping offer potential speedups but have notable drawbacks: speculative decoding relies on an auxiliary "drafter" model, which can be challenging to acquire and increases memory overhead, while layer skipping may introduce discrepancies in the generated outputs due to the missing key-value (KV) cache at skipped layers. In this work, we propose AdaDecode, a fast and accurate decoding method that accelerates autoregressive decoding without requiring auxiliary models or modifications to the original model parameters, while guaranteeing output consistency. AdaDecode leverages the insight that many tokens---particularly simple or highly predictable ones---can be generated accurately using intermediate layer representations, as further layers often do not significantly alter predictions once the model reaches a certain confidence level. By adaptively generating tokens at intermediate layers when confidence is high, AdaDecode enables the next token’s computation to begin immediately. The remaining layer computations for early-predicted tokens are deferred and executed in parallel with subsequent tokens when needed, maximizing hardware utilization and reducing decoding latency. A final verification step ensures that early-predicted tokens match the results of standard autoregressive decoding, preserving output parity.
Experiments across diverse generation tasks, including text summarization, code generation, and mathematical reasoning, demonstrate that AdaDecode consistently achieves superior decoding throughput compared to baselines with up to ${\bf 1.73 \times}$ speedup, while guaranteeing output parity with standard autoregressive decoding.
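The confidence-based early exit described above can be sketched in a toy snippet. It assumes per-layer next-token logits are available (the `layer_logits` list is illustrative), and it omits the deferred-computation and verification machinery; this is not the authors' code.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def adaptive_exit(layer_logits, threshold=0.9):
    """Return (token, exit_layer): emit a token at the first intermediate
    layer whose top softmax probability clears `threshold`; otherwise fall
    back to the final layer. A toy illustration of confidence-based exit."""
    for i, logits in enumerate(layer_logits):
        p = softmax(logits)
        if p.max() >= threshold:
            return int(p.argmax()), i
    p = softmax(layer_logits[-1])
    return int(p.argmax()), len(layer_logits) - 1

# A "confident" token: layer 1 already concentrates mass on token 2,
# so decoding can exit two layers early.
layer_logits = [np.array([0.1, 0.2, 0.3, 0.1]),
                np.array([0.0, 0.1, 5.0, 0.2]),
                np.array([0.0, 0.0, 6.0, 0.1])]
print(adaptive_exit(layer_logits))  # (2, 1)
```

Raising the threshold forces more tokens through all layers, trading speed for caution.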
Paperid:1705
Authors:Jiaxuan Zhang · Emadeldeen Eldele · Fuyuan CAO · Yang Wang · Xiaoli Li · Jiye Liang
Abstract:
Estimating Individual Treatment Effects (ITE) from observational data is challenging due to covariate shift and the absence of counterfactuals. While existing methods attempt to balance distributions globally, they often lack fine-grained sample-level alignment, especially in scenarios with significant individual heterogeneity. To address these issues, we reconsider the counterfactual as a proxy to emulate balanced randomization. Furthermore, we derive a theoretical bound that links the expected ITE estimation error to both factual prediction errors and representation distances between factuals and counterfactuals. Building on this theoretical foundation, we propose FCCL, a novel method designed to effectively capture the nuances of potential outcomes under different treatments by (i) generating diffeomorphic counterfactuals that adhere to the data manifold while maintaining high semantic similarity to their factual counterparts, and (ii) mitigating distribution shift via sample-level alignment grounded in our derived generalization-error bound, which considers factual-counterfactual similarity and category consistency. Extensive evaluations on benchmark datasets demonstrate that FCCL outperforms 12 state-of-the-art methods, particularly in capturing individual-level heterogeneity and handling sparse boundary samples.
Paperid:1706
Authors:Xingjin Wang · Howe Tissue · Lu Wang · Linjing Li · Daniel Zeng
Abstract:
Continual Pre-Training (CPT) has become a popular and effective method to apply strong foundation models to specific downstream tasks. In this work, we explore the learning dynamics throughout the CPT process for large language models (LLMs). We specifically focus on how general and downstream domain performance evolves at each training step, with domain performance measured via validation losses. We have observed that the CPT loss curve fundamentally characterizes the transition from one curve to another hidden curve, and can be described by decoupling the effects of distribution shift and learning rate (LR) annealing. We derive a CPT scaling law that combines the two factors, enabling the prediction of loss at any (continual) training step and across learning rate schedules (LRS) in CPT. Our formulation presents a comprehensive understanding of several critical factors in CPT, including the learning rate, the training steps, and the distribution distance between PT and CPT datasets. Moreover, our approach can be adapted to customize training hyper-parameters for different CPT goals, such as balancing general and domain-specific performance. Extensive experiments demonstrate that our scaling law holds across various CPT datasets and training hyper-parameters.
Paperid:1707
Authors:Sunghwan Hong · Jaewoo Jung · Heeseong Shin · Jisang Han · Jiaolong Yang · Chong Luo · Seungryong Kim
Abstract:
We consider the problem of novel view synthesis from unposed images in a single feed-forward pass. Our framework capitalizes on the fast speed, scalability, and high-quality 3D reconstruction and view synthesis capabilities of 3DGS, which we further extend to offer a practical solution that relaxes common assumptions such as dense image views, accurate camera poses, and substantial image overlaps. We achieve this by identifying and addressing unique challenges arising from the use of pixel-aligned 3DGS: misaligned 3D Gaussians across different views induce noisy or sparse gradients that destabilize training and hinder convergence, especially when the above assumptions are not met. To mitigate this, we employ pre-trained monocular depth estimation and visual correspondence models to achieve coarse alignments of 3D Gaussians. We then introduce lightweight, learnable modules to refine depth and pose estimates from the coarse alignments, improving the quality of 3D reconstruction and novel view synthesis. Furthermore, the refined estimates are leveraged to estimate geometry confidence scores, which assess the reliability of 3D Gaussian centers and condition the prediction of Gaussian parameters accordingly. Extensive evaluations on large-scale real-world datasets demonstrate that PF3plat sets a new state-of-the-art across all benchmarks, supported by comprehensive ablation studies validating our design choices. We will make the code and weights publicly available.
Paperid:1708
Authors:Yuliang Liu · Junjie Lu · Chaofeng Qu · Zhaoling Chen · Zefan Cai · Jason Liu · Chonghan Liu · Yunhui Xia · Li Zhao · Jiang Bian · Chuheng Zhang · Wei Shen · Zhouhan Lin
Abstract:
Current approaches for training Process Reward Models (PRMs) often involve breaking down responses into multiple reasoning steps using rule-based techniques, such as using predefined placeholder tokens or fixing the reasoning step length to a predetermined size. These approaches overlook the fact that true decision points are not typically marked by specific words in a text. To address this, we propose AdaptiveStep, a method that divides reasoning steps based on the model's confidence in predicting the next word. This division provides more decision-making insights at each step, enhancing downstream tasks such as reward model processing. Moreover, our method does not require manual annotation. We demonstrate its effectiveness through experiments with AdaptiveStep-trained PRMs in mathematical reasoning and code generation tasks. Experimental results indicate that the AdaptiveStep division achieves state-of-the-art Best-of-N performance, surpassing greedy search strategies with token-level value-guided decoding, while also reducing construction costs by over 30\% compared to existing open-source PRMs. In addition, we provide a thorough analysis and case study on the PRM's performance, transferability, and generalization capabilities. Our code is available on GitHub.
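The confidence-based step division described above can be sketched in a few lines. This toy example assumes per-token confidences are given (e.g., top-1 next-token probabilities); the names and the threshold `tau` are illustrative, not from the paper.

```python
def divide_steps(tokens, confidences, tau=0.8):
    """Split a token sequence into reasoning steps at low-confidence
    positions: a new step begins after any token whose next-token
    confidence falls below `tau`. Toy sketch of confidence-based division."""
    steps, current = [], []
    for tok, conf in zip(tokens, confidences):
        current.append(tok)
        if conf < tau:          # model is uncertain here -> decision point
            steps.append(current)
            current = []
    if current:
        steps.append(current)
    return steps

tokens = ["2", "+", "3", "=", "5", ",", "so", "the", "answer", "is", "5"]
confs  = [0.99, 0.97, 0.95, 0.6, 0.98, 0.99, 0.5, 0.96, 0.97, 0.98, 0.99]
print(divide_steps(tokens, confs))
# [['2', '+', '3', '='], ['5', ',', 'so'], ['the', 'answer', 'is', '5']]
```

Note how the breaks land after "=" and "so", where the model hesitates, rather than at fixed lengths or placeholder tokens.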
Paperid:1709
Authors:Zilin Kang · Chenyuan Hu · Yu Luo · Zhecheng Yuan · Ruijie Zheng · Huazhe Xu
Abstract:
Deep reinforcement learning for continuous control has recently achieved impressive progress. However, existing methods often suffer from primacy bias—a tendency to overfit early experiences stored in the replay buffer—which limits an RL agent’s sample efficiency and generalizability. A common existing approach to mitigate this issue is periodically resetting the agent during training. Yet, even after multiple resets, RL agents can still be impacted by early experiences. In contrast, humans are less susceptible to such bias, partly due to infantile amnesia, where the formation of new neurons disrupts early memory traces, leading to the forgetting of initial experiences. Inspired by these dual processes of forgetting and growing in neuroscience, in this paper we propose Forget and Grow (FoG), a new deep RL algorithm with two mechanisms introduced. First, Experience Replay Decay (ER Decay)—"forgetting early experience''—which balances memory by gradually reducing the influence of early experiences. Second, Network Expansion—"growing neural capacity''—which enhances agents' capability to exploit the patterns of existing data by dynamically adding new parameters during training. Empirical results on four major continuous control benchmarks with more than 40 tasks demonstrate the superior performance of FoG against existing SoTA deep RL algorithms, including BRO, SimBa and TD-MPC2.
Paperid:1710
Authors:Lihe Li · lei yuan · Pengsen Liu · Tao Jiang · Yang Yu
Abstract:
Training with diverse teammates is the key to learning generalizable agents. Typical approaches aim to generate diverse teammates by utilizing techniques like randomization, designing regularization terms, or reducing policy compatibility. However, such teammates lack semantic information, resulting in inefficient teammate generation and poor adaptability of the agents. To tackle these challenges, we propose Semantically Diverse Teammates Generation (SemDiv), a novel framework leveraging the capabilities of large language models (LLMs) to discover and learn diverse coordination behaviors at the semantic level. In each iteration, SemDiv first generates a novel coordination behavior described in natural language, then translates it into a reward function to train a teammate policy. Once the policy is verified to be meaningful, novel, and aligned with the behavior, the agents train a policy for coordination. Through this iterative process, SemDiv efficiently generates a diverse set of semantically grounded teammates, enabling agents to develop specialized policies and select the most suitable ones through language-based reasoning to adapt to unseen teammates. Experiments across four MARL environments show that SemDiv generates teammates covering a wide range of coordination behaviors, including those unreachable by baseline methods. Evaluation with twenty unseen representative teammates demonstrates SemDiv's superior coordination and adaptability.
Paperid:1711
Authors:Marta Skreta · Tara Akhound-Sadegh · Viktor Ohanesian · Roberto Bondesan · Alan Aspuru-Guzik · Arnaud Doucet · Rob Brekelmans · Alexander Tong · Kirill Neklyudov
Abstract:
While score-based generative models are the model of choice across diverse domains, there are limited tools available for controlling inference-time behavior in a principled manner, e.g. for composing multiple pretrained models. Existing classifier-free guidance methods use a simple heuristic to mix conditional and unconditional scores to approximately sample from conditional distributions. However, such methods do not approximate the intermediate distributions, necessitating additional `corrector' steps. In this work, we provide an efficient and principled method for sampling from a sequence of annealed, geometric-averaged, or product distributions derived from pretrained score-based models. We derive a weighted simulation scheme which we call Feynman-Kac Correctors (FKCs) based on the celebrated Feynman-Kac formula by carefully accounting for terms in the appropriate partial differential equations (PDEs). To simulate these PDEs, we propose Sequential Monte Carlo (SMC) resampling algorithms that leverage inference-time scaling to improve sampling quality. We empirically demonstrate the utility of our methods by proposing amortized sampling via inference-time temperature annealing, improving multi-objective molecule generation using pretrained models, and improving classifier-free guidance for text-to-image generation.
Paperid:1712
Authors:Tam Le · Truyen Nguyen · Hideitsu Hino · Kenji Fukumizu
Abstract:
We investigate the Sobolev IPM problem for probability measures supported on a graph metric space. Sobolev IPM is an important instance of integral probability metrics (IPM), and is obtained by constraining a critic function within a unit ball defined by the Sobolev norm. In particular, it has been used to compare probability measures and is crucial for several theoretical works in machine learning. However, to our knowledge, there are no efficient algorithmic approaches to compute Sobolev IPM effectively, which hinders its practical applications. In this work, we establish a relation between the Sobolev norm and a weighted $L^p$-norm, and leverage it to propose a *novel regularization* for Sobolev IPM. By exploiting the graph structure, we demonstrate that the regularized Sobolev IPM provides a *closed-form* expression for fast computation. This advancement addresses long-standing computational challenges, and paves the way to apply Sobolev IPM in practical applications, even in large-scale settings. Additionally, the regularized Sobolev IPM is negative definite. Utilizing this property, we design positive-definite kernels upon the regularized Sobolev IPM, and provide preliminary evidence of their advantages on document classification and topological data analysis for measures on a graph.
Paperid:1713
Authors:Ziyu Gong · Jim Lim · David I. Inouye
Abstract:
Distribution matching (DM) is a versatile domain-invariant representation learning technique that has been applied to tasks such as fair classification, domain adaptation, and domain translation. Existing DM methods can struggle with scalability (non-parametric methods), instability and mode collapse (adversarial methods), and likelihood-based methods often impose unnecessary biases through fixed priors or require learning complex prior distributions. We address a critical limitation: the absence of expressive yet learnable prior distributions that align with geometry-preserving regularization. Our key insight leverages the fact that gradient-based DM training only requires the prior's score function – not its density – enabling us to model the prior via denoising score matching. This approach eliminates biases from fixed priors (common in VAEs) and avoids the computational overhead of learning full prior densities (as in normalizing flows). Compared to other diffusion-based priors (e.g., LSGM), our method demonstrates better stability and computational efficiency. Furthermore, experiments demonstrate superior performance across benchmarks, establishing a new paradigm for efficient and flexible distribution matching.
Paperid:1714
Authors:Wenzhe Niu · Zongxia Xie · Yanru Sun · Wei He · Man Xu · Chao Hao
Abstract:
Recent research has shown an increasing interest in utilizing pre-trained large language models (LLMs) for a variety of time series applications. However, there are three main challenges when using LLMs as foundational models for time series forecasting: (1) Cross-domain generalization. (2) Cross-modality alignment. (3) Error accumulation in autoregressive frameworks. To address these challenges, we propose LangTime, a language-guided unified model for time series forecasting that incorporates cross-domain pre-training with reinforcement learning-based fine-tuning. Specifically, LangTime constructs Temporal Comprehension Prompts (TCPs), which include dataset-wise and channel-wise instructions, to facilitate domain adaptation and condense time series into a single token, enabling LLMs to better understand and align temporal data. To improve autoregressive forecasting, we introduce TimePPO, a reinforcement learning-based fine-tuning algorithm. TimePPO mitigates error accumulation by leveraging a multidimensional reward function tailored for time series and a repeat-based value estimation strategy. Extensive experiments demonstrate that LangTime achieves state-of-the-art cross-domain forecasting performance, while TimePPO fine-tuning effectively enhances the stability and accuracy of autoregressive forecasting.
Paperid:1715
Authors:Abhra Chaudhuri · Anjan Dutta · Tu Bui · Serban Georgescu
Abstract:
We aim to develop a fundamental understanding of modality collapse, a recently observed empirical phenomenon wherein models trained for multimodal fusion tend to rely only on a subset of the modalities, ignoring the rest. We show that modality collapse happens when noisy features from one modality are entangled, via a shared set of neurons in the fusion head, with predictive features from another, effectively masking out positive contributions from the predictive features of the former modality and leading to its collapse. We further prove that cross-modal knowledge distillation implicitly disentangles such representations by freeing up rank bottlenecks in the student encoder, denoising the fusion-head outputs without negatively impacting the predictive features from either modality. Based on the above findings, we propose an algorithm that prevents modality collapse through explicit basis reallocation, with applications in dealing with missing modalities. Extensive experiments on multiple multimodal benchmarks validate our theoretical claims.
Paperid:1716
Authors:Junjie Wen · Yichen Zhu · Minjie Zhu · Zhibin Tang · Jinming Li · Zhongyi Zhou · Xiaoyu Liu · Chaomin Shen · Yaxin Peng · Feifei Feng
Abstract:
In this paper, we present DiffusionVLA, a novel framework that integrates autoregressive reasoning with diffusion policies to address the limitations of existing methods: while autoregressive Vision-Language-Action (VLA) models lack precise and robust action generation, diffusion-based policies inherently lack reasoning capabilities. Central to our approach is autoregressive reasoning — a task decomposition and explanation process enabled by a pre-trained VLM — to guide diffusion-based action policies. To tightly couple reasoning with action generation, we introduce a reasoning injection module that directly embeds self-generated reasoning phrases into the policy learning process. The framework is simple, flexible, and efficient, enabling seamless deployment across diverse robotic platforms. We conduct extensive experiments using multiple real robots to validate the effectiveness of DiVLA. Our tests include a challenging factory sorting task, where DiVLA successfully categorizes objects, including those not seen during training. The reasoning injection module enhances interpretability, enabling explicit failure diagnosis by visualizing the model’s decision process. Additionally, we test DiVLA on a zero-shot bin-picking task, achieving \textbf{63.7\% accuracy on 102 previously unseen objects}. Our method demonstrates robustness to visual changes, such as distractors and new backgrounds, and easily adapts to new embodiments. Furthermore, DiVLA can follow novel instructions and retain conversational ability. Notably, DiVLA is data-efficient and fast at inference; our smallest DiVLA-2B runs at 82 Hz on a single A6000 GPU. Finally, we scale the model from 2B to 72B parameters, showcasing improved generalization capabilities with increased model size.
Paperid:1717
Authors:Yaming Guo · Siyang Guo · Hengshu Zhu · Ying Sun
Abstract:
Model editing plays a crucial role in the cost-effective development of large language models, and the challenge of evolving knowledge facilitates its sequential extension, namely lifelong model editing. However, progress on standard and lifelong editing has historically followed separate tracks, overlooking the potential of generalizing standard methods to lifelong scenarios. By establishing this bridge, we can provide robust baselines in lifelong scenarios and ensure that lifelong editing benefits from the ongoing advancements in standard editing technologies. In response, this paper proposes a general framework, Simulating Ideal Editor (SimIE), which restores the strong performance of parameter-modifying methods from standard model editing in a lifelong context. SimIE formulates the ideal parameter shift as the minimum-norm solution to a linear system, constructed using the Moore-Penrose inverse, and subsequently enables recursive updates by truncating the limiting expression of the Moore-Penrose inverse under two mild assumptions. Theoretically, we demonstrate that if either assumption is not met, the solution provided by SimIE remains near-optimal in a statistical sense or stable against perturbations introduced by the sequential editing, but a trade-off between optimality and stability arises when both assumptions fail. Extensive experiments validate the effectiveness of SimIE, which allows standard algorithms to achieve performance comparable to specialized lifelong model editing methods.
Paperid:1718
Authors:Zeyu Yang · Wesley Armour
Abstract:
Deep learning models in computer vision have achieved significant success but pose increasing concerns about energy consumption and sustainability. Despite these concerns, there is a lack of comprehensive understanding of their energy efficiency during inference. In this study, we conduct a comprehensive analysis of the inference energy consumption of 1,200 ImageNet classification models—the largest evaluation of its kind to date. Our findings reveal a steep diminishing return in accuracy gains relative to the increase in energy usage, highlighting sustainability concerns in the pursuit of marginal improvements. We identify key factors contributing to energy consumption and demonstrate methods to improve energy efficiency. To promote more sustainable AI practices, we introduce an energy efficiency scoring system and develop an interactive web application that allows users to compare models based on accuracy and energy consumption. By providing extensive empirical data and practical tools, we aim to facilitate informed decision-making and encourage collaborative efforts in developing energy-efficient AI technologies.
Paperid:1719
Authors:Pierrick Leroy · Antonio Mastropietro · Marco Nurisso · Francesco Vaccarino
Abstract:
Face Recognition (FR) tasks have significantly progressed with the advent of Deep Neural Networks, particularly through margin-based triplet losses that embed facial images into high-dimensional feature spaces. During training, these contrastive losses focus exclusively on identity information as labels. However, we observe a multiscale geometric structure emerging in the embedding space, influenced by interpretable facial (e.g., hair color) and image attributes (e.g., contrast). We propose a geometric approach to describe the dependence or invariance of FR models to these attributes and introduce a physics-inspired alignment metric. We evaluate the proposed metric on controlled simplified models and widely used FR models fine-tuned with synthetic data for targeted attribute augmentation. Our findings reveal that the models exhibit varying degrees of invariance across different attributes, providing insight into their strengths and weaknesses and enabling deeper interpretability.
Paperid:1720
Authors:Yongxiang Tang · Yanhua Cheng · Xiaocheng Liu · chenchen Jiao · Yanxiang Zeng · Ning Luo · Pengjia Yuan · Xialong Liu · Peng Jiang
Abstract:
In many machine learning tasks, it is often necessary for the relationship between input and output variables to be monotonic, including both strictly monotonic and implicitly monotonic relationships. Traditional methods for maintaining monotonicity mainly rely on construction or regularization techniques. This paper shows that the issue of strict monotonic probability can be viewed as a partial order between an observable revenue variable and a latent cost variable. This perspective enables us to reformulate the monotonicity challenge into modeling the latent cost variable. To tackle this, we introduce a generative network for the latent cost variable, referred to as the Generative Cost Model (GCM), which naturally addresses the strict monotonic problem. Additionally, we propose the Implicit Generative Cost Model (IGCM) to address the implicit monotonic problem. We further validate our approach with a numerical simulation of quantile regression and conduct multiple experiments on public datasets, showing that our method significantly outperforms existing monotonic modeling techniques. The code for our experiments can be found at https://github.com/icml259162/GCM.
Paperid:1721
Authors:Harsh Goel · Mohammad Omama · Behdad Chalaki · Vaishnav Tadiparthi · Ehsan Moradi Pari · Sandeep Chinchali
Abstract:
Multi-agent reinforcement learning (MARL) has achieved significant progress in large-scale traffic control, autonomous vehicles, and robotics. Drawing inspiration from biological systems where roles naturally emerge to enable coordination, role-based MARL methods have been proposed to enhance cooperation learning for complex tasks. However, existing methods exclusively derive roles from an agent's past experience during training, neglecting their influence on its future trajectories. This paper introduces a key insight: an agent’s role should shape its future behavior to enable effective coordination. Hence, we propose Role Discovery and Diversity through Dynamics Models (R3DM), a novel role-based MARL framework that learns emergent roles by maximizing the mutual information between agents' roles, observed trajectories, and expected future behaviors. R3DM optimizes the proposed objective through contrastive learning on past trajectories to first derive intermediate roles that shape intrinsic rewards to promote diversity in future behaviors across different roles through a learned dynamics model. Benchmarking on SMAC and SMACv2 environments demonstrates that R3DM outperforms state-of-the-art MARL approaches, improving multi-agent coordination to increase win rates by up to 20%.
Paperid:1722
Authors:Geyu Liang · Senne Michielssen · Salar Fattahi
Abstract:
The tradeoff between accuracy and interpretability has long been a challenge in machine learning (ML). This tension is particularly significant for emerging interpretable-by-design methods, which aim to redesign ML algorithms for trustworthy interpretability but often sacrifice accuracy in the process. In this paper, we address this gap by investigating the impact of deviations in concept representations—an essential component of interpretable models—on prediction performance and propose a novel framework to mitigate these effects. The framework builds on the principle of optimizing concept embeddings under constraints that preserve interpretability. Using a generative model as a test-bed, we rigorously prove that our algorithm achieves zero loss while progressively enhancing the interpretability of the resulting model. Additionally, we evaluate the practical performance of our proposed framework in generating explainable predictions for image classification tasks across various benchmarks. Compared to existing explainable methods, our approach not only improves prediction accuracy while preserving model interpretability across various large-scale benchmarks but also achieves this with significantly lower computational cost.
Paperid:1723
Authors:Jeet Mohapatra · Nima Dehmamy · Csaba Both · Subhro Das · Tommi Jaakkola
Abstract:
We introduce a novel approach for discovering effective degrees of freedom (DOF) in molecular dynamics simulations by mapping the DOF to approximate symmetries of the energy landscape. Unlike most existing methods, we do not require data and rely on knowledge of the forcefield (energy function) and the initial state. We present a scalable symmetry loss function compatible with existing forcefield frameworks and a Hessian-based method efficient for smaller systems. Our approach enables systematic exploration of conformational space by connecting structural dynamics to energy landscape symmetries. We apply our method to two systems, Alanine dipeptide and Chignolin, recovering their known important conformations. Our approach can prove useful for efficient exploration in molecular simulations with potential applications in protein folding and drug discovery.
Paperid:1724
Authors:Anirudh Sundara Rajan · Yong Jae Lee
Abstract:
Detecting AI-generated images is a challenging yet essential task. A primary difficulty arises from the detector’s tendency to rely on spurious patterns, such as compression artifacts, which can influence its decisions. These issues often stem from specific patterns that the detector associates with the real data distribution, making it difficult to isolate the actual generative traces. We argue that an image should be classified as fake if and only if it contains artifacts introduced by the generative model. Based on this premise, we propose Stay-Positive, an algorithm designed to constrain the detector’s focus to generative artifacts while disregarding those associated with real data. Experimental results demonstrate that detectors trained with Stay-Positive exhibit reduced susceptibility to spurious correlations, leading to improved generalization and robustness to post-processing. Additionally, unlike detectors that associate artifacts with real images, those that focus purely on fake artifacts are better at detecting inpainted real images.
Paperid:1725
Authors:Ziheng Qin · Hailun Xu · Wei Yew · Qi Jia · Yang Luo · Kanchan Sarkar · Danhui Guan · Kai Wang · Yang You
Abstract:
Machine learning relies heavily on data, yet the continuous growth of real-world data poses challenges for efficient dataset construction and training. Conventional approaches retain all available data, leading to non-optimal data and training efficiency. Active learning aims to reduce data redundancy by selecting a subset of samples to annotate, while it increases pipeline complexity and introduces bias. In this work, we propose Info-Coevolution, a new framework that efficiently enables models and data to coevolve through online selective annotation with no bias. Leveraging task-specific models (and open-source models), it selectively annotates and integrates online and web data to improve datasets efficiently. For real-world datasets like ImageNet-1K, Info-Coevolution reduces annotation and training costs by 32\% without performance loss. It can further reduce the annotation ratio to 50\% with semi-supervised learning. We also explore retrieval-based dataset enhancement using unlabeled open-source data. Code will be released.
Paperid:1726
Authors:Ye Zhang · Yu Zhou · Yifeng Wang · Jun Xiao · Ziyue Wang · Yongbing Zhang · Jianxu Chen
Abstract:
Cell instance segmentation is fundamental to analyzing biomedical images, yet accurately distinguishing tightly touching cells remains a persistent challenge. Existing instance segmentation frameworks, including detection-based, contour-based, and distance mapping-based approaches, have made significant progress, but balancing model performance with computational efficiency remains an open problem. In this paper, we propose a novel cell instance segmentation method inspired by the four-color theorem. By conceptualizing cells as countries and tissues as oceans, we introduce a four-color encoding scheme that ensures adjacent instances receive distinct labels. This reformulation transforms instance segmentation into a constrained semantic segmentation problem with only four predicted classes, substantially simplifying the instance differentiation process. To solve the training instability caused by the non-uniqueness of four-color encoding, we design an asymptotic training strategy and encoding transformation method. Extensive experiments on different modes demonstrate that our approach achieves state-of-the-art performance. Our code is available at https://github.com/anonymity/FCIS.
Paperid:1727
Authors:Fang-Duo Tsai · Shih-Lun Wu · Weijaw Lee · Sheng-Ping Yang · Bo-Rui Chen · Hao-Chung Cheng · Yi-Hsuan Yang
Abstract:
We propose MuseControlLite, a lightweight mechanism with only 85M trainable parameters to fine-tune text-to-music generation models to gain precise controllability over various time-varying conditions of musical attributes and from reference audio signals. The key finding is that positional embeddings, which have been seldom used in the conditioner of text-to-music generation models for text conditions, are critical when the condition of interest is a function of time. Taking melody control as an example, our experiment shows that the simple idea of adding rotary positional embeddings to the so-called decoupled cross-attention layers greatly improves the control accuracy from 56.6% to 70.9%, with 6.75 times fewer trainable parameters than state-of-the-art fine-tuning mechanisms, using the same pre-trained diffusion Transformer model of Stable Audio Open. We provide evaluation of various musical attribute controls, audio inpainting, and audio outpainting, demonstrating improved controllability over Music ControlNet and Stable Audio Open ControlNet with a lower cost of fine-tuning. We will open-source the code and model checkpoint, along with demo examples, which can be found at: https://MuseControlLite.github.io/web/.
Paperid:1728
Authors:Ilias Diakonikolas · Vasilis Kontonis · Christos Tzamos · Nikos Zarifis
Abstract:
We study the task of online learning in the presence of Massart noise. Specifically, instead of assuming that the online adversary chooses an arbitrary sequence of labels, we assume that the context $\boldsymbol{x}$ is selected adversarially but the label $y$ presented to the learner disagrees with the ground-truth label of $\boldsymbol{x}$ with unknown probability {\em at most} $\eta$. We focus on the fundamental class of $\gamma$-margin linear classifiers and present the first computationally efficient algorithm that achieves mistake bound $\eta T + o(T)$. We point out that the mistake bound achieved by our algorithm is qualitatively tight for computationally efficient algorithms; this follows from the fact that, even in the offline setting, achieving 0-1 error better than $\eta$ requires super-polynomial time under standard complexity assumptions. We extend our online learning model to a $k$-arm contextual bandit setting where the rewards---instead of satisfying commonly used realizability assumptions---are consistent, in expectation, with some linear ranking function with weight vector $\boldsymbol{w}^\ast$. Given a list of contexts $\boldsymbol{x}_1,\ldots, \boldsymbol{x}_k$, if $\boldsymbol{w}^*\cdot \boldsymbol{x}_i > \boldsymbol{w}^* \cdot \boldsymbol{x}_j$, the expected reward of action $i$ must be larger than that of $j$ by at least $\Delta$. We use our Massart online learner to design an efficient bandit algorithm that obtains expected reward at least $(1-1/k)~\Delta T - o(T)$ bigger than choosing a random action at every round.
Paperid:1729
Authors:Evi Micha · Vasilis Varsamis
Abstract:
Aggregating preferences under incomplete or constrained feedback is a fundamental problem in social choice and related domains. While prior work has established strong impossibility results for pairwise comparisons, this paper extends the inquiry to improvement feedback, where voters express incremental adjustments rather than complete preferences. We provide a complete characterization of the positional scoring rules that can be computed given improvement feedback. Interestingly, while plurality is learnable under improvement feedback—unlike with pairwise feedback—strong impossibility results persist for many other positional scoring rules. Furthermore, we show that improvement feedback, unlike pairwise feedback, does not suffice for the computation of any Condorcet-consistent rule. We complement our theoretical findings with experimental results, providing further insights into the practical implications of improvement feedback for preference aggregation.
Paperid:1730
Authors:Dohoon Lee · Jaehyun Park · Hyunwoo Kim · Kyogu Lee
Abstract:
Flow- and diffusion-based generative modeling demonstrates remarkable performance but lacks two critical properties of simulation-based methods: freedom of dimensionality and adaptability to different starting points of inference trajectories. While simulation-based methods possess these properties, they suffer from poor quality and inefficiency. To address this, we propose the Multidimensional Adaptive Coefficient (MAC), which extends unidimensional coefficients to multidimensional ones and dynamically adapts to starting points of trajectories via parameterization with $\phi$. Our method pre-trains flow and diffusion models $H_\theta$ with a simulation-free objective, followed by adversarial refinement for $\theta$ and $\phi$ under a simulation-based objective. Experiments show that our method achieves performance enhancement across various flow and diffusion methodologies and datasets, while preserving training efficiency. Based on these results, we aim to diversify the attention on trajectory optimality and encourage further exploration in this field.
Paperid:1731
Authors:Amer Krivosija · Alexander Munteanu · André Nusser · Chris Schwiegelshohn
Abstract:
This paper introduces $k$-Dynamic Time Warping ($k$-DTW), a novel dissimilarity measure for polygonal curves. $k$-DTW has stronger metric properties than Dynamic Time Warping (DTW) and is more robust to outliers than the Fréchet distance, which are the two gold standards of dissimilarity measures for polygonal curves. We show interesting properties of $k$-DTW and give an exact algorithm as well as a $(1+\varepsilon)$-approximation algorithm for $k$-DTW by a parametric search for the $k$-th largest matched distance. We prove the first dimension-free learning bounds for curves and further learning theoretic results. $k$-DTW not only admits smaller sample size than DTW for the problem of learning the median of curves, where some factors depending on the curves' complexity $m$ are replaced by $k$, but we also show a surprising separation on the associated Rademacher and Gaussian complexities: $k$-DTW admits strictly smaller bounds than DTW, by a factor $\tilde\Omega(\sqrt{m})$ when $k\ll m$. We complement our theoretical findings with an experimental illustration of the benefits of using $k$-DTW for clustering and nearest neighbor classification.
Paperid:1732
Authors:Xinshuai Dong · Ignavier Ng · Boyang Sun · Haoyue Dai · Guangyuan Hao · Shunxing Fan · Peter Spirtes · Yumou Qiu · Kun Zhang
Abstract:
Recent advances have shown that statistical tests for the rank of cross-covariance matrices play an important role in causal discovery. These rank tests include partial correlation tests as special cases and provide further graphical information about latent variables. Existing rank tests typically assume that all the continuous variables can be perfectly measured, and yet, in practice many variables can only be measured after discretization. For example, in psychometric studies, the continuous level of certain personality dimensions of a person can only be measured after being discretized into order-preserving options such as disagree, neutral, and agree. Motivated by this, we propose Mixed data Permutation-based Rank Test (MPRT), which properly controls the statistical errors even when some or all variables are discretized. Theoretically, we establish the exchangeability and estimate the asymptotic null distribution by permutations; as a consequence, MPRT can effectively control the Type I error in the presence of discretization while previous methods cannot. Empirically, our method is validated by extensive experiments on synthetic data and real-world data to demonstrate its effectiveness as well as applicability in causal discovery.
Paperid:1733
Authors:Yuhao Liu · Yu Chen · Rui Hu · Longbo Huang
Abstract:
The stochastic interpolant framework offers a powerful approach for constructing generative models based on ordinary differential equations (ODEs) or stochastic differential equations (SDEs) to transform arbitrary data distributions. However, prior analyses of this framework have primarily focused on the continuous-time setting, assuming perfect solution of the underlying equations. In this work, we present the first discrete-time analysis of the stochastic interpolant framework, where we introduce an innovative discrete-time sampler and derive a finite-time upper bound on its distribution estimation error. Our result provides a novel quantification of how different factors, including the distance between source and target distributions and estimation accuracy, affect the convergence rate, and also offers a new principled way to design efficient schedules for convergence acceleration. Finally, numerical experiments are conducted on the discrete-time sampler to corroborate our theoretical findings.
Paperid:1734
Authors:Chenhui Xu · Dancheng Liu · Yuting Hu · Jiajie Li · Ruiyang Qin · Qingxiao Zheng · Jinjun Xiong
Abstract:
Physics-Informed Neural Networks (PINNs) are deep-learning-based numerical solvers for partial differential equations (PDEs). Existing PINNs often suffer from failure modes of being unable to propagate patterns of initial conditions. We discover that these failure modes are caused by the simplicity bias of neural networks and the mismatch between a PDE's continuity and a PINN's discrete sampling. We reveal that the State Space Model (SSM) can be a continuous-discrete articulation allowing initial condition propagation, and that simplicity bias can be eliminated by aligning a sequence of moderate granularity. Accordingly, we propose PINNMamba, a novel framework that introduces sub-sequence modeling with SSM. Experimental results show that PINNMamba can reduce errors by up to 86.3\% compared with state-of-the-art architectures. Our code is available in the Supplementary Material.
Paperid:1735
Authors:Jianke Zhang · Yanjiang Guo · Yucheng Hu · Xiaoyu Chen · Xiang Zhu · Jianyu Chen
Abstract:
Recent advancements in Vision-Language-Action (VLA) models have leveraged pre-trained Vision-Language Models (VLMs) to improve the generalization capabilities. VLMs, typically pre-trained on vision-language understanding tasks, provide rich semantic knowledge and reasoning abilities. However, prior research has shown that VLMs often focus on high-level semantic content and neglect low-level features, limiting their ability to capture detailed spatial information and understand physical dynamics. These aspects, which are crucial for embodied control tasks, remain underexplored in existing pre-training paradigms. In this paper, we investigate the training paradigm for VLAs and introduce \textbf{UP-VLA}, a \textbf{U}nified VLA model trained with both multi-modal \textbf{U}nderstanding and future \textbf{P}rediction objectives, enhancing both high-level semantic comprehension and low-level spatial understanding. Experimental results show that UP-VLA achieves a 33\% improvement on the Calvin ABC-D benchmark compared to the previous state-of-the-art method. Additionally, UP-VLA demonstrates improved success rates in real-world manipulation tasks, particularly those requiring precise spatial information.
Paperid:1736
Authors:Yingzhao Jian · Yue Zhang · Ying Wei · Hehe Fan · Yi Yang
Abstract:
Accurately modeling chemical reactions using Artificial Intelligence (AI) can accelerate discovery and development, especially in fields like drug design and material science. Although AI has made remarkable advancements in single molecule recognition, such as predicting molecular properties, the study of interactions between molecules, particularly chemical reactions, has been relatively overlooked. In this paper, we introduce Reaction Graph (RG), a unified graph representation that encapsulates the 3D molecular structures within chemical reactions. RG integrates the molecular graphs of reactants and products into a cohesive framework, effectively capturing the interatomic relationships pertinent to the reaction process. Additionally, it incorporates the 3D structure information of molecules in a simple yet effective manner. We conduct experiments on a range of tasks, including chemical reaction classification, condition prediction, and yield prediction. RG achieves the highest accuracy across six datasets, demonstrating its effectiveness. The code is in the supplementary materials and will be publicly available.
Paperid:1737
Authors:Jonas Zausinger · Lars Pennig · Anamarija Kozina · Sean Sdahl · Julian Sikora · Adrian Dendorfer · Timofey Kuznetsov · Mohamad Hagog · Nina Wiedemann · Kacper Chlodny · Vincent Limbach · Anna Ketteler · Thorben Prein · Vishwa Singh · Michael Danziger · Jannis Born
Abstract:
While language models have exceptional capabilities at text generation, they lack a natural inductive bias for emitting numbers and thus struggle in tasks involving quantitative reasoning, especially arithmetic. This is particularly relevant in scientific datasets where combinations of text and numbers are abundant. One fundamental limitation is the nature of the cross-entropy (CE) loss, which assumes a nominal scale and thus cannot convey proximity between generated number tokens. In response, we here present a regression-like loss that operates purely on the token level. Our proposed Number Token Loss (NTL) comes in two flavors and minimizes either the $L^p$ norm or the Wasserstein distance between the numerical values of the real and predicted number tokens. NTL can easily be added to any language model and extends the CE objective during training without runtime overhead. We evaluate the proposed scheme on various mathematical datasets and find that it consistently improves performance in math-related tasks. In a direct comparison on a regression task, we find that NTL can match the performance of a regression head, despite operating on the token level. Finally, we scale NTL up to 3B parameters and observe improved performance, demonstrating its potential for seamless integration into LLMs.
Paperid:1738
Authors:Yinhong Liu · Zhijiang Guo · Tianya Liang · Ehsan Shareghi · Ivan Vulić · Nigel Collier
Abstract:
Large Language Models (LLMs) are expected to be predictable and trustworthy to support reliable decision-making systems. Yet current LLMs often show inconsistencies in their judgments. In this work, we examine \textit{logical preference consistency} as a foundational requirement for building more dependable LLM systems, ensuring stable and coherent decision-making while minimizing erratic or contradictory outputs. To quantify logical preference consistency, we propose a universal evaluation framework based on three fundamental properties: transitivity, commutativity, and negation invariance. Through extensive experimentation across diverse LLMs, we demonstrate that these properties serve as strong indicators of judgment robustness. Furthermore, we introduce a data refinement and augmentation technique, REPAIR, that enhances logical consistency while maintaining alignment with human preferences. Finally, we show that improving consistency leads to better performance in LLM-driven logic-based algorithms, reinforcing stability and coherence in decision-making systems.
Paperid:1739
Authors:Hanyang Zhao · Haoxian Chen · Ji Zhang · David Yao · Wenpin Tang
Abstract:
Reinforcement learning from human feedback (RLHF), which aligns a diffusion model with an input prompt, has become a crucial step in building reliable generative AI models. Most works in this area use a discrete-time formulation, which is prone to induced errors and often not applicable to models with higher-order/black-box solvers. The objective of this study is to develop a disciplined approach to fine-tuning diffusion models using continuous-time RL, formulated as a stochastic control problem with a reward function that aligns the end result (terminal state) with the input prompt. The key idea is to treat score matching as controls or actions, thereby connecting to policy optimization and regularization in continuous-time RL. To carry out this idea, we lay out a new policy optimization framework for continuous-time RL, and illustrate its potential in enhancing the value network design space by leveraging the structural property of diffusion models. We validate the advantages of our method through experiments on downstream tasks of fine-tuning large-scale Text2Image models, namely Stable Diffusion v1.5.
Paperid:1740
Authors:Tyler Clark · Mark Towers · Christine Evers · Jonathon Hare
Abstract:
Rainbow Deep Q-Network (DQN) demonstrated that combining multiple independent enhancements could significantly boost a reinforcement learning (RL) agent’s performance. In this paper, we present “Beyond The Rainbow” (BTR), a novel algorithm that integrates six improvements from across the RL literature into Rainbow DQN, establishing a new state-of-the-art for RL using a desktop PC, with a human-normalized interquartile mean (IQM) of 7.6 on Atari-60. Beyond Atari, we demonstrate BTR’s capability to handle complex 3D games, successfully training agents to play Super Mario Galaxy, Mario Kart, and Mortal Kombat with minimal algorithmic changes. Designing BTR with computational efficiency in mind, agents can be trained using a high-end desktop PC on 200 million Atari frames within 12 hours. Additionally, we conduct detailed ablation studies of each component, analyzing the performance and impact using numerous measures.
Paperid:1741
Authors:Yifu Yuan · Zhenrui Zheng · Zibin Dong · Jianye Hao
Abstract:
Multi-objective Reinforcement Learning (MORL) seeks to develop policies that simultaneously optimize multiple conflicting objectives, but it requires extensive online interactions. Offline MORL provides a promising solution by training on pre-collected datasets to generalize to any preference upon deployment. However, real-world offline datasets are often conservatively and narrowly distributed, failing to comprehensively cover preferences, leading to the emergence of out-of-distribution (OOD) preference areas. Existing offline MORL algorithms exhibit poor generalization to OOD preferences, resulting in policies that do not align with preferences. Leveraging the excellent expressive and generalization capabilities of diffusion models, we propose MODULI (Multi-objective Diffusion Planner with Sliding Guidance), which employs a preference-conditioned diffusion model as a planner to generate trajectories that align with various preferences and derives actions for decision-making. To achieve accurate generation, MODULI introduces two return normalization methods under diverse preferences for refining guidance. To further enhance generalization to OOD preferences, MODULI proposes a novel sliding guidance mechanism, which involves training an additional slider adapter to capture the direction of preference changes. Incorporating the slider, it transitions from in-distribution (ID) preferences to generating OOD preferences, patching and extending the incomplete Pareto front. Extensive experiments on the D4MORL benchmark demonstrate that our algorithm outperforms state-of-the-art offline MORL baselines, exhibiting excellent generalization to OOD preferences.
Paperid:1742
Authors:Yu Liang · Yu Yang · Wenjie Wei · Ammar Belatreche · Shuai Wang · Malu Zhang · Yang Yang
Abstract:
Binary Spiking Neural Networks (BSNNs) offer promising efficiency advantages for resource-constrained computing. However, their training algorithms often require substantial memory overhead due to latent weight storage and temporal processing requirements. To address this, we propose Binary Spiking Online Optimization (BSO), a novel online training algorithm that significantly reduces training memory while maintaining energy efficiency during the forward pass. BSO achieves this through two key innovations: memory requirements independent of time steps and elimination of latent weight storage. BSO directly updates binary weights through flip signals using the online training framework. These signals are triggered when gradient momentum exceeds a threshold, eliminating the need for latent weights during training. To leverage the inherent temporal dynamics of BSNNs, we further introduce T-BSO, a temporal-aware variant that captures gradient information across time steps for adaptive threshold adjustment. Theoretical analysis establishes convergence guarantees for both BSO and T-BSO, with formal regret bounds characterizing their convergence rates. Extensive experiments demonstrate that both BSO and T-BSO achieve superior optimization performance compared to existing training methods for binary spiking neural networks while significantly reducing memory overhead during training. Our work presents the first effective integration of online training with BSNNs, advancing their practical applicability in resource-constrained scenarios.
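The flip-signal idea above can be sketched in a few lines: binary weights are stored directly as signs, a gradient momentum is tracked per weight, and a weight is flipped only when the momentum crosses a threshold in the direction that would reduce the loss. The function name, the default hyperparameters, and the momentum reset after a flip are our assumptions for illustration, not the paper's exact algorithm.

```python
def bso_step(weights, grads, momentum, beta=0.9, tau=0.1):
    """One illustrative BSO-style update (sketch; names/defaults assumed).

    weights: list of +1/-1 binary weights -- no latent real-valued copy.
    momentum: running gradient momentum, updated in place.
    A flip signal fires when the momentum exceeds the threshold tau in
    magnitude AND shares the weight's sign (so the loss decreases by
    moving the weight the other way).
    """
    for i, g in enumerate(grads):
        momentum[i] = beta * momentum[i] + (1 - beta) * g
        if abs(momentum[i]) > tau and momentum[i] * weights[i] > 0:
            weights[i] = -weights[i]
            momentum[i] = 0.0  # reset accumulated evidence after a flip
    return weights

w = [1, -1, 1]
m = [0.0, 0.0, 0.0]
# a strong positive gradient on the first (+1) weight triggers a flip
bso_step(w, [5.0, 0.0, 0.0], m)
assert w == [-1, -1, 1]
```

Because only signs and one momentum scalar per weight are kept, the memory footprint is independent of the number of time steps, matching the abstract's first claimed innovation.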
Paperid:1743
Authors:Vincent Roulet · Tianlin Liu · Nino Vieillard · Michael Sander · Mathieu Blondel
Abstract:
The logistic loss (a.k.a. cross-entropy loss) is one of the most popular loss functions used for multiclass classification. It is also the loss function of choice for next-token prediction in language modeling. It is associated with the Kullback-Leibler (KL) divergence and the softargmax operator. In this work, we propose to construct new convex loss functions based on $f$-divergences. Our loss functions generalize the logistic loss in two directions: i) by replacing the KL divergence with $f$-divergences and ii) by allowing non-uniform reference measures. We instantiate our framework for numerous $f$-divergences, recovering existing losses and creating new ones. By analogy with the logistic loss, the loss function generated by an $f$-divergence is associated with an operator that we dub the $f$-softargmax. We derive a novel parallelizable bisection algorithm for computing the $f$-softargmax associated with any $f$-divergence. On the empirical side, one of the goals of this paper is to determine the effectiveness of loss functions beyond the classical cross-entropy in a language model setting, including on pre-training, post-training (SFT) and distillation. We show that the loss function generated by the $\alpha$-divergence (which is equivalent to Tsallis $\alpha$-negentropy in the case of unit reference measures) with $\alpha=1.5$ performs well across several tasks.
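To make the bisection idea concrete, here is a generic sketch: given scores, find a threshold $\tau$ so that the induced probabilities sum to one. The interface (a user-supplied inverse link `f_prime_inv`, the bracketing interval, the iteration count) is our assumption; the paper's algorithm is more general and parallelized, and we only check the KL special case, where the inverse link $\exp$ recovers the ordinary softargmax.

```python
import math

def f_softargmax_bisect(scores, f_prime_inv, lo=-100.0, hi=100.0, iters=60):
    """Bisection sketch for a softargmax-style map (illustrative only).

    Finds tau such that p_i = f_prime_inv(s_i - tau) sums to 1, then
    returns the probabilities. f_prime_inv must be nonnegative and
    nondecreasing, so the total mass is decreasing in tau.
    """
    def mass(tau):
        return sum(f_prime_inv(s - tau) for s in scores)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if mass(mid) > 1.0:
            lo = mid   # too much mass: raise the threshold
        else:
            hi = mid
    tau = (lo + hi) / 2.0
    return [f_prime_inv(s - tau) for s in scores]

# KL case: f_prime_inv(u) = exp(u) recovers the ordinary softargmax
probs = f_softargmax_bisect([1.0, 2.0, 3.0], math.exp)
assert abs(sum(probs) - 1.0) < 1e-6
```

Each bisection step only evaluates the scalar mass function, which is why the per-coordinate work parallelizes trivially.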
Paperid:1744
Authors:Xinyi Lu · Hao Zhang · Chenglin Li · Weijia Lu · ZHIFEI YANG · Wenrui Dai · xiaodong Zhang · Xiaofeng Ma · Can Zhang · Junni Zou · Hongkai Xiong
Abstract:
The significant communication overhead and client data heterogeneity have posed an important challenge to the current federated learning (FL) paradigm. Existing compression-based and optimization-based FL algorithms typically focus on addressing either the model compression challenge or the data heterogeneity issue individually, rather than tackling both of them. In this paper, we observe that by symbolizing the client model updates to be uploaded (i.e., normalizing the magnitude of each model parameter at local clients), the model heterogeneity that essentially stems from data heterogeneity can be mitigated, thereby helping improve the overall generalization performance of the globally aggregated model at the server. Inspired by this observation, and further motivated by the success of the Lion optimizer in achieving optimal performance on most tasks in centralized learning, we propose a new FL algorithm, called FedSMU, which simultaneously reduces the communication overhead and alleviates the data heterogeneity issue. Specifically, FedSMU splits the standard Lion optimizer into local updates and global execution, where only the symbols of client model updates are communicated between the client and server. We theoretically prove the convergence of FedSMU for general non-convex settings. Through extensive experimental evaluations on several benchmark datasets, we demonstrate that our FedSMU algorithm not only reduces the communication overhead, but also achieves better generalization performance than other compression-based and optimization-based baselines.
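A minimal sketch of the client/server split described above, under our own simplifying assumptions (one local step per round, Lion-style sign of an interpolated momentum as the "symbol", hypothetical function names): only ±1/0 symbols travel over the network, which simultaneously compresses communication and normalizes update magnitudes across heterogeneous clients.

```python
def client_symbolize(grad, momentum, beta1=0.9, beta2=0.99):
    """Client side (illustrative sketch, not the paper's exact FedSMU).

    Computes Lion's update direction -- the sign of an interpolation of
    the momentum and the fresh gradient -- and uploads only the symbols.
    """
    symbols = []
    for i, g in enumerate(grad):
        u = beta1 * momentum[i] + (1 - beta1) * g
        symbols.append(1 if u > 0 else (-1 if u < 0 else 0))
        momentum[i] = beta2 * momentum[i] + (1 - beta2) * g
    return symbols

def server_step(model, client_symbols, lr=0.1):
    """Server side: average the symbols across clients and take the step."""
    n = len(client_symbols)
    return [w - lr * sum(s[i] for s in client_symbols) / n
            for i, w in enumerate(model)]

m1, m2 = [0.0, 0.0], [0.0, 0.0]
s1 = client_symbolize([1.0, -2.0], m1)   # -> [1, -1]
s2 = client_symbolize([3.0, 4.0], m2)    # -> [1, 1]
new_model = server_step([0.0, 0.0], [s1, s2])
```

On the second coordinate the clients disagree, so their symbols cancel at the server; magnitudes of the local gradients never influence the aggregate.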
Paperid:1745
Authors:Angéline Pouget · Mohammad Yaghini · Stephan Rabanser · Nicolas Papernot
Abstract:
Deploying machine learning models in safety-critical domains poses a key challenge: ensuring reliable model performance on downstream user data without access to ground truth labels for direct validation. We propose the suitability filter, a novel framework designed to detect performance deterioration by utilizing suitability signals - model output features that are sensitive to covariate shifts and indicative of potential prediction errors. The suitability filter evaluates whether classifier accuracy on unlabeled user data shows significant degradation compared to the accuracy measured on the labeled test dataset. Specifically, it ensures that this degradation does not exceed a pre-specified margin, which represents the maximum acceptable drop in accuracy. To achieve reliable performance evaluation, we aggregate suitability signals for both test and user data and compare these empirical distributions using statistical hypothesis testing, thus providing insights into decision uncertainty. Our modular method adapts to various models and domains. Empirical evaluations across different classification tasks demonstrate that the suitability filter reliably detects performance deviations due to covariate shift. This enables proactive mitigation of potential failures in high-stakes applications.
Paperid:1746
Authors:Deyuan Liu · Zecheng Wang · Bingning Wang · Weipeng Chen · Chunshan Li · Zhiying Tu · Dianhui Chu · Dianbo Sui
Abstract:
The rapid proliferation of large language models (LLMs), such as GPT-4 and Gemini, underscores the intense demand for resources during their training processes, posing significant challenges due to substantial computational and environmental costs. In this paper, we introduce a novel checkpoint merging strategy aimed at making efficient use of intermediate checkpoints during LLM pretraining. This method utilizes intermediate checkpoints with shared training trajectories, and is rooted in an extensive search-space exploration for the best merging weight via Bayesian optimization. Through various experiments, we demonstrate that: (1) our proposed methodology can augment pretraining, offering substantial benefits at minimal cost; (2) despite requiring a given held-out dataset, our proposed methodology still demonstrates robust generalization capabilities across diverse domains, a pivotal aspect in pretraining.
Paperid:1747
Authors:Junhan Kim · Ho-young Kim · Eulrang Cho · Chungman Lee · Joonyoung Kim · Yongkweon Jeon
Abstract:
Quantization is a promising solution for deploying large language models (LLMs) on resource-constrained devices. Early methods developed for smaller networks like ResNet rely on gradient-based optimization, which becomes impractical for hyper-scale LLMs with billions of parameters. While recently proposed backpropagation-free or transformation-based methods alleviate this issue, their performance remains limited by either a lack of inter-layer dependency consideration or the use of naive nearest-rounding-based integer weight assignment to save the heavy computational cost of weight optimization. We thus introduce a novel backpropagation-free PTQ algorithm that optimizes integer weights by considering inter-layer dependencies. The key innovation is the development of attention-aware Hessian matrices that capture inter-layer interactions within the attention module. Extensive experiments demonstrate that our approach not only outperforms existing weight quantization methods but also shows good synergy with conventional methods to suppress activation outliers, leading to state-of-the-art weight-activation quantization performance.
Paperid:1748
Authors:Fei Zhang · Pei Zhang · Baosong Yang · Fei Huang · Ya Zhang · Yanfeng Wang
Abstract:
This paper presents the first study on adapting the visual in-context learning (V-ICL) paradigm to optical character recognition tasks, specifically focusing on text removal and segmentation. Most existing V-ICL generalists employ a reasoning-as-reconstruction approach: they use a straightforward image-label compositor as the prompt and query input, and then mask the query label to generate the desired output. This direct prompt confines the model to a challenging single-step reasoning process. To address this, we propose a task-chaining compositor in the form of image-removal-segmentation, providing an enhanced prompt that elicits reasoning with enriched intermediates. Additionally, we introduce context-aware aggregation, integrating the chained prompt pattern into the latent query representation, thereby strengthening the model's in-context understanding for improved reasoning. We also consider the issue of visual heterogeneity, which complicates the selection of homogeneous demonstrations in text recognition. This is effectively addressed through a simple self-prompting strategy, preventing the model's in-context learnability from devolving into specialist-like, context-free inference. Collectively, these insights culminate in our ConText model, which achieves new state-of-the-art results across both in- and out-of-domain benchmarks.
Paperid:1749
Authors:Elias Chaibub Neto
Abstract:
The development of deep generative models for tabular data is currently a very active research area in machine learning. These models, however, tend to be computationally heavy and require careful tuning of multiple model parameters. In this paper, we propose TabSDS, a lightweight, non-parametric, and model-free alternative to tabular deep generative models which leverages rank and data shuffling transformations for generating synthetic data that closely approximates the joint probability distribution of the real data. We evaluate TabSDS against multiple baselines implemented in the Synthcity Python library across several datasets. TabSDS shows very competitive performance against all baselines (including TabDDPM, the current state-of-the-art in tabular data generation). Importantly, the execution time of TabSDS is orders of magnitude faster than that of the deep generative baselines, and also considerably faster than other computationally efficient baselines such as adversarial random forests.
Paperid:1750
Authors:Deyu Bo · Xinchao Wang
Abstract:
This study introduces dataset distillation (DD) tailored for 3D data, particularly point clouds. DD aims to substitute large-scale real datasets with a small set of synthetic samples while preserving model performance. Existing methods mainly focus on structured data such as images. However, adapting DD for unstructured point clouds poses challenges due to their diverse orientations and resolutions in 3D space. To address these challenges, we theoretically demonstrate the importance of matching rotation-invariant features between real and synthetic data for 3D distillation. We further propose a plug-and-play point cloud rotator to align the point cloud to a canonical orientation, facilitating the learning of rotation-invariant features by all point cloud models. Furthermore, instead of optimizing fixed-size synthetic data directly, we devise a point-wise generator to produce point clouds at various resolutions based on the sampled noise amount. Compared to conventional DD methods, the proposed approach, termed DD3D, enables efficient training on low-resolution point clouds while generating high-resolution data for evaluation, thereby significantly reducing memory requirements and enhancing model scalability. Extensive experiments validate the effectiveness of DD3D in shape classification and part segmentation tasks across diverse scenarios, such as cross-architecture and cross-resolution settings.
Paperid:1751
Authors:Ashkan Shahbazi · Elaheh Akbari · Darian Salehi · XINRAN LIU · Navid NaderiAlizadeh · Soheil Kolouri
Abstract:
While self-attention has been instrumental in the success of Transformers, it can lead to over-concentration on a few tokens during training, resulting in suboptimal information flow. Enforcing doubly stochastic constraints in attention matrices has been shown to improve structure and balance in attention distributions. However, existing methods rely on iterative Sinkhorn normalization, which is computationally costly. In this paper, we introduce a novel, fully parallelizable doubly stochastic attention mechanism based on sliced optimal transport, leveraging Expected Sliced Transport Plans (ESP). Unlike prior approaches, our method enforces double stochasticity without iterative Sinkhorn normalization, significantly enhancing efficiency. To ensure differentiability, we incorporate a temperature-based soft sorting technique, enabling seamless integration into deep learning models. Experiments on multiple benchmarks spanning image classification, point cloud classification, sentiment analysis, and neural machine translation demonstrate enhanced attention regularization, resulting in superior performance across diverse applications.
Paperid:1752
Authors:Harrish Thasarathan · Julian Forsyth · Thomas Fel · Matthew Kowal · Konstantinos Derpanis
Abstract:
We present Universal Sparse Autoencoders (USAEs), a framework for uncovering and aligning interpretable concepts spanning multiple pretrained deep neural networks. Unlike existing concept-based interpretability methods, which focus on a single model, USAEs jointly learn a universal concept space that can reconstruct and interpret the internal activations of multiple models at once. Our core insight is to train a single, overcomplete sparse autoencoder (SAE) that ingests activations from any model and decodes them to approximate the activations of any other model under consideration. By optimizing a shared objective, the learned dictionary captures common factors of variation—concepts—across different tasks, architectures, and datasets. We show that USAEs discover semantically coherent and important universal concepts across vision models, ranging from low-level features (e.g., colors and textures) to higher-level structures (e.g., parts and objects). Overall, USAEs provide a powerful new method for interpretable cross-model analysis and offer novel applications—such as coordinated activation maximization—that open avenues for deeper insights in multi-model AI systems.
Paperid:1753
Authors:Xichen Ye · Yifan Wu · Weizhong Zhang · Cheng Jin · Yifan Chen
Abstract:
The Influence Function (IF) is a widely used technique for assessing the impact of individual training samples on model predictions. However, existing IF methods often fail to provide reliable influence estimates in deep neural networks, particularly when applied to noisy training data. This issue does not stem from inaccuracies in parameter change estimation, which has been the primary focus of prior research, but rather from deficiencies in loss change estimation, specifically due to the sharpness of validation risk. In this work, we establish a theoretical connection between influence estimation error, validation set risk, and its sharpness, underscoring the importance of flat validation minima for accurate influence estimation. Furthermore, we introduce a novel estimation form of the Influence Function specifically designed for flat validation minima. Experimental results across various tasks validate the superiority of our approach. The model implementation and the code are provided in the supplementary materials and scheduled to be open-sourced upon publication.
Paperid:1754
Authors:Tianyu Chen · Haoyi Zhou · Ying Li · Hao Wang · Chonghan Gao · Rongye Shi · Shanghang Zhang · Jianxin Li
Abstract:
Foundation models have revolutionized language modeling, but whether this success can be replicated in scientific computing remains unexplored. We present OmniArch, the first prototype aiming at solving multi-scale and multi-physics scientific computing problems with physical alignment. We address all three challenges with one unified architecture: the pre-training stage contains a Fourier encoder-decoder that fades out the disharmony across separated dimensions and a Transformer backbone that integrates quantities through temporal dynamics, while the novel PDE-Aligner performs physics-informed fine-tuning under flexible conditions. To the best of our knowledge, we are the first to conduct 1D-2D-3D unified pre-training on PDEBench; the resulting model not only sets new performance benchmarks for 1D, 2D, and 3D PDEs but also demonstrates exceptional adaptability to new physics via in-context and zero-shot learning, which supports realistic engineering applications and foresight physics discovery.
Paperid:1755
Authors:Jingwei Li · Jing Dong · Tianxing He · Jingzhao Zhang
Abstract:
Given the rising popularity of AI-generated art and the associated copyright concerns, identifying whether an artwork was used to train a diffusion model is an important research topic. This work approaches the problem from the membership inference attack (MIA) perspective. We first identify a limitation of applying existing MIA methods to proprietary diffusion models: the required access to internal U-Nets. To address this problem, we introduce a novel membership inference attack method that uses only the image-to-image variation API and operates without access to the model's internal U-Net. Our method is based on the intuition that the model can more easily obtain an unbiased noise prediction estimate for images from the training set. By applying the API multiple times to the target image, averaging the outputs, and comparing the result to the original image, our approach can classify whether a sample was part of the training set. We validate our method using DDIM and Stable Diffusion setups and further extend both our approach and existing algorithms to the Diffusion Transformer architecture. Our experimental results consistently outperform previous methods.
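The query-average-compare loop described above can be sketched as a single scoring function. Here `variation_api` is a hypothetical stand-in for an image-to-image variation endpoint, and images are flat lists of pixel values; both are our assumptions for illustration.

```python
def membership_score(image, variation_api, n_queries=8):
    """Illustrative membership-inference score (sketch, not the paper's code).

    Intuition from the abstract: for training-set images the model's noise
    prediction is closer to unbiased, so repeated variations average back
    toward the original. A LOW score therefore suggests membership.
    """
    # query the variation API several times and average pixel-wise
    outputs = [variation_api(image) for _ in range(n_queries)]
    avg = [sum(px) / n_queries for px in zip(*outputs)]
    # L2 distance between the averaged variation and the original image
    return sum((a - b) ** 2 for a, b in zip(avg, image)) ** 0.5

# toy check with a deterministic fake API that shifts every pixel by 1
fake_api = lambda img: [x + 1.0 for x in img]
score = membership_score([0.0, 0.0], fake_api, n_queries=4)
```

In practice one would calibrate a threshold on the score over images with known membership; everything here uses only black-box API access, which is the point of the method.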
Paperid:1756
Authors:Shayan Kiyani · George Pappas · Aaron Roth · Hamed Hassani
Abstract:
A fundamental question in data-driven decision making is how to quantify the uncertainty of predictions to inform risk-sensitive downstream actions, as often required in domains such as medicine. We develop a decision-theoretic foundation linking prediction sets to risk-averse decision-making, addressing three questions: (1) What is the correct notion of uncertainty quantification for risk-averse decision makers? We prove that prediction sets are optimal for decision makers who wish to optimize their value at risk. (2) What is the optimal policy that a risk-averse decision maker should use to map prediction sets to actions? We show that a simple max-min decision policy is optimal for risk-averse decision makers. Finally, (3) How can we derive prediction sets that are optimal for such decision makers? We provide an exact characterization in the population regime and a distribution-free finite-sample construction. These insights lead to Risk-Averse Calibration (RAC), a principled algorithm that is both practical (exploiting black-box predictions to enhance downstream utility) and safe (adhering to user-defined risk thresholds). We experimentally demonstrate RAC's advantages in medical diagnosis and recommendation systems, showing that it substantially improves the trade-off between safety and utility, delivering higher utility than existing methods while avoiding critical errors.
Paperid:1757
Authors:Kaiyu Guo · Zijian Wang · Tan Pan · Brian Lovell · Mahsa Baktashmotlagh
Abstract:
Out-of-Distribution (OOD) detection is essential for the trustworthiness of AI systems. Methods using prior information (i.e., subspace-based methods) have shown effective performance by extracting information geometry to detect OOD data with a more appropriate distance metric. However, these methods fail to address the geometry distorted by ill-distributed samples, due to the limitation of statically extracting information geometry from the training distribution. In this paper, we argue that the influence of ill-distributed samples can be corrected by dynamically adjusting the prior geometry in response to new data. Based on this insight, we propose a novel approach that dynamically updates the prior covariance matrix using real-time input features, refining its information. Specifically, we reduce the covariance along the direction of real-time input features and constrain adjustments to the residual space, thus preserving essential data characteristics and avoiding effects on unintended directions in the principal space. We evaluate our method on two pre-trained models for the CIFAR dataset and five pre-trained models for ImageNet-1k, including the self-supervised DINO model. Extensive experiments demonstrate that our approach significantly enhances OOD detection across various models. Our paper's code will be made available.
Paperid:1758
Authors:Yifan Zhang · Ge Zhang · Yue Wu · Kangping Xu · Quanquan Gu
Abstract:
Modeling human preferences is crucial for aligning foundation models with human values. Traditional reward modeling methods, such as the Bradley-Terry (BT) reward model, fall short in expressiveness, particularly in addressing intransitive preferences. In this paper, we introduce preference embedding, an approach that embeds responses into a latent space to capture intricate preference structures efficiently, achieving linear query complexity. Additionally, we propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback (RLHF). Experimental results show that our General Preference embedding Model (GPM) consistently outperforms the BT reward model on the RewardBench benchmark and effectively models cyclic preferences, where any BT reward model behaves like a random guess. Furthermore, evaluations on downstream tasks such as AlpacaEval 2.0, following language model post-training with GPO and our general preference model, reveal performance improvements over BT models. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values. The code is available at https://anonymous.4open.science/r/gpm-anonymous-1627.
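The claim that any Bradley-Terry reward model degenerates to a random guess on cyclic preferences follows from a one-line identity, sketched below (our own illustration, not code from the paper): the pairwise logits around any cycle telescope to zero, so no scalar reward assignment can make all three preferences in a cycle exceed one half.

```python
import math

def bt_prob(r_i, r_j):
    """Bradley-Terry win probability of i over j from scalar rewards."""
    return 1.0 / (1.0 + math.exp(r_j - r_i))

def cycle_logit_sum(ra, rb, rc):
    """Sum of the pairwise logits around the cycle a>b, b>c, c>a.

    The differences telescope, so this is identically zero for ANY
    scalar rewards -- the cycle cannot be represented, and the best
    BT fit to perfectly cyclic data is the 50/50 random guess.
    """
    return (ra - rb) + (rb - rc) + (rc - ra)

assert abs(cycle_logit_sum(0.3, -1.2, 5.0)) < 1e-12
assert bt_prob(1.0, 0.0) > 0.5 > bt_prob(0.0, 1.0)
```

A preference-embedding score, by contrast, is not constrained to a single scalar per response, which is what allows it to represent such cycles.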
Paperid:1759
Authors:peijie liu · Fengli Xu · Yong Li
Abstract:
Chain-of-Thought (CoT) reasoning has been shown to enhance the performance of large language models (LLMs) on complex tasks. However, the effectiveness of CoT is inconsistent: its performance varies both across tasks and within the same task when evaluated on different models. The mechanism behind the effectiveness of CoT remains a long-standing research question. In this work, we initially observe that the monotonic nature of the token probability distribution might be linked to CoT gain. Leveraging this insight, we propose two metrics based on the token probability distribution to assess CoT effectiveness across different benchmarks. By combining instance-level indicators with logistic regression models, we design Dynamic CoT to dynamically select between CoT and direct answers. Furthermore, leveraging an ensemble approach, we extend Dynamic CoT to closed-source models. The results show that our proposed indicators for assessing CoT effectiveness achieve an accuracy of 87.5%, and Dynamic CoT reduces token consumption by more than 35% while maintaining high accuracy. Overall, we understand CoT reasoning from a novel perspective and provide a framework for its more efficient deployment, with the potential to enhance LLM performance across a range of tasks.
Paperid:1760
Authors:Jianting Chen
Abstract:
Coreset selection (CS) for deep learning has become crucial for enhancing training efficiency and understanding datasets by identifying the most informative subsets. However, most existing methods rely on heuristics or complex optimization, struggling to balance efficiency and theoretical guarantees. To address this, we propose a novel CS objective that adaptively balances losses between core-set and non-core-set samples by minimizing the sum of squared losses across all samples. Building on this objective, we introduce the Maximum Reduction as Maximum Contribution criterion (MRMC), which identifies samples with the maximal reduction in loss as those making the maximal contribution to overall convergence. Additionally, a balance constraint is incorporated to ensure an even distribution of contributions from the core-set. Experimental results demonstrate that MRMC improves training efficiency significantly while preserving model performance with minimal cost.
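The selection criterion can be sketched as a simple ranking: samples whose loss dropped the most between two training checkpoints are taken as contributing the most to convergence. This is our own simplified illustration (the paper's criterion is derived from the squared-loss objective and adds a balance constraint we omit here).

```python
def mrmc_select(prev_losses, curr_losses, k):
    """Illustrative MRMC-style core-set selection (simplified sketch).

    prev_losses / curr_losses: per-sample losses at two checkpoints.
    Ranks samples by loss reduction (maximum reduction = maximum
    contribution) and keeps the indices of the top k.
    """
    reductions = [(p - c, i)
                  for i, (p, c) in enumerate(zip(prev_losses, curr_losses))]
    reductions.sort(reverse=True)  # largest reduction first
    return sorted(i for _, i in reductions[:k])

# sample 0 barely improved; samples 1 and 2 drove most of the progress
assert mrmc_select([2.0, 1.0, 3.0], [1.9, 0.2, 1.0], k=2) == [1, 2]
```

A full implementation would re-run this selection periodically and enforce the balance constraint so that no region of the data dominates the core-set.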
Paperid:1761
Authors:Fahim Tajwar · Yiding Jiang · Abitha Thankaraj · Sumaita Rahman · Zico Kolter · Jeff Schneider · Ruslan Salakhutdinov
Abstract:
Efficient exploration is essential for intelligent systems interacting with their environment, but existing language models often fall short in scenarios that require strategic information gathering. In this paper, we present PAPRIKA, a fine-tuning approach that enables language models to develop general decision-making capabilities that are not confined to particular environments. By training on synthetic interaction data from different tasks that require diverse strategies, PAPRIKA teaches models to explore and adapt their behavior based on the environment feedback in context, without gradient updates. Experimental results show that models fine-tuned with PAPRIKA can effectively transfer their learned decision-making capabilities to entirely unseen tasks without additional training. We also introduce a curriculum learning algorithm that improves PAPRIKA's sample efficiency. These results suggest a promising path towards AI systems that can autonomously solve novel sequential decision-making problems that require interactions with the external world.
Paperid:1762
Authors:Lucas Fabian Naumann · Jannik Irmai · Bjoern Andres
Abstract:
The Quantum Alternating Operator Ansatz (QAOA) is a hybrid quantum-classical variational algorithm for approximately solving combinatorial optimization problems on Noisy Intermediate-Scale Quantum devices. Although it has been successfully applied to a variety of problems, there is only limited work on correlation clustering due to the difficulty of modelling the problem constraints with the ansatz. Motivated by this, we present a generalization of QAOA that is more suitable for this problem. In particular, we modify QAOA in two ways: Firstly, we use nucleus sampling for the computation of the cost function. Secondly, we split the problem into sub-problems, solving each individually with QAOA. We show theoretically that this generalization, which we call Sub-Problem Quantum Alternating Operator Ansatz (SQAOA), can obtain optimal solutions for correlation clustering with certainty when $p \rightarrow \infty$. Further, we show experimentally that SQAOA achieves better approximation ratios than QAOA for correlation clustering, while at the same time using only one qubit per node of the respective problem instance and reducing the runtime (of simulations).
Paperid:1763
Authors:Yilong Song · Peijin Li · Bin Gao · Kun Yuan
Abstract:
Optimization problems on the Stiefel manifold, ranging from principal component analysis to enhancing neural network robustness, are ubiquitous in machine learning. The Landing algorithm avoids computationally expensive retraction operations, making it highly competitive for large-scale problems. This paper extends this method to distributed settings, introducing EF-Landing, the first retraction-free and communication-efficient algorithm for distributed stochastic optimization on the Stiefel manifold. By incorporating communication compression and error feedback, EF-Landing ensures convergence and constraint feasibility while significantly reducing communication overhead. We provide sharp convergence guarantees, demonstrating that EF-Landing achieves the same asymptotic linear speedup convergence rate as existing methods without communication compression. Furthermore, our analysis is highly versatile, applying to both deterministic and stochastic settings and encompassing algorithms based on gradient descent or momentum gradient descent. We also generalize EF-Landing to operate on block-wise Stiefel manifolds, enabling greater flexibility for structured constraints. Extensive numerical experiments validate our theoretical results.
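For context, a minimal single-machine sketch of the landing update that EF-Landing builds on (step size, penalty weight, and the zero-gradient demonstration are illustrative; the compression and error-feedback components are not shown):

```python
import numpy as np

def landing_step(X, grad, eta=0.1, lam=1.0):
    # Retraction-free update: a relative-gradient descent term plus a
    # penalty lam * X (X^T X - I) attracting X back to the Stiefel manifold.
    skew = grad @ X.T - X @ grad.T
    penalty = X @ (X.T @ X - np.eye(X.shape[1]))
    return X - eta * (skew @ X + lam * penalty)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 2)))
X = 1.1 * Q                                   # start off the manifold
for _ in range(100):
    X = landing_step(X, np.zeros_like(X))     # zero gradient: pure attraction
orth_err = np.linalg.norm(X.T @ X - np.eye(2))
```

With a zero gradient the iterates simply "land" back on the constraint set; in the distributed algorithm, each worker's (compressed) stochastic gradient replaces the zero placeholder.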
Paperid:1764
Authors:Ruichen Xu · Kexin Chen
Abstract:
Recent theoretical studies (Kou et al., 2023; Cao et al., 2022) have revealed a sharp phase transition from benign to harmful overfitting when the noise-to-feature ratio exceeds a threshold, a situation common in long-tailed data distributions where atypical data is prevalent. However, harmful overfitting rarely happens in overparameterized neural networks. Further experimental results suggested that memorization is necessary for achieving near-optimal generalization error in long-tailed data distributions (Feldman & Zhang, 2020). We argue that this discrepancy between theoretical predictions and empirical observations arises because previous feature-noise data models overlook the heterogeneous nature of noise across different data classes. In this paper, we refine the feature-noise data model by incorporating class-dependent heterogeneous noise and re-examine the overfitting phenomenon in neural networks. Through a comprehensive analysis of the training dynamics, we establish test loss bounds for the refined model. Our findings reveal that neural networks can leverage "data noise", previously deemed harmful, to learn implicit features that improve the classification accuracy for long-tailed data. Experimental validation on both synthetic and real-world datasets supports our theoretical results.
Paperid:1765
Authors:Xiaorui Su · Shvat Messica · Yepeng Huang · Ruth Johnson · Lukas Fesser · Shanghua Gao · Faryad Sahneh · Marinka Zitnik
Abstract:
Foundation models trained on patient electronic health records (EHRs) require tokenizing medical data into sequences of discrete vocabulary items. Existing tokenizers treat medical codes from EHRs as isolated textual tokens. However, each medical code is defined by its textual description, its position in ontological hierarchies, and its relationships to other codes, such as disease co-occurrences and drug-treatment associations. Medical vocabularies contain more than 600,000 codes with critical information for clinical reasoning. We introduce MedTok, a multimodal medical code tokenizer that uses the text descriptions and relational context of codes. MedTok processes text using a language model encoder and encodes the relational structure with a graph encoder. It then quantizes both modalities into a unified token space, preserving modality-specific and cross-modality information. We integrate MedTok into five EHR models and evaluate it on operational and clinical tasks across in-patient and out-patient datasets, including outcome prediction, diagnosis classification, drug recommendation, and risk stratification. Swapping standard EHR tokenizers with MedTok improves AUPRC across all EHR models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.30% on EHRShot, with the largest gains in drug recommendation. Beyond EHR modeling, we demonstrate the use of MedTok with medical QA systems. Our results demonstrate the potential of MedTok as a unified tokenizer for medical codes, improving tokenization for medical foundation models.
Paperid:1766
Authors:Xinyu Peng · Ziyang Zheng · Yaoming Wang · Han Li · Nuowen Kan · Wenrui Dai · Chenglin Li · Junni Zou · Hongkai Xiong
Abstract:
We propose Noise Conditional Variational Score Distillation (NCVSD), a novel method for distilling a pretrained diffusion model into a generative denoiser. We achieve this by revealing that the unconditional score function implicitly characterizes the score function of denoising posterior distributions. By integrating this insight into the Variational Score Distillation (VSD) framework, we enable scalable learning of generative denoisers capable of approximating samples from the denoising posterior distribution across a wide range of noise levels. The proposed generative denoisers exhibit desirable properties that allow fast generation while preserving the benefits of iterative refinement: (1) fast one-step generation through sampling from pure Gaussian noise at high noise levels; (2) improved sample quality by scaling the test-time compute with multi-step sampling; and (3) zero-shot probabilistic inference for flexible and controllable sampling. We evaluate NCVSD through extensive experiments, including class-conditional image generation and inverse problem solving. By scaling the test-time compute, our method outperforms teacher diffusion models and is on par with consistency models of larger sizes. Additionally, with significantly fewer NFEs than diffusion-based methods, we achieve record-breaking LPIPS on inverse problems.
Paperid:1767
Authors:Theodoros Kouzelis · Ioannis Kakogeorgiou · Spyros Gidaris · Nikos Komodakis
Abstract:
Latent generative models have emerged as a leading approach for high-quality image synthesis. These models rely on an autoencoder to compress images into a latent space, followed by a generative model to learn the latent distribution. We identify that existing autoencoders lack equivariance to semantic-preserving transformations like scaling and rotation, resulting in complex latent spaces that hinder generative performance. To address this, we propose EQ-VAE, a simple regularization approach that enforces equivariance in the latent space, reducing its complexity without degrading reconstruction quality. By fine-tuning pre-trained autoencoders with EQ-VAE, we enhance the performance of several state-of-the-art generative models, including DiT, SiT, REPA and MaskGIT, achieving a ×7 speedup on DiT-XL/2 with only five epochs of SD-VAE fine-tuning. EQ-VAE is compatible with both continuous and discrete autoencoders, thus offering a versatile enhancement for a wide range of latent generative models.
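The core idea can be sketched with a toy equivariance penalty (the encoder, the transformation, and where the loss is applied are illustrative stand-ins, not the paper's actual components):

```python
import numpy as np

def rot90(z):
    # Rotate a (C, H, W) array by 90 degrees in the spatial plane.
    return np.rot90(z, k=1, axes=(1, 2)).copy()

def equivariance_penalty(encode, x, transform):
    # Penalize the gap between encoding a transformed input and transforming
    # the encoding; it is zero when `encode` commutes with `transform` on x.
    gap = encode(transform(x)) - transform(encode(x))
    return float(np.mean(gap ** 2))

x = np.random.default_rng(0).standard_normal((3, 8, 8))
identity = lambda z: z                                       # trivially equivariant
row_scale = lambda z: z * np.arange(1, 9).reshape(1, 8, 1)   # breaks equivariance
p_id = equivariance_penalty(identity, x, rot90)
p_rs = equivariance_penalty(row_scale, x, rot90)
```

In the actual method the penalty involves decoding transformed latents rather than comparing latents directly, but the commutation principle is the same.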
Paperid:1768
Authors:Xiukun Wei · Xueru Zhang
Abstract:
Recent advances in generative models have made it increasingly difficult to distinguish real data from model-generated synthetic data. Using synthetic data for successive training of future model generations creates “self-consuming loops,” which may lead to model collapse or training instability. Furthermore, synthetic data is often subject to human feedback and curated by users based on their preferences. Ferbach et al. (2024) recently showed that when data is curated according to user preferences, the self-consuming retraining loop drives the model to converge toward a distribution that optimizes those preferences. However, in practice, data curation is often noisy or adversarially manipulated. For example, competing platforms may recruit malicious users to adversarially curate data and disrupt rival models. In this paper, we study how generative models evolve under self-consuming retraining loops with noisy and adversarially curated data. We theoretically analyze the impact of such noisy data curation on generative models and identify conditions for the robustness and stability of the retraining process. Building on this analysis, we design attack algorithms for competitive adversarial scenarios, where a platform with a limited budget employs malicious users to misalign a rival’s model from actual user preferences. Experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed algorithms.
Paperid:1769
Authors:Hang Gao · Huang Wenxuan · Fengge Wu · Zhao Junsuo · Changwen Zheng · Huaping Liu
Abstract:
The use of large language models (LLMs) as feature enhancers to optimize node representations, which are then used as inputs for graph neural networks (GNNs), has shown significant potential in graph representation learning. However, the fundamental properties of this approach remain underexplored. To address this, we conduct a more in-depth analysis based on the interchange intervention method. First, we construct a synthetic graph dataset with controllable causal relationships, enabling precise manipulation of semantic relationships and causal modeling to provide data for analysis. Using this dataset, we conduct interchange interventions to examine the deeper properties of LLM enhancers and GNNs, uncovering their underlying logic and internal mechanisms. Building on the analytical results, we design a plug-and-play optimization module to improve the information transfer between LLM enhancers and GNNs. Experiments across multiple datasets and models validate the proposed module.
Paperid:1770
Authors:Zhihui Xie · Jie chen · Liyu Chen · Weichao Mao · Jingjing Xu · Lingpeng Kong
Abstract:
Teaching large language models (LLMs) to critique and refine their outputs is crucial for building systems that can iteratively improve, yet it is fundamentally limited by the ability to provide *accurate judgments* and *actionable suggestions*. In this work, we study LLM critics for code generation and propose $\texttt{CTRL}$, a framework for $\texttt{C}$ritic $\texttt{T}$raining via $\texttt{R}$einforcement $\texttt{L}$earning, which trains a critic model to generate feedback that maximizes correction performance for a fixed generator model without human supervision. Our results demonstrate that critics trained with $\texttt{CTRL}$ significantly enhance pass rates and mitigate compounding errors across both base and stronger generator models. Furthermore, we show that these critic models act as accurate generative reward models and enable test-time scaling through iterative critique-revision, achieving up to 106.1\% relative improvements across challenging code generation benchmarks.
Paperid:1771
Authors:Nian Liu · Xiaoxin He · Thomas Laurent · Francesco Di Giovanni · Michael Bronstein · Xavier Bresson
Abstract:
Spectral graph convolution, an important tool of data filtering on graphs, relies on two essential decisions: selecting spectral bases for signal transformation and parameterizing the kernel for frequency analysis. While recent techniques mainly focus on the standard Fourier transform and vector-valued spectral functions, they fall short in the flexibility to model signal distributions over large spatial ranges and in the capacity of the spectral function. In this paper, we present a novel wavelet-based graph convolution network, namely WaveGC, which integrates multi-resolution spectral bases and a matrix-valued filter kernel. Theoretically, we establish that WaveGC can effectively capture and decouple short-range and long-range information, providing superior filtering flexibility and surpassing existing graph wavelet neural networks. To instantiate WaveGC, we introduce a novel technique for learning general graph wavelets by separately combining odd and even terms of Chebyshev polynomials. This approach strictly satisfies the wavelet admissibility criteria. Our numerical experiments showcase consistent improvements in both short-range and long-range tasks, underscoring the effectiveness of the proposed model in handling different scenarios. Our code is available at https://anonymous.4open.science/r/WaveGC.
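The wavelet construction rests on Chebyshev polynomials of a rescaled graph Laplacian; a minimal sketch of the recurrence and the odd/even split follows (the learnable combination of terms and the admissibility machinery are omitted):

```python
import numpy as np

def chebyshev_basis(L_resc, K):
    # T_0 .. T_{K-1} of a rescaled Laplacian (eigenvalues in [-1, 1]) via the
    # three-term recurrence T_k = 2 L T_{k-1} - T_{k-2}.
    n = L_resc.shape[0]
    T = [np.eye(n), L_resc.copy()]
    while len(T) < K:
        T.append(2.0 * L_resc @ T[-1] - T[-2])
    return T[:K]

# toy graph: a path on 4 nodes, symmetric-normalized Laplacian
A = np.diag(np.ones(3), 1); A = A + A.T
d = A.sum(1)
L = np.eye(4) - A / np.sqrt(np.outer(d, d))
T = chebyshev_basis(L - np.eye(4), K=5)   # rescale: eigenvalues of L lie in [0, 2]
even_terms = T[0::2]                      # combined separately from the odd terms
odd_terms = T[1::2]
```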
Paperid:1772
Authors:Yucen Wang · Rui Yu · Shenghua Wan · Le Gan · De-Chuan Zhan
Abstract:
Foundation Models (FMs) and World Models (WMs) offer complementary strengths in task generalization at different levels. In this work, we propose FOUNDER, a framework that integrates the generalizable knowledge embedded in FMs with the dynamic modeling capabilities of WMs to enable open-ended decision-making in embodied environments in a reward-free manner. We learn a mapping function that grounds FM representations in the WM state space, effectively inferring the agent's physical states in the world simulator from external observations. This mapping enables the learning of a goal-conditioned policy through imagination during behavior learning, with the mapped task serving as the goal state. Our method leverages the predicted temporal distance to the goal state as an informative reward signal. FOUNDER demonstrates superior performance on various multi-task offline visual control benchmarks, excelling in capturing the deep-level semantics of tasks specified by text or videos, particularly in scenarios involving complex observations or domain gaps where prior methods struggle. The consistency of our learned reward function with the ground-truth reward is also empirically validated.
Paperid:1773
Authors:Benjamin Fuhrer · Chen Tessler · Gal Dalal
Abstract:
We present Gradient Boosting Reinforcement Learning (GBRL), a framework that adapts the strengths of Gradient Boosting Trees (GBT) to reinforcement learning (RL) tasks. While neural networks (NNs) have become the de facto choice for RL, they face significant challenges with structured and categorical features and tend to generalize poorly to out-of-distribution samples. These are challenges for which GBTs have traditionally excelled in supervised learning. However, GBT's application in RL has been limited. The design of traditional GBT libraries is optimized for static datasets with fixed labels, making them incompatible with RL's dynamic nature, where both state distributions and reward signals evolve during training. GBRL overcomes this limitation by continuously interleaving tree construction with environment interaction. Through extensive experiments, we demonstrate that GBRL outperforms NNs in domains with structured observations and categorical features, while maintaining competitive performance on standard continuous control benchmarks. Like its supervised learning counterpart, GBRL demonstrates superior robustness to out-of-distribution samples and better handles irregular state-action relationships.
Paperid:1774
Authors:Armando Bellante · Martin Plávala · Alessandro Luongo
Abstract:
This paper proposes a family of permutation-invariant graph embeddings, generalizing the Skew Spectrum of graphs introduced by Kondor & Borgwardt (2008). Grounded in group theory and harmonic analysis, our method introduces a new class of graph invariants that are isomorphism-invariant and capable of embedding richer graph structures, including attributed graphs, multilayer graphs, and hypergraphs, which the Skew Spectrum could not handle. Our generalization further defines a family of functions that enables a trade-off between computational complexity and expressivity. By applying generalization-preserving heuristics to this family, we improve the Skew Spectrum's expressivity at the same computational cost. We formally prove the invariance of our generalization, demonstrate its improved expressiveness through experiments, and discuss its efficient computation in the heuristic setting.
Paperid:1775
Authors:XUCHEN · Yuxuan Yue · Dawei Yang · Zukang Xu · Xing Hu · Jiangyong Yu · Zhixuan Chen · Sifan Zhou · Zhihang Yuan
Abstract:
RWKV is a modern RNN architecture with comparable performance to Transformer, but still faces challenges when deployed to resource-constrained devices. Post Training Quantization (PTQ), an essential technique to reduce model size and inference latency, has been widely used in Transformer models. However, it suffers significant degradation of performance when applied to RWKV. This paper investigates and identifies two key constraints inherent in the properties of RWKV: (1) Non-linear operators hinder the parameter-fusion of both smooth- and rotation-based quantization, introducing extra computation overhead. (2) The larger amount of uniformly distributed weights poses challenges for cluster-based quantization, leading to reduced accuracy. To this end, we propose RWKVQuant, a PTQ framework tailored for RWKV models, consisting of two novel techniques: (1) a coarse-to-fine proxy capable of adaptively selecting different quantization approaches by assessing the uniformity and identifying outliers in the weights, and (2) a codebook optimization algorithm that enhances the performance of cluster-based quantization methods for element-wise multiplication in RWKV. Experiments show that RWKVQuant can quantize RWKV-6-14B into about 3-bit with less than 1\% accuracy loss and 2.14$\times$ speed up.
Paperid:1776
Authors:Alessandro Pierro · Steven Abreu · Jonathan Timcheck · Philipp Stratmann · Andreas Wild · Sumit Shrestha
Abstract:
Linear recurrent neural networks enable powerful long-range sequence modeling with constant memory usage and time-per-token during inference. These architectures hold promise for streaming applications at the edge, but deployment in resource-constrained environments requires hardware-aware optimizations to minimize latency and energy consumption. Unstructured sparsity offers a compelling solution, enabling substantial reductions in compute and memory requirements when accelerated by compatible hardware platforms. In this paper, we conduct a scaling study to investigate the Pareto front of performance and efficiency across inference compute budgets. We find that highly sparse linear RNNs *consistently* achieve better efficiency-performance trade-offs than dense baselines, with $2\times$ less compute and $36$\% less memory at iso-accuracy. Our models achieve state-of-the-art results on a real-time streaming task for audio denoising. By quantizing our sparse models to fixed-point arithmetic and deploying them on the Intel Loihi 2 neuromorphic chip for real-time processing, we translate model compression into tangible gains of $42\times$ lower latency and $149\times$ lower energy consumption compared to a dense model on an edge GPU. Our findings showcase the transformative potential of unstructured sparsity, paving the way for highly efficient recurrent neural networks in real-world, resource-constrained environments.
Paperid:1777
Authors:Seongheon Park · Xuefeng Du · Min-Hsuan Yeh · Haobo Wang · Sharon Li
Abstract:
Hallucinations in LLMs pose a significant concern to their safe deployment in real-world applications. Recent approaches have leveraged the latent space of LLMs for hallucination detection, but their embeddings, optimized for linguistic coherence rather than factual accuracy, often fail to clearly separate truthful and hallucinated content. To this end, we propose the Truthfulness Separator Vector (TSV), a lightweight and flexible steering vector that reshapes the LLM’s representation space during inference to enhance the separation between truthful and hallucinated outputs, without altering model parameters. Our two-stage framework first trains TSV on a small set of labeled exemplars to form compact and well-separated clusters. It then augments the exemplar set with unlabeled LLM generations, employing an optimal transport-based algorithm for pseudo-labeling combined with a confidence-based filtering process. Extensive experiments demonstrate that TSV achieves state-of-the-art performance with minimal labeled data, exhibiting strong generalization across datasets and providing a practical solution for real-world LLM applications.
Paperid:1778
Authors:Roie Reshef · Kfir Levy
Abstract:
This paper addresses the challenge of achieving Differential Privacy (DP) in Federated Learning (FL) under the partial-participation setting, where each machine participates in only some of the training rounds. While earlier work achieved optimal performance and efficiency in full-participation scenarios, these methods could not extend effectively to cases with partial participation. Our approach addresses this gap by introducing a novel noise-cancellation mechanism that ensures privacy without compromising convergence rates or computational efficiency. We analyze our method within the Stochastic Convex Optimization (SCO) framework and demonstrate that it achieves optimal performance for both homogeneous and heterogeneous data distributions. This work broadens the applicability of DP in FL, providing a practical and efficient solution for privacy-preserving learning in distributed systems with partial participation.
Paperid:1779
Authors:Yi-Fan Zhang · Tao Yu · Haochen Tian · Chaoyou Fu · Peiyan Li · Jianshu Zeng · Wulin Xie · Yang Shi · Huanyu Zhang · Junkang Wu · xue wang · Yibo Hu · Bin Wen · Tingting Gao · Zhang Zhang · Fan Yang · Di ZHANG · Liang Wang · Rong Jin
Abstract:
Existing efforts to align multimodal large language models (MLLMs) with human preferences have only achieved progress in narrow areas, such as hallucination reduction, but remain limited in practical applicability and generalizability. To this end, we introduce MM-RLHF, a dataset containing 120k fine-grained, human-annotated preference comparison pairs. This dataset represents a substantial advancement over existing resources, offering superior size, diversity, annotation granularity, and quality. Leveraging this dataset, we propose several key innovations to improve both the quality of reward models and the efficiency of alignment algorithms. Notably, we introduce the Critique-Based Reward Model, which generates critiques of model outputs before assigning scores, offering enhanced interpretability and more informative feedback compared to traditional scalar reward mechanisms. Additionally, we propose Dynamic Reward Scaling, a method that adjusts the loss weight of each sample according to the reward signal, thereby optimizing the use of high-quality comparison pairs. Our approach is rigorously evaluated across 10 distinct dimensions, encompassing 27 benchmarks, with results demonstrating significant and consistent improvements in model performance (Figure 1).
Paperid:1780
Authors:Hiroshi Sawada · Kazuo Aoyama · Yuya Hikima
Abstract:
This paper proposes a novel concept of natural perturbations for black-box training of neural networks by zeroth-order optimization. When a neural network is implemented directly in hardware, training its parameters by backpropagation ends up with an inaccurate result due to the lack of detailed internal information. We instead employ zeroth-order optimization, where the sampling of parameter perturbations is of great importance. The sampling strategy we propose maximizes the entropy of perturbations with a regularization that the probability distribution conditioned by the neural network does not change drastically, inheriting the concept of the natural gradient. Experimental results show the superiority of our proposal on diverse datasets, tasks, and architectures.
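For context, the generic two-point zeroth-order gradient estimator with Gaussian perturbations that such methods build on can be sketched as follows (the paper's entropy-maximizing, natural-gradient-inspired perturbation sampling is not reproduced here):

```python
import numpy as np

def zo_gradient(f, theta, sigma=0.01, n_samples=1000, rng=None):
    # Two-point zeroth-order estimate: probe f along random Gaussian
    # directions u and average the finite-difference slopes times u.
    rng = rng or np.random.default_rng()
    g = np.zeros_like(theta)
    for _ in range(n_samples):
        u = rng.standard_normal(theta.shape)
        g += (f(theta + sigma * u) - f(theta - sigma * u)) / (2.0 * sigma) * u
    return g / n_samples

f = lambda w: float(np.sum(w ** 2))       # true gradient at theta is 2 * theta
theta = np.array([1.0, -2.0, 0.5])
g = zo_gradient(f, theta, n_samples=2000, rng=np.random.default_rng(0))
```

Only function evaluations of f are needed, which is what makes the approach applicable to hardware-implemented networks whose internals cannot be differentiated.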
Paperid:1781
Authors:Wendong Zheng · Junyang Chen · Husheng Guo · Wenjian Wang
Abstract:
Recently, the potential of lightweight models for resource-constrained scenarios has garnered significant attention, particularly in safety-critical tasks such as bio-electrical signal classification and B-ultrasound-assisted diagnosis. These tasks are frequently affected by environmental noise due to patient movement artifacts and inherent device noise, which pose significant challenges for lightweight models (e.g., deep binary neural networks (DBNNs)) to perform robust inference. A pertinent question arises: can a well-trained DBNN effectively resist environmental noise during inference? In this study, we find that the DBNN's robustness vulnerability comes from the binary weights and scaling factors. Drawing upon theoretical insights, we propose L1-infinity norm constraints for the binary weights and scaling factors, which yield a tighter upper bound compared to existing state-of-the-art (SOTA) methods. Finally, visualization studies show that our approach introduces minimal noise perturbations at the periphery of the feature maps. Our approach outperforms the SOTA methods, as validated by several experiments conducted on bio-electrical and image classification datasets. We hope our findings can raise awareness among researchers about the environmental-noise robustness of DBNNs.
Paperid:1782
Authors:Liyuan Liang · Xinyi Chen · Gan Luo · Kun Yuan
Abstract:
A key challenge in decentralized optimization is determining the optimal convergence rate and designing algorithms to achieve it. While this problem has been extensively addressed for doubly-stochastic and column-stochastic mixing matrices, the row-stochastic scenario remains unexplored. This paper bridges this gap by introducing effective metrics to capture the influence of row-stochastic mixing matrices and establishing the first convergence lower bound for decentralized learning over row-stochastic networks. However, existing algorithms fail to attain this lower bound due to two key issues: deviation of the descent direction from the globally averaged gradient and instability introduced by the PULL-DIAG protocol. To address the descent deviation, we propose a novel analysis framework demonstrating that PULL-DIAG-GT achieves linear speedup, the first such result for row-stochastic decentralized optimization. Moreover, by incorporating a multi-step gossip (MG) protocol, we resolve the instability issue and attain the lower bound, achieving optimal complexity for decentralized optimization over row-stochastic networks.
Paperid:1783
Authors:Peter Blohm · Patrick Indri · Thomas Gärtner · SAGAR MALHOTRA
Abstract:
We propose and investigate probabilistic guarantees for the adversarial robustness of classification algorithms. While traditional formal verification approaches for robustness are intractable and sampling-based approaches do not provide formal guarantees, our approach is able to efficiently certify a probabilistic relaxation of robustness. The key idea is to sample an $\epsilon$-net and invoke a local robustness oracle on the sample. Remarkably, the size of the sample needed to achieve probably approximately global robustness guarantees is independent of the input dimensionality, the number of classes, and the learning algorithm itself. Our approach can, therefore, be applied even to large neural networks that are beyond the scope of traditional formal verification. Experiments empirically confirm that it characterizes robustness better than state-of-the-art sampling-based approaches and scales better than formal methods.
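The dimension-free flavor of such sample bounds can be seen from the standard one-sided PAC calculation (a sketch under textbook assumptions; the paper's exact constants and guarantee may differ):

```python
import math

def sample_size(eps, delta):
    # If a local-robustness oracle accepts all of n i.i.d. samples with
    # n >= ln(1/delta) / ln(1/(1 - eps)), then with probability >= 1 - delta
    # the probability mass of non-robust inputs is at most eps. Note the
    # bound involves neither input dimension, number of classes, nor model.
    return math.ceil(math.log(1.0 / delta) / -math.log(1.0 - eps))

n = sample_size(eps=0.01, delta=1e-3)   # a few hundred samples suffice
```

Tightening eps or delta only grows the bound logarithmically in 1/delta and roughly linearly in 1/eps.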
Paperid:1784
Authors:Juwei Yue · Haikuo Li · Jiawei Sheng · Xiaodong Li · Taoyu Su · Tingwen Liu · Li Guo
Abstract:
Graph neural networks (GNNs) leverage message passing mechanisms to learn the topological features of graph data. Traditional GNNs learn node features in a spatial domain unrelated to the topology, which can hardly guarantee that topological features are captured. In this paper, we formulate message passing as a system of hyperbolic partial differential equations (hyperbolic PDEs), constituting a dynamical system that explicitly maps node representations into a particular solution space. This solution space is spanned by a set of eigenvectors describing the topological structure of graphs. Within this system, at any moment in time, a node's features can be decomposed into a superposition of this basis of eigenvectors. This not only enhances the interpretability of message passing but also enables the explicit extraction of fundamental characteristics of the topological structure. Furthermore, by solving this system of hyperbolic PDEs, we establish a connection with spectral graph neural networks (spectral GNNs), serving as a message passing enhancement paradigm for spectral GNNs. We further introduce polynomials to approximate arbitrary filter functions. Extensive experiments demonstrate that the paradigm of hyperbolic PDEs not only exhibits strong flexibility but also significantly enhances the performance of various spectral GNNs across diverse graph tasks.
Paperid:1785
Authors:Xiyue Zhu · Dou Kwark · Ruike Zhu · Kaiwen Hong · Yiqi Tao · Shirui Luo · Yudu Li · Zhi-Pei Liang · Volodymyr Kindratenko
Abstract:
In volume-to-volume translations in medical images, existing models often struggle to capture the inherent volumetric distribution using 3D voxel-space representations, due to high computational requirements and large dataset demands. We present Score-Fusion, a novel volumetric translation model that effectively learns 3D representations by ensembling perpendicularly trained 2D diffusion models in score function space. By carefully initializing our model to start with an average of 2D models as in TPDM, we reduce 3D training to a fine-tuning process and thereby mitigate both computational and data demands. Furthermore, we explicitly design the 3D model's hierarchical layers to learn ensembles of 2D features, further enhancing efficiency and performance. Moreover, Score-Fusion naturally extends to multi-modality settings, by fusing diffusion models conditioned on different inputs for flexible, accurate integration. We demonstrate that 3D representation is essential for better performance in downstream recognition tasks, such as tumor segmentation, where most segmentation models are based on 3D representation. Extensive experiments demonstrate that Score-Fusion achieves superior accuracy and volumetric fidelity in 3D medical image super-resolution and modality translation. Beyond these improvements, our work also provides broader insight into learning-based approaches for score function fusion.
Paperid:1786
Authors:Huang Huang · Fangchen Liu · Letian Fu · Tingfan Wu · Mustafa Mukadam · Jitendra Malik · Ken Goldberg · Pieter Abbeel
Abstract:
Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions. Existing approaches require fine-tuning pre-trained vision-language models (VLMs) as visual and language features are independently fed into downstream policies, degrading the pre-trained semantic alignments. We propose OTTER, a novel VLA architecture that leverages these existing alignments through explicit, text-aware visual feature extraction. Instead of processing all visual features, OTTER selectively extracts and passes only task-relevant visual features that are semantically aligned with the language instruction to the policy transformer. This allows OTTER to keep the pre-trained vision-language encoders frozen. Thereby, OTTER preserves and utilizes the rich semantic understanding learned from large-scale pre-training, enabling strong zero-shot generalization capabilities. In simulation and real-world experiments, OTTER significantly outperforms existing VLA models, demonstrating strong zero-shot generalization to novel objects and environments.
Paperid:1787
Authors:Moshe Eliasof · Alessio Gravina · Andrea Ceni · Claudio Gallicchio · Davide Bacciu · Carola-Bibiane Schönlieb
Abstract:
Graph State Space Models (SSMs) have recently been introduced to enhance Graph Neural Networks (GNNs) in modeling long-range interactions. Despite their success, existing methods either compromise on permutation equivariance or limit their focus to pairwise interactions rather than sequences. Building on the connection between Autoregressive Moving Average (ARMA) and SSM, in this paper, we introduce GRAMA, a Graph Adaptive method based on a learnable ARMA framework that addresses these limitations. By transforming from static to sequential graph data, GRAMA leverages the strengths of the ARMA framework, while preserving permutation equivariance. Moreover, GRAMA incorporates a selective attention mechanism for dynamic learning of ARMA coefficients, enabling efficient and flexible long-range information propagation. We also establish theoretical connections between GRAMA and Selective SSMs, providing insights into its ability to capture long-range dependencies. Extensive experiments on 22 synthetic and real-world datasets demonstrate that GRAMA consistently outperforms backbone models and performs competitively with state-of-the-art methods.
Paperid:1788
Authors:Zhengzheng Lou · Ke Zhang · Yucong Wu · Shizhe Hu
Abstract:
In an era of increasingly diverse information sources, multi-modal clustering (MMC) has become a key technology for processing multi-modal data. It can apply and integrate the feature information and potential relationships of different modalities. Although there is a wealth of research on MMC, due to the complexity of datasets, a major challenge remains in how to deeply explore the complex latent information and interdependencies between modalities. To address this issue, this paper proposes a method called super deep contrastive information bottleneck for multi-modal clustering (SDCIB), which aims to explore and utilize all types of latent information to the fullest extent. Specifically, the proposed SDCIB explicitly introduces the rich information contained in the encoder's hidden layers into the loss function for the first time, thoroughly mining both modal features and the hidden relationships between modalities. Moreover, by performing dual optimization that simultaneously considers consistency information from both the feature distribution and clustering assignment perspectives, the proposed SDCIB significantly improves clustering accuracy and robustness. We conduct experiments on 4 multi-modal datasets, with the results demonstrating the superiority and clever design of the proposed SDCIB.
Paperid:1789
Authors:Nikita Gushchin · David Li · Daniil Selikhanovych · Evgeny Burnaev · Dmitry Baranchuk · Aleksandr Korotin
Abstract:
Learning diffusion bridge models is easy; making them fast and practical is an art. Diffusion bridge models (DBMs) are a promising extension of diffusion models for applications in image-to-image translation. However, like many modern diffusion and flow models, DBMs suffer from the problem of slow inference. To address it, we propose a novel distillation technique based on the inverse bridge matching formulation and derive the tractable objective to solve it in practice. Unlike previously developed DBM distillation techniques, the proposed method can distill both conditional and unconditional types of DBMs, distill models into a one-step generator, and use only the corrupted images for training. We evaluate our approach for both conditional and unconditional types of bridge matching on a wide set of setups, including super-resolution, JPEG restoration, sketch-to-image, and other tasks, and show that our distillation technique allows us to accelerate the inference of DBMs from 4x to 100x and even provide better generation quality than the teacher model used, depending on the particular setup.
Paperid:1790
Authors:Lily Zhang · Rajesh Ranganath
Abstract:
There has been an explosion of research on the topic of learning from preference data, following the success of RLHF in high-profile language model training efforts. However, preference learning is far from standardized, making it difficult to understand where to focus efforts with respect to current practice or research investment. This work presents a framework to understand preference learning from the ground up, starting from the sampling distribution of pairwise preference comparison data. First, we show that the only evaluation of a generative model that respects both preferences and prevalences in the preference data sampling distribution is win rate. This result suggests that everything should be understood through win rate. Thus, we characterize the space of preference learning into win rate optimization (WRO) and non-WRO objectives, highlighting the theoretical benefits of examples in the former and limitations of examples in the latter. We additionally conduct an empirical analysis of WRO and non-WRO objectives which highlights the practical importance of ease and success of optimization. Our analysis provides insights for existing practice as well as concrete guidance for future research---namely, future efforts should focus on either developing objectives more closely aligned with WRO or improving the optimization of WRO objectives.
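The win-rate evaluation that this abstract centers on can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the judge function here (preferring the longer completion) is a hypothetical stand-in for a learned preference model or human annotator.

```python
def win_rate(model_a_outputs, model_b_outputs, prefer):
    """Estimate the win rate of model A over model B on paired outputs.

    prefer(y_a, y_b) returns 1 if y_a is preferred, 0 otherwise.
    The estimate averages pairwise preference judgments, so it respects
    both the preferences and the prevalence of prompts in the sample.
    """
    wins = sum(prefer(a, b) for a, b in zip(model_a_outputs, model_b_outputs))
    return wins / len(model_a_outputs)

# Toy judge: prefer the longer completion (a stand-in for a real judge).
prefer_longer = lambda a, b: 1 if len(a) > len(b) else 0

a_out = ["a detailed answer", "short", "an elaborate reply"]
b_out = ["ok", "a much longer response here", "hm"]
print(win_rate(a_out, b_out, prefer_longer))  # 2 wins out of 3
```

A WRO objective would then adjust model A's sampling distribution to increase this quantity directly, whereas non-WRO objectives optimize surrogates that need not move it.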
Paperid:1791
Authors:Marten Lienen · Abdullah Saydemir · Stephan Günnemann
Abstract:
State space models are emerging as a dominant model class for sequence problems with many relying on the HiPPO framework to initialize their dynamics. However, HiPPO fundamentally assumes data to be noise-free; an assumption often violated in practice. We extend the HiPPO theory with measurement noise and derive an uncertainty-aware initialization for state space model dynamics. In our analysis, we interpret HiPPO as a linear stochastic control problem where the data enters as a noise-free control signal. We then reformulate the problem so that the data become noisy outputs of a latent system and arrive at an alternative dynamics initialization that infers the posterior of this latent system from the data without increasing runtime. Our experiments show that our initialization improves the resistance of state-space models to noise both at training and inference time.
Paperid:1792
Authors:Charlie Tan · Joey Bose · Chen Lin · Leon Klein · Michael Bronstein · Alexander Tong
Abstract:
Scalable sampling of molecular states in thermodynamic equilibrium is a long-standing challenge in statistical physics. Boltzmann generators tackle this problem by pairing normalizing flows (NF) with importance sampling to obtain statistically independent samples under the target distribution. In this paper, we propose Sequential Boltzmann Generators (SBG), in which the importance sampling of Boltzmann generators is replaced with Sequential Monte Carlo (SMC). Notably, this enables the transport of proposal samples towards the target distribution, beyond the reweighting of standard Boltzmann generators. This transport process facilitates inference-time scaling through finer time discretizations. SBG necessitates efficient computation of sample likelihoods, leading us to employ an efficient general-purpose Transformer-based normalizing flow operating directly on all-atom Cartesian coordinates. SBG is more computationally efficient as no explicit equivariance constraint is encoded in the NF, but is instead softly introduced via data augmentations. SBG enjoys guaranteed sample improvement by applying SMC along the transport path, leading to higher sampling efficiency. SBG achieves state-of-the-art performance w.r.t. all metrics on molecular systems, demonstrating the first equilibrium sampling in Cartesian coordinates of tri-, tetra-, and hexapeptides that were so far intractable for prior Boltzmann generators.
Paperid:1793
Authors:Simone Drago · Marco Mussi · Alberto Maria Metelli
Abstract:
In the standard Reinforcement Learning (RL) paradigm, the action space is assumed to be fixed and immutable throughout the learning process. However, in many real-world scenarios, not all actions are available at every decision stage. The available action set may depend on the current environment state, domain-specific constraints, or other (potentially stochastic) factors outside the agent's control. To address these realistic scenarios, we introduce a novel paradigm called *Sleeping Reinforcement Learning*, where the available action set varies during the interaction with the environment. We start with the simpler scenario in which the available action sets are revealed at the beginning of each episode. We show that a modification of UCBVI achieves regret of order $\widetilde{\mathcal{O}}(H\sqrt{SAT})$, where $H$ is the horizon, $S$ and $A$ are the cardinalities of the state and action spaces, respectively, and $T$ is the learning horizon. Next, we address the more challenging and realistic scenario in which the available actions are disclosed only at each decision stage. By leveraging a novel construction, we establish a minimax lower bound of order $\Omega(\sqrt{T 2^{A/2}})$ when the availability of actions is governed by a Markovian process, establishing a statistical barrier of the problem. Focusing on the statistically tractable case where action availability depends only on the current state and stage, we propose a new optimistic algorithm that achieves regret guarantees of order $\widetilde{\mathcal{O}}(HA\sqrt{ST})$.
Paperid:1794
Authors:Peiyuan Zhang · Yongqi Chen · Runlong Su · Hangliang Ding · Ion Stoica · Zhengzhong Liu · Hao Zhang
Abstract:
Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost -- when generating just a 5-second 720P video, attention alone takes 800 out of 950 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA leverages the observation that attention scores in pretrained video diffusion models predominantly concentrate within localized 3D windows. By sliding and attending over local spatial-temporal regions, STA eliminates redundancy from full attention. Unlike traditional token-wise sliding window attention (SWA), STA operates tile-by-tile with a novel hardware-aware sliding window design, preserving expressiveness while being \emph{hardware-efficient}. With careful kernel-level optimizations, STA offers the first efficient 2D/3D sliding-window-like attention implementation, achieving 58.79\% MFU -- 7.17× faster than prior art. On the leading video DiT model, Hunyuan, it accelerates attention by 1.6–10x over FlashAttention-3, yielding a 1.36–3.53× end-to-end speedup with no or minimal quality loss.
Paperid:1795
Authors:Xinyan Liang · Shijie Wang · Yuhua Qian · Qian Guo · Liang Du · Bingbing Jiang · Tingjin Luo · Feijiang Li
Abstract:
Multi-view classification (MVC) based on the Dempster-Shafer theory has gained significant recognition for its reliability in safety-critical applications. However, existing methods predominantly focus on providing confidence levels for decision outcomes without explaining the reasoning behind these decisions. Moreover, the reliance on first-order statistical magnitudes of belief masses often inadequately captures the intrinsic uncertainty within the evidence. To address these limitations, we propose a novel framework termed Trusted Multi-view Classification Constrained with Expert Knowledge (TMCEK). TMCEK integrates expert knowledge to enhance feature-level interpretability and introduces a distribution-aware subjective opinion mechanism to derive more reliable and realistic confidence estimates. The theoretical superiority of the proposed uncertainty measure over conventional approaches is rigorously established. Extensive experiments conducted on three multi-view datasets for sleep stage classification demonstrate that TMCEK achieves state-of-the-art performance while offering interpretability at both the feature and decision levels. These results position TMCEK as a robust and interpretable solution for MVC in safety-critical domains. (The code will be made publicly available.)
Paperid:1796
Authors:Xupeng Zhu · Fan Wang · Robin Walters · Jane Shi
Abstract:
Diffusion Policies are effective at learning closed-loop manipulation policies from human demonstrations but generalize poorly to novel arrangements of objects in 3D space, hurting real-world performance. To address this issue, we propose Spherical Diffusion Policy (SDP), an SE(3) equivariant diffusion policy that adapts trajectories according to 3D transformations of the scene. Such equivariance is achieved by embedding the states, actions, and the denoising process in spherical Fourier space. Additionally, we employ novel spherical FiLM layers to condition the action denoising process equivariantly on the scene embeddings. Lastly, we propose a spherical denoising temporal U-net that achieves spatiotemporal equivariance with computational efficiency. In the end, SDP is end-to-end SE(3) equivariant, allowing robust generalization across transformed 3D scenes. SDP demonstrates a large performance improvement over strong baselines in 20 simulation tasks and 5 physical robot tasks including single-arm and bi-manual embodiments.
Paperid:1797
Authors:Haihan Zhang · Weicheng Lin · Yuanshi Liu · Cong Fang
Abstract:
This paper considers a canonical problem in kernel regression: how well does a model perform when trained by the popular online first-order algorithms, compared to offline ones such as ridge and ridgeless regression? We analyze the foundational single-pass Stochastic Gradient Descent (SGD) in kernel regression under various source conditions where the optimal predictor may not even belong to the RKHS, i.e. the model is misspecified. Specifically, we focus on the inner product kernel over the sphere and characterize the exact orders of the excess risk curves under different scales of sample sizes $n$ concerning the input dimension $d$. Surprisingly, we show that SGD achieves minimax optimal rates up to constants among all the scales, without suffering the saturation, a prevalent phenomenon observed in (ridge) regression, except when the model is highly misspecified and the learning is in a final stage where $n \gg d^\gamma$ for any constant $\gamma > 0$. The main reason SGD overcomes the curse of saturation is the exponentially decaying step size schedule, a common practice in deep neural network training. As a byproduct, we provide the first provable advantage of this scheme over the iterate averaging method in the common setting.
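The single-pass kernel SGD with a stagewise exponentially decaying step size that this abstract discusses can be sketched as follows. This is an illustrative toy, not the paper's analysis setup: the stage count, base step size, and the linear inner-product kernel are assumptions chosen for a runnable example.

```python
import numpy as np

def kernel_sgd(X, y, kernel, eta0=0.5, stages=5):
    """Single-pass SGD for kernel regression, with the step size
    halved at each of `stages` equal-length phases (an exponentially
    decaying schedule). Returns dual coefficients alpha, so that the
    fitted predictor is f(x) = sum_j alpha[j] * kernel(x_j, x)."""
    n = len(y)
    alpha = np.zeros(n)
    stage_len = max(1, n // stages)
    for t in range(n):
        eta = eta0 * 0.5 ** (t // stage_len)
        pred = alpha[:t] @ kernel(X[:t], X[t]) if t else 0.0
        alpha[t] = eta * (y[t] - pred)  # stochastic gradient step on sample t
    return alpha

# Inner-product kernel, with inputs normalized to the sphere.
dot_kernel = lambda A, x: A @ x

rng = np.random.default_rng(0)
d, n = 5, 400
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # points on the sphere
w_star = rng.normal(size=d)
y = X @ w_star + 0.1 * rng.normal(size=n)       # noisy linear target

alpha = kernel_sgd(X, y, dot_kernel)
w_hat = alpha @ X                               # recover the linear predictor
print(np.linalg.norm(w_hat - w_star))           # small residual error
```

The large early step sizes make fast initial progress, while the geometric decay shrinks the noise floor in the final stages; this schedule is what the paper credits for avoiding saturation.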
Paperid:1798
Authors:Henry Moss · Sebastian Ober · Tom Diethe
Abstract:
Bayesian optimisation in the latent space of a VAE is a powerful framework for optimisation tasks over complex structured domains, such as the space of valid molecules. However, existing approaches tightly couple the surrogate and generative models, which can lead to suboptimal performance when the latent space is not tailored to specific tasks and has in turn led to the proposal of increasingly sophisticated algorithms. In this work, we explore a new direction, instead proposing a decoupled approach that trains a generative model and a GP surrogate separately, then combines them via a simple yet principled Bayesian update rule. This separation allows each component to focus on its strengths: structure generation from the VAE and predictive modelling by the GP. We show that our decoupled approach improves our ability to identify high-potential candidates in molecular optimisation problems under constrained evaluation budgets.
Paperid:1799
Authors:MYUNG-SIK CHO · Jong Eui Park · Jeonghye Kim · Youngchul Sung
Abstract:
Multi-task reinforcement learning (RL) encounters significant challenges due to varying task complexities and reward distributions from the environment. To address these issues, in this paper, we propose Adaptive Reward Scaling (ARS), a novel framework that dynamically adjusts reward magnitudes and leverages a periodic network reset mechanism. ARS introduces a history-based reward scaling strategy that ensures balanced reward distributions across tasks, enabling stable and efficient training. The reset mechanism complements this approach by mitigating overfitting and ensuring robust convergence. Empirical evaluations on the Meta-World benchmark demonstrate that ARS significantly outperforms baseline methods, achieving superior performance on challenging tasks while maintaining overall learning efficiency. These results validate ARS's effectiveness in tackling diverse multi-task RL problems, paving the way for scalable solutions in complex real-world applications.
Paperid:1800
Authors:Yu-Neng Chuang · Helen Zhou · Prathusha Sarma · Parikshit Gopalan · John Boccio · Sara Bolouki · Xia Hu
Abstract:
Large language models (LLMs) have demonstrated impressive performance on several tasks and are increasingly deployed in real-world applications. However, especially in high-stakes settings, it becomes vital to know when the output of an LLM may be unreliable. Depending on whether an answer is trustworthy, a system can then choose to route the question to another expert, or otherwise fall back on a safe default behavior. In this work, we study the extent to which LLMs can reliably indicate confidence in their answers, and how this notion of confidence can translate into downstream accuracy gains. We propose Self-REF, a lightweight training strategy to teach LLMs to express confidence in whether their answers are correct in a reliable manner. Self-REF introduces confidence tokens into the LLM, from which a confidence score can be extracted. Compared to conventional approaches such as verbalizing confidence and examining token probabilities, we demonstrate empirically that confidence tokens show significant improvements in downstream routing and rejection learning tasks.
Paperid:1801
Authors:Abhinav Agrawal · Justin Domke
Abstract:
Predictive posterior densities (PPDs) are essential in approximate inference for quantifying predictive uncertainty and comparing inference methods. Typically, PPDs are estimated by simple Monte Carlo (MC) averages. In this paper, we expose a critical under-recognized issue: the signal-to-noise ratio (SNR) of the simple MC estimator can sometimes be extremely low, leading to unreliable estimates. Our main contribution is a theoretical analysis demonstrating that even with exact inference, SNR can decay rapidly with an increase in (a) the mismatch between training and test data, (b) the dimensionality of the latent space, or (c) the size of test data relative to training data. Through several examples, we empirically verify these claims and show that these factors indeed lead to poor SNR and unreliable PPD estimates (sometimes, estimates are off by hundreds of nats even with a million samples). While not the primary focus, we also explore an adaptive importance sampling approach as an illustrative way to mitigate the problem, where we learn the proposal distribution by maximizing a variational proxy to the SNR. Taken together, our findings highlight an important challenge and provide essential insights for reliable estimation.
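The SNR collapse under train/test mismatch described above can be reproduced in a toy conjugate-Gaussian model. This is a hedged illustration, not the paper's experiment: the posterior, likelihood, sample sizes, and the two test points are all assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(1)

def ppd_mc(y_star, n_samples=1000):
    """Simple MC estimate of the predictive density p(y*) under a toy
    model: posterior theta ~ N(0, 1), likelihood y | theta ~ N(theta, 1).
    The estimator averages likelihood values over posterior draws."""
    theta = rng.normal(size=n_samples)
    lik = np.exp(-0.5 * (y_star - theta) ** 2) / np.sqrt(2 * np.pi)
    return lik.mean()

def empirical_snr(y_star, reps=200):
    """SNR of the MC estimator: mean over replicates divided by std."""
    ests = np.array([ppd_mc(y_star) for _ in range(reps)])
    return ests.mean() / ests.std()

snr_matched = empirical_snr(0.0)    # test point agrees with the posterior
snr_shifted = empirical_snr(5.0)    # test point far from the posterior mass
print(snr_matched, snr_shifted)     # the shifted SNR is far lower
```

For the shifted test point, nearly all posterior draws contribute negligible likelihood and the average is dominated by rare draws near y*, which is exactly the heavy-tailed behavior that makes simple MC estimates of PPDs unreliable under mismatch.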
Paperid:1802
Authors:Yuheng Lai · Leying Guan
Abstract:
Equalized odds, an important notion of algorithmic fairness, aims to ensure that sensitive variables, such as race and gender, do not unfairly influence the algorithm's prediction when conditioning on the true outcome. Despite rapid advancements, current research primarily focuses on equalized odds violations caused by a single sensitive attribute, leaving the challenge of simultaneously accounting for multiple attributes under-addressed. We bridge this gap by introducing an in-processing fairness-aware learning approach, FairICP, which integrates adversarial learning with a novel inverse conditional permutation scheme. FairICP offers a flexible and efficient scheme to promote equalized odds under fairness conditions described by complex and multi-dimensional sensitive attributes. The efficacy and adaptability of our method are demonstrated through both simulation studies and empirical analyses of real-world datasets.
Paperid:1803
Authors:Andrei Muresanu · Anvith Thudi · Michael Zhang · Nicolas Papernot
Abstract:
Modern machine learning models are expensive to train, and there is a growing concern about the challenge of retroactively removing specific training data. Achieving exact unlearning in deep learning pipelines—producing models as if certain data had never been included in training—remains an open problem. In this paper, we revisit exact unlearning in deep learning and show that for large language models (LLMs) we can efficiently exactly unlearn "fine-tuning data" (the data used to adapt a pre-trained model). This follows from two observations. First, we can use in-context learning to adapt the LLM to the fine-tuning dataset instead of SGD-based algorithms. Second, we show that accurate in-context learning can be done with quantized k-means, which allows for effectively constant time unlearning operations. Our empirical evaluation shows that this unlearning recipe has similar performance to fine-tuning alternatives, but vastly reduces the unlearning costs. Our study also highlights the need for new measures of unlearning cost when adapting the learning algorithm to have faster unlearn operations.
Paperid:1804
Authors:Junghyun Lee · Kyoungseok Jang · Kwang-Sung Jun · Milan Vojnovic · Se-Young Yun
Abstract:
We present GL-LowPopArt, a novel Catoni-style estimator for generalized linear low-rank trace regression. Building on LowPopArt (Jang et al., 2024), it employs a two-stage approach: nuclear norm regularization followed by matrix Catoni estimation. We establish state-of-the-art estimation error bounds, surpassing existing guarantees (Fan et al., 2019; Kang et al., 2022), and reveal a novel experimental design objective, $\mathrm{GL}(\pi)$. The key technical challenge is controlling bias from the nonlinear inverse link function, which we address by our two-stage approach. We prove a *local* minimax lower bound, showing that GL-LowPopArt enjoys instance-wise optimality up to the condition number of the ground-truth Hessian. Applications include generalized linear matrix completion, where GL-LowPopArt achieves a state-of-the-art Frobenius error guarantee, and **bilinear dueling bandits**, a novel setting inspired by general preference learning (Zhang et al., 2024). Our analysis of a GL-LowPopArt-based explore-then-commit algorithm reveals a new, potentially interesting problem-dependent quantity, along with an improved Borda regret bound compared to vectorization (Wu et al., 2024).
Paperid:1805
Authors:Yanbo Wang · Xiyuan Wang · Quan Gan · Minjie Wang · Qibin Yang · David Wipf · Muhan Zhang
Abstract:
We introduce Griffin, the first foundation model designed specifically for Relational Databases (RDBs). Unlike previous smaller models focused on single RDB tasks, Griffin unifies the input encoder and prediction heads to handle diverse tasks. Additionally, we enhance the architecture by incorporating a cross-attention module and a novel aggregator. Griffin utilizes pretraining on both single-table and RDB datasets, employing advanced encoders for categorical, numerical, and metadata features, along with innovative components such as cross-attention modules and enhanced message-passing neural networks (MPNNs) to capture the complexities of relational data. Evaluated on large-scale, heterogeneous, and temporal graphs extracted from RDBs across various domains (spanning over 150 million nodes), Griffin demonstrates superior or comparable performance to individually trained models, excels in low-data scenarios, and shows strong transferability with similarity and diversity in pretraining across new datasets and tasks, highlighting its potential as a universally applicable foundation model for RDBs.
Paperid:1806
Authors:Nurbek Tastan · Samuel Horváth · Karthik Nandakumar
Abstract:
Collaborative learning enables multiple participants to learn a single global model by exchanging focused updates instead of sharing data. One of the core challenges in collaborative learning is ensuring that participants are rewarded fairly for their contributions, which entails two key subproblems: contribution assessment and reward allocation. This work focuses on fair reward allocation, where the participants are incentivized through model rewards - differentiated final models whose performance is commensurate with the contribution. In this work, we leverage the concept of slimmable neural networks to collaboratively learn a shared global model whose performance degrades gracefully with a reduction in model width. We also propose a post-training fair allocation algorithm that determines the model width for each participant based on their contributions. We theoretically study the convergence of our proposed approach and empirically validate it using extensive experiments on different datasets and architectures. We also extend our approach to enable training-time model reward allocation.
Paperid:1807
Authors:Sumin Cho · Dongwon Kim · Kwangsu Kim
Abstract:
Domain Generalization (DG) aims to train models that generalize to unseen target domains but often overfit to domain-specific features, known as spurious correlations. Gradient-based DG methods typically guide gradients in a dominant direction but may inadvertently reinforce undesired correlations. Recent work employed dropout to regularize parameters with over-predictive directions, but did not explicitly adjust gradient alignment or ensure balanced parameter updates. In this paper, we propose GENIE (Generalization-ENhancing Iterative Equalizer), a novel optimizer that leverages the One-Step Generalization Ratio (OSGR) to quantify each parameter's contribution to loss reduction and assess gradient alignment. By dynamically equalizing OSGR through a preconditioning factor, GENIE prevents a small subset of parameters from dominating optimization, promoting domain-invariant feature learning. Theoretically, it balances convergence contribution and gradient alignment among parameters, achieving a higher OSGR while maintaining SGD’s convergence rate. Empirically, it outperforms existing optimizers and enhances performance when integrated with various DG and single DG methods.
Paperid:1808
Authors:Yifan Sun · Han Wang · Dongbai Li · Gang Wang · Huan Zhang
Abstract:
Benchmark Data Contamination (BDC)—the inclusion of benchmark testing samples in the training set—has raised increasing concerns in Large Language Model (LLM) evaluation, leading to falsely inflated performance estimates and undermining evaluation reliability. To address this, researchers have proposed various mitigation strategies to update existing benchmarks, including modifying original questions or generating new ones based on them. However, a rigorous examination of the effectiveness of these mitigation strategies remains lacking. In this paper, we design a systematic and controlled pipeline along with two novel metrics, fidelity and contamination resistance, to provide a fine-grained and comprehensive assessment of existing BDC mitigation strategies. Previous assessment methods, such as accuracy drop and accuracy matching, focus solely on aggregate accuracy, often leading to incomplete or misleading conclusions. Our metrics address this limitation by emphasizing question-level evaluation result matching. Extensive experiments with 10 LLMs, 5 benchmarks, 20 BDC mitigation strategies, and 2 contamination scenarios reveal that no existing strategy significantly improves resistance over the vanilla case (i.e., no benchmark update) across all benchmarks, and none effectively balances fidelity and contamination resistance. These findings underscore the urgent need for designing more effective BDC mitigation strategies.
Paperid:1809
Authors:Yuxiao Qu · Matthew Yang · Amrith Setlur · Lewis Tunstall · Edward Beeching · Ruslan Salakhutdinov · Aviral Kumar
Abstract:
Training models to efficiently use test-time compute is crucial for improving the reasoning performance of LLMs. While current methods mostly do so via fine-tuning on search traces or running RL against the 0/1 outcome reward, do these approaches efficiently utilize test-time compute? Would these approaches continue to scale as the budget grows? In this paper, we try to answer these questions. We formalize the problem of optimizing test-time compute as a meta reinforcement learning (RL) problem, which provides a principled perspective on spending test-time compute from the lens of exploration and exploitation. It also motivates the use of cumulative regret to measure the efficacy of test-time compute by viewing a long output stream as consisting of several episodes from the model. While current state-of-the-art models do not optimize regret, we show that regret can be minimized by running final 0/1 reward RL regularized by a dense reward bonus, given by the "information gain" from each subsequent block in the output stream. We prescribe an approach for quantifying information gain, which measures the utility of an intermediate segment of tokens towards improving accuracy of the final answer. We instantiate this idea to develop MRT, a new class of fine-tuning methods for optimizing test-time compute. Fine-tuning with MRT leads to substantial improvements in both performance and token efficiency on the AIME dataset.
Paperid:1810
Authors:Shubham Gupta · Thibaut Durand · Graham Taylor · Lilian Bialokozowicz
Abstract:
We present a novel prompt design for Large Language Models (LLMs) tailored to Asynchronous Time Series. Unlike regular time series, which assume values at evenly spaced time points, asynchronous time series consist of events occurring at irregular intervals, each described in natural language. Our approach effectively utilizes the rich natural language of event descriptions, allowing LLMs to benefit from their broad world knowledge for reasoning across different domains and tasks. This allows us to extend the scope of asynchronous time series analysis beyond forecasting to include tasks like anomaly detection and data imputation. We further introduce Stochastic Soft Prompting, a novel prompt-tuning mechanism that significantly improves model performance, outperforming existing fine-tuning methods such as QLoRA. Through extensive experiments on real-world datasets, we demonstrate that our approach achieves state-of-the-art performance across different tasks and datasets.
Paperid:1811
Authors:Tingyi Cai · Yunliang Jiang · Ming Li · Lu Bai · Changqin Huang · Yi Wang
Abstract:
With the growing adoption of Hypergraph Neural Networks (HNNs) to model higher-order relationships in complex data, concerns about their security and robustness have become increasingly important. However, current security research often overlooks the unique structural characteristics of hypergraph models when developing adversarial attack and defense strategies. To address this gap, we demonstrate that hypergraphs are particularly vulnerable to node injection attacks, which align closely with real-world applications. Through empirical analysis, we develop a novel unnoticeable attack approach by monitoring changes in homophily and leveraging this self-regulating property to enhance stealth. Building on these insights, we introduce HyperNear, i.e., unnoticeable Node InjEction Attacks on HypeRgraph neural networks, the first node injection attack framework specifically tailored for HNNs. HyperNear integrates homophily-preserving strategies to optimize both stealth and attack effectiveness. Extensive experiments show that HyperNear achieves excellent performance and generalization, marking the first comprehensive study of adversarial attacks on hypergraphs in a black-box setting. Our code is available at https://anonymous.4open.science/r/HyperNear-C0CA.
Paperid:1812
Authors:Yichuan Deng · Xiaoyu Li · Zhao Song · OMRI WEINSTEIN
Abstract:
A recent work by [Larsen, SODA 2023] introduced a faster combinatorial alternative to Bansal's SDP algorithm for finding a coloring $x \in \{-1, 1\}^n$ that approximately minimizes the discrepancy $\mathrm{disc}(A, x) := \| A x \|_{\infty}$ of a real-valued $m \times n$ matrix $A$. Larsen's algorithm runs in $\widetilde{O}(mn^2)$ time compared to Bansal's $\widetilde{O}(mn^{4.5})$-time algorithm, with a slightly weaker logarithmic approximation ratio in terms of the hereditary discrepancy of $A$ [Bansal, FOCS 2010]. We present a combinatorial $\widetilde{O}(\mathrm{nnz}(A) + n^3)$-time algorithm with the same approximation guarantee as Larsen's, optimal for tall matrices where $m = \mathrm{poly}(n)$. Using a more intricate analysis and fast matrix multiplication, we further achieve a runtime of $\widetilde{O}(\mathrm{nnz}(A) + n^{2.53})$, breaking the cubic barrier for square matrices and surpassing the limitations of linear-programming approaches [Eldan and Singh, RS\&A 2018]. Our algorithm relies on two key ideas: (i) a new sketching technique for finding a projection matrix with a short $\ell_2$-basis using implicit leverage-score sampling, and (ii) a data structure for efficiently implementing the iterative Edge-Walk partial-coloring algorithm [Lovett and Meka, SICOMP 2015], and using an alternative analysis to enable ``lazy'' batch updates with low-rank corrections. Our results nearly close the computational gap between real-valued and binary matrices, for which input-sparsity time coloring was recently obtained by [Jain, Sah and Sawhney, SODA 2023].
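The discrepancy objective is simple to state directly from its definition. As a sanity check (not the paper's algorithm, which scales to large matrices), a brute-force search over all $2^n$ colorings illustrates it:

```python
from itertools import product

def disc(A, x):
    """Discrepancy disc(A, x) = ||A x||_inf of a coloring x for matrix A."""
    return max(abs(sum(a * xi for a, xi in zip(row, x))) for row in A)

def min_disc(A):
    """Exhaustive search over all 2^n colorings x in {-1, +1}^n (tiny n only)."""
    n = len(A[0])
    return min(disc(A, x) for x in product((-1, 1), repeat=n))

# For the 3x3 all-ones matrix, every row sum is x1 + x2 + x3, which is odd
# for any x in {-1, +1}^3, so the minimum discrepancy is 1.
A = [[1, 1, 1],
     [1, 1, 1],
     [1, 1, 1]]
print(min_disc(A))  # 1
```

Exhaustive search is exponential in $n$; the point of the paper is achieving near-input-sparsity time while matching Larsen's approximation guarantee.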
Paperid:1813
Authors:Yifeng Wang · Xueying Zhan · Siyu Huang
Abstract:
As deep learning continues to evolve, the need for data efficiency becomes increasingly important. Since labeling large datasets is both time-consuming and expensive, active learning (AL) provides a promising solution to this challenge by iteratively selecting the most informative subsets of examples to train deep neural networks, thereby reducing the labeling cost. However, the effectiveness of different AL algorithms can vary significantly across data scenarios, and determining which AL algorithm best fits a given task remains a challenging problem. This work presents the first differentiable AL strategy search method, named AutoAL, which is designed on top of existing AL sampling strategies. AutoAL consists of two neural nets, named SearchNet and FitNet, which are optimized concurrently under a differentiable bi-level optimization framework. For any given task, SearchNet and FitNet are iteratively co-optimized using the labeled data, learning how well a set of candidate AL algorithms perform on that task. With the optimal AL strategies identified, SearchNet selects a small subset from the unlabeled pool for querying their annotations, enabling efficient training of the task model. Experimental results demonstrate that AutoAL consistently achieves superior accuracy compared to all candidate AL algorithms and other selective AL approaches, showcasing its potential for adapting and integrating multiple existing AL methods across diverse tasks and domains.
Paperid:1814
Authors:Mahavir Dabas · Si Chen · Charles Fleming · Ming Jin · Ruoxi Jia
Abstract:
Safety alignment is crucial for Large Language Models (LLMs) to resist malicious instructions but often results in over-refusals, where benign prompts are unnecessarily rejected, impairing user experience and model utility. To address this, we introduce ACTOR (Activation-Based Training for Over-Refusal Reduction), a robust and compute- and data-efficient training framework that minimizes over-refusals by utilizing internal activation patterns from diverse queries. ACTOR precisely identifies and adjusts the activation components that trigger refusals, providing stronger control over the refusal mechanism. By fine-tuning only a single model layer, ACTOR effectively reduces over-refusals across multiple benchmarks while maintaining the model's ability to handle harmful queries and preserving overall utility.
Paperid:1815
Authors:Songtao Lu
Abstract:
Extensive research has showcased that a wide range of machine learning problems can be formulated as bilevel optimization, where two levels of learning processes intertwine through distinct sets of optimization variables. However, prevailing approaches often impose stringent assumptions, such as strong convexity of the lower level loss function or uniqueness of the optimal solution, to enable algorithmic development and convergence analysis. These assumptions, though, tend to be overly restrictive in real-world scenarios. In this work, we explore a recently popularized Moreau envelope based reformulation of bilevel optimization problems, accommodating nonconvex objective functions at both levels. We introduce a smoothed primal-dual first-order stochastic algorithm capable of effectively finding Karush-Kuhn-Tucker (KKT) points for this class of nonconvex bilevel optimization problems. A key feature of our algorithm is its ability to dynamically weigh the lower level problems, enhancing its performance particularly in stochastic learning scenarios. Numerical experiments underscore the superiority of our proposed algorithm over existing penalty-based methods in terms of both convergence rate and test accuracy.
Paperid:1816
Authors:Ting Jiang · Hancheng Ye · Yixiao Wang · Zishan Shao · Jingwei Sun · Jingyang Zhang · Jianyi Zhang · Zekai Chen · Yiran Chen · Hai Li
Abstract:
Diffusion models have achieved remarkable success in generative tasks but suffer from high computational costs due to their iterative sampling process and quadratic complexity in attention mechanisms. In this paper, we propose \textbf{Stability-guided Adaptive Diffusion Acceleration (SADA)}, a novel paradigm that unifies step-wise and token-wise sparsity decisions using a shared criterion based on the denoised latent $x_0$. By aligning with modern numerical solvers that rely heavily on $x_0$, SADA offers more stable pruning decisions and preserves important visual details throughout the denoising trajectory. Moreover, SADA includes dedicated mechanisms to approximate both skipped steps and pruned tokens, substantially narrowing the distribution gap caused by aggressive sparsity. Extensive experiments on Stable Diffusion 2 and Stable Diffusion XL demonstrate that SADA significantly accelerates inference without compromising image quality. Notably, SADA achieves an LPIPS of 0.118 on SDXL, outperforming popular training-free methods (e.g., DeepCache) while reducing inference time by up to 1.52×.
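A hypothetical sketch of the flavor of stability criterion described above (the abstract does not specify SADA's exact rule; the tolerance `tol` and the relative-change test are illustrative): skip a denoising step when the current $x_0$ estimate has barely moved.

```python
def should_skip(x0_prev, x0_curr, tol=1e-2):
    """Skip a step when the relative change in the denoised estimate x0 is
    below tol; x0_prev and x0_curr are flattened latent vectors."""
    num = sum((a - b) ** 2 for a, b in zip(x0_prev, x0_curr)) ** 0.5
    den = sum(b ** 2 for b in x0_curr) ** 0.5
    return den > 0 and num / den < tol
```

In a real sampler this decision would be applied per step (and, analogously, per token for token-wise pruning), with the approximation mechanisms the abstract mentions filling in the skipped computation.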
Paperid:1817
Authors:Yuan Li · Jun Hu · Zemin Liu · Bryan Hooi · Jia Chen · Bingsheng He
Abstract:
Graph Neural Networks (GNNs) face significant computational challenges when handling large-scale graphs. To address this, Graph Condensation (GC) methods aim to compress large graphs into smaller, synthetic ones that are more manageable for GNN training. Recently, trajectory matching methods have shown state-of-the-art (SOTA) performance for GC, aligning the model's training behavior on a condensed graph with that on the original graph by guiding the trajectory of model parameters. However, these approaches require repetitive GNN retraining during condensation, making them computationally expensive. To address the efficiency issue, we completely bypass trajectory matching and propose a novel two-stage framework. The first stage, a precomputation stage, performs one-time message passing to extract structural and semantic information from the original graph. The second stage, a diversity-aware adaptation stage, performs class-wise alignment while maximizing the diversity of synthetic features. Remarkably, even with just the precomputation stage, which takes only seconds, our method either matches or surpasses five out of six baseline results. Extensive experiments show that our approach achieves comparable or better performance while being 96× to 2,455× faster than SOTA methods, making it more practical for large-scale GNN applications.
Paperid:1818
Authors:Chen Huang · Skyler Seto · Hadi Pouransari · Mehrdad Farajtabar · Raviteja Vemulapalli · Fartash Faghri · Oncel Tuzel · Barry-John Theobald · Joshua M Susskind
Abstract:
Vision foundation models pretrained on massive data encode rich representations of real-world concepts, which can be adapted to downstream tasks by fine-tuning. However, fine-tuning foundation models on one task often leads to the issue of concept forgetting on other tasks. Recent methods of robust fine-tuning aim to mitigate forgetting of prior knowledge without affecting the fine-tuning performance. Knowledge is often preserved by matching the original and fine-tuned model weights or feature pairs. However, such point-wise matching can be too strong, without explicit awareness of the feature neighborhood structures that encode rich knowledge as well. We propose a novel regularization method, Proxy-FDA, that explicitly preserves the structural knowledge in feature space. Proxy-FDA performs Feature Distribution Alignment (using nearest neighbor graphs) between the pre-trained and fine-tuned feature spaces, and the alignment is further improved by informative proxies that are generated dynamically to increase data diversity. Experiments show that Proxy-FDA significantly reduces concept forgetting during fine-tuning, and we find a strong correlation between forgetting and a distributional distance metric (in comparison to L2 distance). We further demonstrate Proxy-FDA's benefits in various fine-tuning settings (end-to-end, few-shot and continual tuning) and across different tasks like image classification, captioning and VQA.
Paperid:1819
Authors:Jiwoo Hong · Noah Lee · Eunki Kim · Guijin Son · Woojin Chung · Aman Gupta · Shao Tang · James Thorne
Abstract:
The Bradley-Terry (BT) model is widely used in reward modeling for reinforcement learning with human feedback (RLHF). Despite its effectiveness, reward models (RMs) trained with the BT model loss as one-way classifiers are prone to over-optimization, losing generalizability to unseen inputs. In this paper, we study the cause of over-optimization and its downstream effects on the RLHF procedure, highlighting the importance of robustness in RMs. First, we show that the excessive dispersion of hidden state norms is the main source of over-optimization. Correspondingly, we propose batch-wise sum-to-zero regularization (BSR), which enforces the reward sum for each batch to be zero-centered, constraining rewards with abnormally large magnitudes. We assess the impact of BSR on RM robustness through four over-optimization scenarios, where BSR consistently manifests better robustness on unseen inputs. Then, we compare the plain BT model and BSR on RLHF training and empirically show that robust RMs better align the policy to the gold preference model. Finally, we apply BSR to high-quality data and models, surpassing state-of-the-art RMs at the 8B scale by more than 5\% on complex preference prediction tasks. In RLOO training with the 8B RMs, BSR reduces generation length by 40\% on AlpacaEval 2.0 while adding a 7\% increase in win rate, further highlighting that robustness in RMs induces robustness in RLHF training.
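The batch-wise sum-to-zero idea can be sketched in a few lines. The function names and the weight `lam` below are illustrative assumptions, not the paper's implementation:

```python
import math

def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry loss for one preference pair: -log sigmoid(r_c - r_r)."""
    return math.log(1.0 + math.exp(-(r_chosen - r_rejected)))

def bsr_loss(chosen, rejected, lam=0.01):
    """BT loss plus a penalty that pushes the batch's total reward toward zero,
    discouraging rewards with abnormally large magnitudes."""
    base = sum(bt_loss(c, r) for c, r in zip(chosen, rejected)) / len(chosen)
    batch_sum = sum(chosen) + sum(rejected)
    return base + lam * batch_sum ** 2
```

With `lam=0` this reduces to the plain BT objective; the squared-sum term is what zero-centers rewards batch by batch.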
Paperid:1820
Authors:Yu-Jie Zhang · Peng Zhao · Masashi Sugiyama
Abstract:
Non-stationary online learning has drawn much attention in recent years. Despite considerable progress, dynamic regret minimization has primarily focused on convex functions, leaving ``curved'' functions (e.g., squared or logistic loss) underexplored. In this work, we address this gap by showing that the regret can be substantially improved by leveraging the concept of *mixability*, a property that generalizes exp-concavity to effectively capture loss curvature. Let $d$ denote the dimensionality and $P_T$ the path length of comparators that reflects the environmental non-stationarity. We demonstrate that a simple exponential-weights method with fixed-share updates achieves an $\mathcal{O}(d T^{1/3} P_T^{2/3} \log(T/P_T))$ dynamic regret for mixable losses, improving upon the best-known $\mathcal{O}(d^{10/3} T^{1/3} P_T^{2/3} \log T)$ result (Baby & Wang, 2021) in both $d$ and $P_T$. More importantly, this improvement arises from a simple yet powerful analytical framework that exploits the mixability, avoiding the complicated Karush–Kuhn–Tucker-based analysis of the previous work. Additionally, we illustrate the versatility of our framework by extending it to the gradient-norm bound, broadening its applicability to diverse learning scenarios.
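The exponential-weights-with-fixed-share building block the abstract refers to is the classic Herbster-Warmuth scheme. A minimal sketch of one round (not the paper's full dynamic-regret algorithm; `eta` and `alpha` are illustrative):

```python
import math

def fixed_share_update(weights, losses, eta=0.5, alpha=0.1):
    """One round: exponential reweighting by losses, then mix in a uniform
    share so no expert's weight ever collapses to zero."""
    w = [wi * math.exp(-eta * li) for wi, li in zip(weights, losses)]
    total = sum(w)
    w = [wi / total for wi in w]                         # exponential weights
    n = len(w)
    return [(1 - alpha) * wi + alpha / n for wi in w]    # fixed-share mixing
```

The uniform mixing step is what lets the method track a drifting comparator, which is the source of the path-length dependence in the regret bound.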
Paperid:1821
Authors:Yuki Takezawa · Xiaowen Jiang · Anton Rodomanov · Sebastian Stich
Abstract:
Reducing communication complexity is critical for efficient decentralized optimization. The proximal decentralized optimization (PDO) framework is particularly appealing, as methods within this framework can exploit functional similarity among nodes to reduce communication rounds. Specifically, when local functions at different nodes are similar, these methods achieve faster convergence with fewer communication steps. However, existing PDO methods often require highly accurate solutions to subproblems associated with the proximal operator, resulting in significant computational overhead. In this work, we propose the Stabilized Proximal Decentralized Optimization (SPDO) method, which achieves state-of-the-art communication and computational complexities within the PDO framework. Additionally, we refine the analysis of existing PDO methods by relaxing subproblem accuracy requirements and leveraging average functional similarity. Experimental results demonstrate that SPDO significantly outperforms existing methods.
Paperid:1822
Authors:David Dai · Peilin Chen · Malinda Lu · Daniel A. Li · Haowen Wei · Hejie Cui · Paul Pu Liang
Abstract:
Recent advances in clinical AI have enabled remarkable progress across many clinical domains. However, existing benchmarks and models are primarily limited to a small set of modalities and tasks, which hinders the development of largescale multimodal methods that can make holistic assessments of patient health and well-being. To bridge this gap, we introduce Clinical Large-scale Integrative Multimodal Benchmark (CLIMB), a comprehensive clinical benchmark unifying diverse clinical data across imaging, language, temporal, and graph modalities. CLIMB comprises 4.51 million patient samples totaling 19.01 terabytes distributed across 2D imaging, 3D video, time series, graphs, and multimodal data. Through extensive empirical evaluation, we demonstrate that multitask pretraining significantly improves performance on understudied domains, achieving up to 29% improvement in ultrasound and 23% in ECG analysis over single-task learning. Pretraining on CLIMB also effectively improves models' generalization capability to new tasks, and strong unimodal encoder performance translates well to multimodal performance when paired with task-appropriate fusion strategies. Our findings provide a foundation for new architecture designs and pretraining strategies to advance clinical AI research. Code is released at https://anonymous.4open.science/r/climb_submission-5D0E/.
Paperid:1823
Authors:Xinyu Luo · Cedar Site Bai · Bolian Li · Petros Drineas · Ruqi Zhang · Brian Bullins
Abstract:
While popular optimization methods such as SGD, AdamW, and Lion depend on steepest descent updates in either $\ell_2$ or $\ell_\infty$ norms, there remains a critical gap in handling the non-Euclidean structure observed in modern deep network training. In this work, we address this need by introducing a new accelerated $\ell_p$ steepest descent algorithm, called Stacey, which uses interpolated primal-dual iterate sequences to effectively navigate non-Euclidean smooth optimization tasks. In addition to providing novel theoretical guarantees for the foundations of our algorithm, we empirically compare our approach against these popular methods on tasks including image classification and large language model (LLM) pretraining, demonstrating both faster convergence and higher final accuracy. We further evaluate different values of $p$ across various models and datasets, underscoring the importance and efficiency of non-Euclidean approaches over standard Euclidean methods.
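To make the family concrete (this is the plain, non-accelerated $\ell_p$ steepest-descent direction, not Stacey itself): minimizing $\langle g, d\rangle$ over unit-$\ell_p$-norm directions $d$ gives the gradient direction at $p=2$ and approaches sign descent as $p \to \infty$.

```python
def lp_steepest_direction(grad, p):
    """Unit-l_p-norm direction d minimizing <grad, d>; assumes grad != 0.

    Uses the dual exponent q with 1/p + 1/q = 1: d_i is proportional to
    -sign(g_i) * |g_i|^(q-1), normalized so that ||d||_p = 1.
    """
    q = p / (p - 1.0)
    scale = sum(abs(g) ** q for g in grad) ** (1.0 / q)   # ||grad||_q
    sign = lambda g: (g > 0) - (g < 0)
    return [-sign(g) * (abs(g) / scale) ** (q - 1) for g in grad]
```

At `p=2` this recovers the normalized negative gradient; for large `p` every coordinate's magnitude approaches 1, matching sign-based updates like Lion's.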
Paperid:1824
Authors:Cailong Hua · Sivaraman Rajaganapathy · Rebecca Slick · Joseph Vavra · Joseph Muretta · James Ervasti · Murti Salapaka
Abstract:
Deciphering protein folding and unfolding pathways under tension is essential for deepening our understanding of fundamental biological mechanisms. Such insights hold the promise of developing treatments for a range of debilitating and fatal conditions, including muscular disorders like Duchenne Muscular Dystrophy and neurodegenerative diseases such as Parkinson's disease. Single molecule force spectroscopy (SMFS) is a powerful technique for investigating the forces involved in protein domain folding and unfolding. However, SMFS trials often involve multiple protein molecules, necessitating filtering to isolate measurements from single-molecule trials. Currently, manual visual inspection is the primary method for classifying single-molecule data, a process that is time-consuming and requires significant expertise. In this work, we introduce a novel deep learning model employing a dual-branch fusion strategy: one branch integrates the physics of protein molecules, and the other operates independently of physical constraints. This model automates the isolation of single-molecule measurements, significantly enhancing data processing efficiency. To train and validate our approach, we developed a physics-based Monte Carlo engine to simulate force spectroscopy datasets, including trials involving single molecules, multiple molecules, and no molecules. Our model achieves state-of-the-art performance, outperforming five baseline methods on both simulated and experimental datasets. It attains nearly 100\% accuracy across all simulated datasets and an average accuracy of $79.6 \pm 5.2$\% on experimental datasets, using only $\sim$30 training samples, surpassing baseline methods by 11.4\%. Notably, even without expert annotations on experimental data, the model achieves an average accuracy of $72.0 \pm 5.9$\% when pre-trained on corresponding simulated datasets.
With our deep learning approach, the time required to extract meaningful statistics from single-molecule SMFS trials is reduced from a day to under an hour. This work results in SMFS experimental datasets from four important protein molecules crucial to many biological pathways. To support further research, we have made our datasets publicly available and provided a Python-based toolbox (https://anonymous.4open.science/r/AFM_ML-2B8C).
Paperid:1825
Authors:Jinyao Guo · Chengpeng Wang · Xiangzhe Xu · Zian Su · Xiangyu Zhang
Abstract:
Code auditing is a code review process with the goal of finding bugs. Large Language Models (LLMs) have shown substantial potential in this task, offering the ability to analyze programs without compilation and enabling customized bug detection following specified prompts. However, applying LLMs to repository-level code auditing presents notable challenges. The inherent context limits and hallucinations of LLMs can lead to low-quality bug reports. Meanwhile, the large size of software repositories introduces substantial time and token costs, hindering efficiency and scalability in real-world scenarios. This work introduces an autonomous LLM-agent, RepoAudit, designed to enable precise and efficient repository-level code auditing. Equipped with agent memory, RepoAudit explores the code repository on demand, analyzing data-flow facts along different feasible program paths in individual functions. It also introduces a validator that checks the data-flow facts for hallucination mitigation and examines the satisfiability of path conditions of potential buggy paths, which enables RepoAudit to discard false positives. It is shown that RepoAudit, powered by Claude 3.5 Sonnet, successfully finds 38 true bugs in 15 real-world systems, consuming 0.44 hours and $2.54 per project on average, outperforming existing techniques, including the industrial tools Meta Infer and Amazon CodeGuru.
Paperid:1826
Authors:Jingyuan Wang · Zhimei Ren · Ruohan Zhan · Zhengyuan Zhou
Abstract:
Distributionally robust policy learning aims to find a policy that performs well under the worst-case distributional shift, and yet most existing methods for robust policy learning consider the worst-case *joint* distribution of the covariate and the outcome. The joint-modeling strategy can be unnecessarily conservative when we have more information on the source of distributional shifts. This paper studies a more nuanced problem --- robust policy learning under the *concept drift*, when only the conditional relationship between the outcome and the covariate changes. To this end, we first provide a doubly-robust estimator for evaluating the worst-case average reward of a given policy under a set of perturbed conditional distributions. We show that the policy value estimator enjoys asymptotic normality even if the nuisance parameters are estimated with a slower-than-root-$n$ rate. We then propose a learning algorithm that outputs the policy maximizing the estimated policy value within a given policy class $\Pi$, and show that the sub-optimality gap of the proposed algorithm is of the order $\kappa(\Pi)n^{-1/2}$, where $\kappa(\Pi)$ is the entropy integral of $\Pi$ under the Hamming distance and $n$ is the sample size. A matching lower bound is provided to show the optimality of the rate. The proposed methods are implemented and evaluated in numerical studies, demonstrating substantial improvement compared with existing benchmarks.
Paperid:1827
Authors:Kazuki Egashira · Robin Staab · Mark Vero · Jingxuan He · Martin Vechev
Abstract:
With the increasing size of frontier LLMs, post-training quantization has become the standard for memory-efficient deployment. Recent work has shown that basic rounding-based quantization schemes pose security risks, as they can be exploited to inject malicious behaviors into quantized models that remain hidden in full precision. However, existing attacks cannot be applied to more complex quantization methods, such as the GGUF family used in the popular ollama and llama.cpp frameworks. In this work, we address this gap by introducing the first attack on GGUF. Our key insight is that the quantization error -- the difference between the full-precision weights and their (de-)quantized version -- provides sufficient flexibility to construct malicious quantized models that appear benign in full precision. Leveraging this, we develop an attack that trains the target malicious LLM while constraining its weights based on quantization errors. We demonstrate the effectiveness of our attack on three popular LLMs across nine GGUF quantization data types in three diverse attack scenarios: insecure code generation ($\Delta$=$88.7\%$), targeted content injection ($\Delta$=$85.0\%$), and benign instruction refusal ($\Delta$=$30.1\%$). Our attack highlights that (1) the most widely used post-training quantization method is susceptible to adversarial interference, and (2) the complexity of quantization schemes alone is insufficient as a defense.
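The key insight can be illustrated with a toy uniform round-to-nearest quantizer (not GGUF's actual scheme, which uses block-wise scales and offsets): every full-precision weight inside a quantization bin maps to the same quantized value, and that slack is what the attack optimizes within.

```python
def quantize(w, scale=0.1):
    """Toy uniform round-to-nearest quantizer: map w to its bin index."""
    return round(w / scale)

# Two different full-precision weights that land in the same bin are
# indistinguishable after quantization, so an attacker is free to pick
# any point in the bin when shaping the full-precision model.
print(quantize(0.42) == quantize(0.38))  # True: both map to bin 4
```

An attack in this spirit trains the full-precision weights subject to a per-weight bin constraint, so the quantized model's behavior is controlled while the full-precision model looks benign.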
Paperid:1828
Authors:Ruhan Wang · Zhiyong Wang · Chengkai Huang · Rui Wang · Tong Yu · Lina Yao · John C. S. Lui · Dongruo Zhou
Abstract:
For question-answering (QA) tasks, in-context learning (ICL) enables language models (LMs) to generate responses without modifying their parameters by leveraging examples provided in the input. However, the effectiveness of ICL heavily depends on the availability of high-quality examples, which are often scarce due to data privacy constraints, annotation costs, and distribution disparities. A natural solution is to utilize examples stored on client devices, but existing approaches either require transmitting model parameters—incurring significant communication overhead—or fail to fully exploit local datasets, limiting their effectiveness. To address these challenges, we propose Federated In-Context Learning (Fed-ICL), a general framework that enhances ICL through an iterative, collaborative process. Fed-ICL progressively refines responses by leveraging multi-round interactions between clients and a central server, improving answer quality without the need to transmit model parameters. We establish theoretical guarantees for the convergence of Fed-ICL and conduct extensive experiments on standard QA benchmarks, demonstrating that our proposed approach achieves strong performance while maintaining low communication costs.
Paperid:1829
Authors:Qianxiong Xu · Lanyun Zhu · Xuanyi Liu · Guosheng Lin · Cheng Long · Ziyue Li · Rui Zhao
Abstract:
Few-Shot Segmentation (FSS) aims to learn class-agnostic segmentation on few classes to segment arbitrary classes, but at the risk of overfitting. To address this, some methods use the well-learned knowledge of foundation models (e.g., SAM) to simplify the learning process. Recently, SAM 2 has extended SAM by supporting video segmentation, whose class-agnostic matching ability is useful to FSS. A simple idea is to encode support foreground (FG) features as memory, with which query FG features are matched and fused. Unfortunately, the FG objects in different frames of SAM 2's video data are always the same identity, while those in FSS are different identities, i.e., the matching step is incompatible. Therefore, we design Pseudo Prompt Generator to encode pseudo query memory, matching with query features in a compatible way. However, the memories can never be as accurate as the real ones, i.e., they are likely to contain incomplete query FG, but some unexpected query background (BG) features, leading to wrong segmentation. Hence, we further design Iterative Memory Refinement to fuse more query FG features into the memory, and devise a Support-Calibrated Memory Attention to suppress the unexpected query BG features in memory. Extensive experiments have been conducted on PASCAL-5$^i$ and COCO-20$^i$ to validate the effectiveness of our design, e.g., the 1-shot mIoU can be 4.2\% better than the best baseline. The code will be released upon paper acceptance.
Paperid:1830
Authors:Lucy Farnik · Tim Lawson · Conor Houghton · Laurence Aitchison
Abstract:
Sparse autoencoders (SAEs) have been successfully used to discover sparse and human-interpretable representations of the latent activations of language models (LLMs). However, we would ultimately like to understand the computations performed by LLMs and not just their representations. The extent to which SAEs can help us understand computations is unclear because they are not designed to “sparsify” computations in any sense, only latent activations. To solve this, we propose Jacobian sparse autoencoders (JSAEs), which yield not only sparsity in the input and output activations of a given model component but also sparsity in the computation (formally, the Jacobian) connecting them. With a naïve implementation, the Jacobians in LLMs would be computationally intractable due to their size. Our key technical contribution is thus finding an efficient way of computing Jacobians in this setup. We find that JSAEs extract a relatively large degree of computational sparsity while preserving downstream LLM performance approximately as well as traditional SAEs. We also show that JSAEs achieve a greater degree of computational sparsity on pre-trained LLMs than on the equivalent randomized LLM. This shows that the sparsity of the computational graph appears to be a property that LLMs learn through training, and suggests that JSAEs might be more suitable for understanding learned transformer computations than standard SAEs.
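A minimal illustration of the object being sparsified (not the authors' implementation, which computes Jacobians efficiently at LLM scale): the Jacobian of a tiny one-layer ReLU map, with sparsity measured via its L1 norm.

```python
def jacobian_l1(W, x):
    """L1 norm of the Jacobian of relu(W x) with respect to x.

    For this one-layer map the Jacobian at x is diag(1[W x > 0]) @ W, so
    only rows of W whose pre-activation is positive contribute.
    """
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return sum(abs(w) for row, p in zip(W, pre) if p > 0 for w in row)

W = [[1.0, -2.0],
     [0.5, 0.5]]
print(jacobian_l1(W, [1.0, 1.0]))  # only the second row is active: 0.5 + 0.5
```

Penalizing a term like this during training pushes the input-output computation, not just the activations, toward sparsity, which is the distinction JSAEs draw against standard SAEs.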
Paperid:1831
Authors:Clément Pierquin · Aurélien Bellet · Marc Tommasi · Matthieu Boussard
Abstract:
Synthetic data inherits the differential privacy guarantees of the model used to generate it. Additionally, synthetic data may benefit from privacy amplification when the generative model is kept hidden. Some empirical studies suggest this phenomenon, yet a rigorous theoretical understanding is lacking. In this paper, we investigate this question using the simple and well-understood framework of linear regression. Our findings show that the randomness given as input to the generative model does help to obfuscate private information. Conversely, we establish negative results when the adversary can control the seed of the generative model, identifying strategies to maximize privacy leakage from a single query to the model. We believe our results on linear regression can serve as a foundation for deriving more general bounds in the future.
Paperid:1832
Authors:Ye Wang · Sipeng Zheng · Bin Cao · Qianshan Wei · Weishuai Zeng · Qin Jin · Zongqing Lu
Abstract:
Inspired by the recent success of LLMs, the field of human motion understanding has increasingly shifted toward developing large motion models. Despite some progress, current efforts remain far from achieving truly generalist models, primarily due to the lack of massive high-quality data. To address this gap, we present MotionLib, the first million-level dataset for motion generation, which is at least 15$\times$ larger than existing counterparts and enriched with hierarchical text descriptions. Using MotionLib, we train a large motion model named Puppet, demonstrating robust performance across a wide range of human activities, including unseen ones. Through systematic investigation, for the first time, we highlight the importance of scaling both data and model size for advancing motion generation, along with key insights to achieve this goal. To better integrate the motion modality, we propose Motionbook, an innovative motion encoding approach including: (1) a compact yet lossless feature to represent motions; (2) a novel 2D lookup-free motion tokenizer which preserves fine-grained motion details while expanding codebook capacity, significantly enhancing the representational power of motion tokens. We believe this work lays the groundwork for developing more versatile and powerful motion generation models in the future. Code and database will be released at \url{https://anonymous.4open.science/r/puppet}.
Paperid:1833
Authors:Jifei Luo · Wenzheng Wu · Hantao Yao · Lu Yu · Changsheng Xu
Abstract:
Diffusion-based re-ranking methods are effective in modeling the data manifolds through similarity propagation in affinity graphs. However, positive signals tend to diminish over several steps away from the source, reducing discriminative power beyond local regions. To address this issue, we introduce the Locality Preserving Markovian Transition (LPMT) framework, which employs a long-term thermodynamic transition process with multiple states for accurate manifold distance measurement. The proposed LPMT first integrates diffusion processes across separate graphs using Bidirectional Collaborative Diffusion (BCD) to establish strong similarity relationships. Afterwards, Locality State Embedding (LSE) encodes each instance into a distribution for enhanced local consistency. These distributions are interconnected via the Thermodynamic Markovian Transition (TMT) process, enabling efficient global retrieval while maintaining local effectiveness. Evaluations across various tasks verify the effectiveness of the LPMT for instance retrieval.
Paperid:1834
Authors:Alessandro Manenti · Daniele Zambon · Cesare Alippi
Abstract:
Graph neural networks use relational information as an inductive bias to enhance prediction performance. Often, task-relevant relations are unknown, and graph structure learning approaches have been proposed to learn them from data. Given their latent nature, no graph observations are available to provide a direct training signal to the learnable relations. Therefore, graph topologies are typically learned on the prediction task alongside the other graph neural network parameters. In this paper, we demonstrate that minimizing point-prediction losses does not guarantee proper learning of the latent relational information and its associated uncertainty. Conversely, we prove that suitable loss functions on the stochastic model outputs simultaneously grant solving two tasks: (i) learning the unknown distribution of the latent graph and (ii) achieving optimal predictions of the model output. Finally, we propose a sampling-based method that solves this joint learning task. Empirical results validate our theoretical claims and demonstrate the effectiveness of the proposed approach.
Paperid:1835
Authors:Ruiyi Fang · Bingheng Li · Jingyu Zhao · Ruizhi Pu · QIUHAO Zeng · Gezheng Xu · Charles X. Ling · Boyu Wang
Abstract:
Graph Domain Adaptation (GDA) transfers knowledge from labeled source graphs to unlabeled target graphs, addressing the challenge of label scarcity. In this paper, we highlight the significance of graph homophily, a pivotal factor for graph domain alignment, which, however, has long been overlooked in existing approaches. Specifically, our analysis first reveals that homophily discrepancies exist in benchmarks. Moreover, we also show that homophily discrepancies degrade GDA performance from both empirical and theoretical aspects, which further underscores the importance of homophily alignment in GDA. Inspired by this finding, we propose a novel homophily alignment algorithm that employs mixed filters to smooth graph signals, thereby effectively capturing and mitigating homophily discrepancies between graphs. Experimental results on a variety of benchmarks verify the effectiveness of our method.
Paperid:1836
Authors:Mingkang Zhu · Xi Chen · Zhongdao Wang · Bei Yu · Hengshuang Zhao · Jiaya Jia
Abstract:
Recent advancements in reinforcement learning from human feedback have shown that utilizing fine-grained token-level reward models can substantially enhance the performance of Proximal Policy Optimization (PPO) in aligning large language models. However, it is challenging to leverage such learned token-level reward as guidance for Direct Preference Optimization (DPO), since DPO is formulated as a sentence-level bandit problem. To address this challenge, this work decomposes the sentence-level PPO into a sequence of token-level PPO problems, and then frames token-level PPO with token-level reward guidance, from which the closed-form optimal token-level policy and the corresponding token-level reward can be derived. Using the obtained reward and Bradley-Terry model, this work develops a loss function with token-level reward guidance for DPO, and proposes a practical reward guidance based on the induced DPO reward. This formulation enables different tokens to exhibit varying degrees of deviation from reference policy based on their respective rewards, and facilitates the discovery of satisfactory policies upon loss convergence, a property that is rarely observed in conventional preference optimization methods. Experiment results demonstrate that our method achieves substantial performance improvements over DPO, with win rate gains of up to 7.5 points on MT-Bench, 6.2 points on AlpacaEval 2, and 4.3 points on Arena-Hard.
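For context, the standard sentence-level DPO objective that this work refines can be sketched in a few lines; the log-probability values and beta below are made-up numbers, and this is the vanilla DPO loss, not the paper's token-level variant.

```python
import math

# Sketch of the standard sentence-level DPO loss (Bradley-Terry form)
# that the paper extends with token-level reward guidance.
# All numeric log-probabilities here are illustrative assumptions.

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))"""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy prefers the chosen response more than the reference does -> low loss.
low = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-7.0, ref_logp_l=-7.0)
# Policy prefers the rejected response -> high loss.
high = dpo_loss(logp_w=-9.0, logp_l=-5.0, ref_logp_w=-7.0, ref_logp_l=-7.0)
```

Because the margin is computed over whole-sequence log-probabilities, every token deviates from the reference policy uniformly; the paper's token-level decomposition is what allows per-token deviations.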
Paperid:1837
Authors:Yeqing Qiu · Ye XUE · Akang Wang · Yiheng Wang · Qingjiang Shi · Zhiquan Luo
Abstract:
The Max-$k$-Cut problem is a fundamental combinatorial optimization challenge that generalizes the classic $\mathcal{NP}$-complete Max-Cut problem. While relaxation techniques are commonly employed to tackle Max-$k$-Cut, they often lack guarantees of equivalence between the solutions of the original problem and its relaxation. To address this issue, we introduce the Relax-Optimize-and-Sample (ROS) framework. In particular, we begin by relaxing the discrete constraints to the continuous probability simplex form. Next, we pre-train and fine-tune a graph neural network model to efficiently optimize the relaxed problem. Subsequently, we propose a sampling-based construction algorithm to map the continuous solution back to a high-quality Max-$k$-Cut solution. By integrating geometric landscape analysis with statistical theory, we establish the consistency of function values between the continuous solution and its mapped counterpart. Extensive experimental results on random regular graphs and the Gset benchmark demonstrate that the proposed ROS framework effectively scales to large instances with up to $20,000$ nodes in just a few seconds, outperforming state-of-the-art algorithms. Furthermore, ROS exhibits strong generalization capabilities across both in-distribution and out-of-distribution instances, underscoring its effectiveness for large-scale optimization tasks.
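The "Sample" step above, mapping a probability-simplex relaxation back to a discrete labeling, can be sketched with simple independent rounding; the optimization stage is omitted, and the toy graph, probabilities, and sample count are all illustrative assumptions rather than the paper's construction algorithm.

```python
import random

# Sketch of sampling-based rounding from a probability-simplex
# relaxation of Max-k-Cut. The graph and per-node distributions
# are toy assumptions; a real pipeline would obtain the
# probabilities from the optimized relaxation.

def cut_value(edges, assign):
    """Number of edges whose endpoints receive different labels."""
    return sum(1 for u, v in edges if assign[u] != assign[v])

def sample_best_cut(edges, probs, k, n_samples=200, seed=0):
    """Draw labelings from per-node distributions; keep the best cut."""
    rng = random.Random(seed)
    best, best_val = None, -1
    for _ in range(n_samples):
        assign = [rng.choices(range(k), weights=p)[0] for p in probs]
        val = cut_value(edges, assign)
        if val > best_val:
            best, best_val = assign, val
    return best, best_val

# Triangle graph, k = 3: the relaxed solution concentrates each
# node on a different label, so the optimal cut (all 3 edges) is found.
edges = [(0, 1), (1, 2), (0, 2)]
probs = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
assign, val = sample_best_cut(edges, probs, k=3)
```

Repeated sampling makes the rounding robust: even though each draw is random, the best of many draws concentrates near the discrete optimum when the relaxation is informative.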
Paperid:1838
Authors:Zhang Jiasheng · Delvin Zhang · Shuang Liang · Zhengpin Li · ZHITAO YING · Jie Shao
Abstract:
Protein language models often struggle to capture biological functions due to their lack of factual knowledge (e.g., gene descriptions). Existing solutions leverage protein knowledge graphs (PKGs) as auxiliary pretraining objectives, but lack explicit integration of task-oriented knowledge, making them suffer from limited knowledge exploitation and catastrophic forgetting. The root cause is that they fail to align PKGs with task-specific data, forcing their knowledge modeling to adapt to the knowledge-isolated nature of downstream tasks. In this paper, we propose Knowledge-aware retrieval augmented protein language model (Kara), achieving the first task-oriented and explicit integration of PKGs and protein language models. With a knowledge retriever learning to predict linkages between PKG and task proteins, Kara unifies the knowledge integration of the pre-training and fine-tuning stages with a structure-based regularization, mitigating catastrophic forgetting. To ensure task-oriented integration, Kara uses contextualized virtual tokens to extract graph context as task-specific knowledge for new proteins. Experiments show that Kara outperforms existing knowledge-enhanced models in 6 representative tasks, achieving on average 5.1% improvements.
Paperid:1839
Authors:Sushant Agarwal · Amit Jayant Deshpande · Rajmohan Rajaraman · Ravi Sundaram
Abstract:
Previous work in fair machine learning has characterised the Fair Bayes Optimal Classifier (BOC) on a given distribution for both deterministic and randomized classifiers. We study the robustness of the Fair BOC to adversarial noise in the data distribution. A result of Kearns & Li (1988) implies that the accuracy of the deterministic BOC without any fairness constraints is robust (Lipschitz) to malicious noise in the data distribution. We demonstrate that their robustness guarantee breaks down when we add fairness constraints. Hence, we consider the randomized Fair BOC, and our central result is that its accuracy is robust to malicious noise in the data distribution. Our robustness result applies to various fairness constraints: Demographic Parity, Equal Opportunity, and Predictive Equality. Beyond robustness, we demonstrate that randomization leads to better accuracy and efficiency. We show that the randomized Fair BOC is nearly deterministic, and gives randomized predictions on at most one data point, thus enjoying numerous benefits of randomness while using very little of it.
Paperid:1840
Authors:Suho Shin · Chenghao Yang · Haifeng Xu · MohammadTaghi Hajiaghayi
Abstract:
We introduce the tokenized linear bandit (TLB) and multi-armed bandit (TMAB), variants of linear and stochastic multi-armed bandit problems inspired by LLM decoding and alignment. In these problems, at each round $t \in [T]$, a user submits a query (context), and the decision maker (DM) sequentially selects a token irrevocably from a token set. Once the sequence is complete, the DM observes a random utility from the user, whose expectation is represented by a sequence function mapping the chosen token sequence to a nonnegative real value that depends on the query. In both problems, we first show that learning is impossible without any structure on the sequence function. We introduce a natural assumption, diminishing distance with more commons (DDMC), and propose algorithms with regret $\tilde{O}(L\sqrt{T})$ and $\tilde{O}(L\sqrt{T^{2/3}})$ for TLB and TMAB, respectively. As a side product, we obtain the optimality of greedy decoding for LLMs under DDMC, which justifies the unreasonable effectiveness of greedy decoding in several tasks. This also has an immediate application to decoding-time LLM alignment, when the misaligned utility can be represented as the sum of the frozen LLM's utility and a linearly realizable latent function. Though our primary focus in this paper is the theory of TLB and TMAB, we also validate our DDMC assumption using the Llama-3-8B-Instruct model on the TruthfulQA and HH-RLHF datasets.
Paperid:1841
Authors:Andreas Kontogiannis · Konstantinos Papathanasiou · Yi Shen · Giorgos Stamou · Michael Zavlanos · George Vouros
Abstract:
Learning to cooperate in distributed partially observable environments with no communication abilities poses significant challenges for multi-agent deep reinforcement learning (MARL). This paper addresses key concerns in this domain, focusing on inferring state representations from individual agent observations and leveraging these representations to enhance agents' exploration and collaborative task execution policies. To this end, we propose a novel state modelling framework for cooperative MARL, where agents infer meaningful belief representations of the non-observable state, with respect to optimizing their own policies, while filtering redundant and less informative joint state information. Building upon this framework, we propose the MARL SMPE$^2$ algorithm. In SMPE$^2$, agents enhance their own policy's discriminative abilities under partial observability, explicitly by incorporating their beliefs into the policy network, and implicitly by adopting adversarial exploration policies that encourage agents to discover novel, high-value states while improving the discriminative abilities of others. Experimentally, we show that SMPE$^2$ outperforms a plethora of state-of-the-art MARL algorithms in complex fully cooperative tasks from the MPE, LBF, and RWARE benchmarks.
Paperid:1842
Authors:Haoyuan Cai · Zhenghao Peng · Bolei Zhou
Abstract:
Interactive Imitation Learning (IIL) allows agents to acquire desired behaviors through human interventions, but current methods impose high cognitive demands on human supervisors. We propose the Adaptive Intervention Mechanism (AIM), a novel robot-gated IIL algorithm that learns an adaptive criterion for requesting human demonstrations. AIM utilizes a proxy Q-function to mimic the human's intervention rule and adjusts intervention requests based on the alignment between agent and human actions. Our expert-in-the-loop experiments reveal that AIM significantly reduces expert monitoring efforts in both continuous and discrete control tasks. Furthermore, AIM effectively targets safety-critical states for expert assistance, facilitating the acquisition of expert intervention strategies with reduced expert data and environmental interactions.
Paperid:1843
Authors:Antoine Gonon · Nicolas Brisebarre · Elisa Riccietti · Rémi Gribonval
Abstract:
Lipschitz bounds on neural network parameterizations are important to establish generalization, quantization or pruning guarantees, as they control the robustness of the network with respect to parameter changes. Yet, there are few Lipschitz bounds with respect to parameters in the literature, and existing ones only apply to simple feedforward architectures, while also failing to capture the intrinsic rescaling symmetries of ReLU networks. This paper proves a new Lipschitz bound in terms of the so-called path-metrics of the parameters. Since this bound is intrinsically invariant with respect to the rescaling symmetries of the networks, it sharpens previously known Lipschitz bounds. It is also, to the best of our knowledge, the first bound of its kind that is broadly applicable to modern networks such as ResNets, VGGs, U-nets, and many more.
Paperid:1844
Authors:Xianghua Zeng · Hang Su · Zhengyi Wang · Zhiyuan LIN
Abstract:
Offline multi-agent reinforcement learning (MARL) struggles to estimate out-of-distribution states and actions due to the absence of real-time environmental feedback. While diffusion models show promise in addressing these challenges, their application primarily focuses on independently diffusing the historical trajectories of individual agents, neglecting crucial multi-agent coordination dynamics and reducing policy robustness in dynamic environments. In this paper, we propose MCGD, a novel Multi-agent Coordination framework based on Graph Diffusion models to improve the effectiveness and robustness of collaborative policies. Specifically, we begin by constructing a sparse coordination graph that includes continuous node attributes and discrete edge attributes to effectively identify the underlying dynamics of multi-agent interactions. Next, we derive transition probabilities between edge categories and present adaptive categorical diffusion to capture the structural diversity of multi-agent coordination. Leveraging this coordination structure, we define neighbor-dependent forward noise and develop anisotropic diffusion to enhance the action diversity of each agent. Extensive experiments across various multi-agent environments demonstrate that MCGD significantly outperforms existing state-of-the-art baselines in coordination performance and policy robustness in dynamic environments.
Paperid:1845
Authors:Jing Ma · Hanlin Li · Xiang
Abstract:
Test-Time Adaptation (TTA) enables deep neural networks to adapt to arbitrary distributions during inference. Existing TTA algorithms generally tend to select benign samples that help achieve robust online prediction and stable self-training. Although malicious samples that would undermine the model's optimization should be filtered out, this also leads to a waste of test data. To alleviate this issue, we focus on how to make full use of the malicious test samples for TTA by transforming them into benign ones, and propose a plug-and-play method, PTTA. The core of our solution lies in the purification strategy, which retrieves benign samples having opposite effects on the objective function to perform Mixup with malicious samples, based on a saliency indicator for encoding benign and malicious data. This strategy results in effective utilization of the information in malicious samples and an improvement in the model's online test accuracy. In this way, we can directly apply the purification loss to existing TTA algorithms without the need to carefully adjust the sample selection threshold. Extensive experiments on four types of TTA tasks, spanning classification, segmentation, and adversarial defense, demonstrate the effectiveness of our method.
Paperid:1846
Authors:Parshin Shojaee · Ngoc Hieu Nguyen · Kazem Meidani · Amir Barati Farimani · Khoa Doan · Chandan Reddy
Abstract:
Scientific equation discovery is a fundamental task in the history of scientific progress, enabling the derivation of laws governing natural phenomena. Recently, Large Language Models (LLMs) have gained interest for this task due to their potential to leverage embedded scientific knowledge for hypothesis generation. However, evaluating the true discovery capabilities of these methods remains challenging, as existing benchmarks often rely on common equations that are susceptible to memorization by LLMs, leading to inflated performance metrics that do not reflect actual discovery. In this paper, we introduce LLM-SRBench, a comprehensive benchmark with 239 challenging problems across four scientific domains specifically designed to evaluate LLM-based scientific equation discovery methods while preventing trivial memorization. Our benchmark comprises two main categories: LSR-Transform, which transforms common physical models into less common mathematical representations to test reasoning beyond memorization, and LSR-Synth, which introduces synthetic, discovery-driven problems requiring data-driven reasoning. Through extensive evaluation of several state-of-the-art methods on LLM-SRBench, using both open and closed LLMs, we find that the best-performing system so far achieves only 31.5% symbolic accuracy. These findings highlight the challenges of scientific equation discovery, positioning LLM-SRBench as a valuable resource for future research.
Paperid:1847
Authors:Jiecheng Lu · Shihao Yang
Abstract:
Autoregressive attention-based time series forecasting (TSF) has drawn increasing interest, with mechanisms like linear attention often outperforming vanilla attention. However, deeper Transformer architectures frequently misalign with autoregressive objectives, obscuring the underlying VAR structure embedded within linear attention and hindering their ability to capture the data generative processes in TSF. In this work, we first show that a single linear attention layer can be interpreted as a dynamic vector autoregressive (VAR) structure. We then explain that existing multi-layer Transformers have structural mismatches with the autoregressive forecasting objective, which impair interpretability and generalization ability. To address this, we show that by rearranging the MLP, attention, and input-output flow, multi-layer linear attention can also be aligned as a VAR model. Then, we propose Structural Aligned Mixture of VAR (SAMoVAR), a linear Transformer variant that integrates interpretable dynamic VAR weights for multivariate TSF. By aligning the Transformer architecture with autoregressive objectives, SAMoVAR delivers improved performance, interpretability, and computational efficiency, compared to SOTA TSF models.
Paperid:1848
Authors:Thanh Tran · Viet Hoang Tran · Thanh Chu · Trang Pham · Laurent Ghaoui · Tam Le · Tan Nguyen
Abstract:
Tree-Sliced methods have recently emerged as an alternative to the traditional Sliced Wasserstein (SW) distance, replacing one-dimensional lines with tree-based metric spaces and incorporating a splitting mechanism for projecting measures. This approach enhances the ability to capture the topological structures of integration domains in Sliced Optimal Transport while maintaining low computational costs. Building on this foundation, we propose a novel nonlinear projectional framework for the Tree-Sliced Wasserstein (TSW) distance, substituting the linear projections in earlier versions with general projections, while ensuring the injectivity of the associated Radon Transform and preserving the well-definedness of the resulting metric. By designing appropriate projections, we construct efficient metrics for measures on both Euclidean spaces and spheres. Finally, we validate our proposed metric through extensive numerical experiments for Euclidean and spherical datasets. Applications include gradient flows, self-supervised learning, and generative models, where our methods demonstrate significant improvements over recent SW and TSW variants.
Paperid:1849
Authors:Sirui Lin · Zijun Gao · Jose Blanchet · Peter Glynn
Abstract:
Causal estimands can vary significantly depending on the relationship between outcomes in treatment and control groups, leading to wide partial identification (PI) intervals that impede decision making. Incorporating covariates can substantially tighten these bounds, but requires determining the range of PI over probability models consistent with the joint distributions of observed covariates and outcomes in treatment and control groups. This problem is known to be equivalent to a conditional optimal transport (COT) optimization task, which is more challenging than standard optimal transport (OT) due to the additional conditioning constraints. In this work, we study a tight relaxation of COT that effectively reduces it to standard OT, leveraging its well-established computational and theoretical foundations. Our relaxation incorporates covariate information and ensures narrower PI intervals for any value of the penalty parameter, while becoming asymptotically exact as the penalty increases to infinity. This approach preserves the benefits of covariate adjustment in PI and results in a data-driven estimator for the PI set that is easy to implement using existing OT packages. We analyze the convergence rate of our estimator and demonstrate the effectiveness of our approach through extensive simulations, highlighting its practical use and superior performance compared to existing methods.
Paperid:1850
Authors:Shu-wen Yang · Byeonggeun Kim · Kuan Po Huang · Qingming Tang · Huy Phan · Bo-Ru Lu · Harshavardhan Sundar · Shalini Ghosh · Hung-yi Lee · Chieh-Chi Kao · Chao Wang
Abstract:
Autoregressive next-token prediction with the Transformer decoder has become a de facto standard in large language models (LLMs), achieving remarkable success in Natural Language Processing (NLP) at scale. Extending this paradigm to audio poses unique challenges due to its inherently continuous nature. We research audio generation with a causal language model (LM) without discrete tokens. We leverage token-wise diffusion to model the continuous distribution of the next continuous-valued token. Our approach delivers significant improvements over the previous discrete solution, AudioGen, achieving 20% and 40% relative gains on AudioCaps in Frechet Audio Distance (FAD) and Kullback-Leibler (KL) divergence, respectively. Additionally, we propose a novel masked next-token prediction task that incorporates masked prediction into the causal LM framework. On AudioCaps, the innovation yields 41% and 33% relative FAD improvements over AudioGen Base (285M) and AudioGen Large (1B) models, respectively, and is on par with the state-of-the-art (SOTA) diffusion models. Furthermore, we achieve these results with significantly fewer parameters: 193M for our Base and 462M for our Large models.
Paperid:1851
Authors:Yuanchao Xu · Kaidi Shao · Nikos Logothetis · Zhongwei Shen
Abstract:
Analyzing long-term behaviors in high-dimensional nonlinear dynamical systems remains challenging. The Koopman operator framework provides a powerful global linearization approach, though existing methods for approximating its spectral components often suffer from theoretical limitations and rely on predefined dictionaries. While Residual Dynamic Mode Decomposition (ResDMD) introduced the spectral residual to assess the accuracy of Koopman operator approximation, it only filters precomputed spectra, which prevents it from fully discovering the Koopman operator's complete spectral information (a limitation sometimes referred to as the 'spectral inclusion' problem). We introduce ResKoopNet (Residual-based Koopman-learning Network), a novel method that addresses this limitation by explicitly minimizing the spectral residual to compute Koopman eigenpairs, which can identify a more precise and complete spectrum of the Koopman operator. This approach provides theoretical guarantees while maintaining computational adaptability through a neural network implementation. Experiments on physical and biological systems demonstrate ResKoopNet's superior accuracy in spectral approximation compared to existing methods, particularly for high-dimensional systems with continuous spectra, making it an effective tool for analyzing complex dynamical systems.
Paperid:1852
Authors:Ravi Ghadia · Avinash Kumar · Gaurav Jain · Prashant J. Nair · Poulami Das
Abstract:
Autoregressive Transformers rely on key-value (KV) caching to accelerate inference. However, the linear growth of the KV cache with context length leads to excessive memory consumption and bandwidth constraints. This bottleneck is particularly problematic in real-time applications, such as chatbots and interactive assistants, where low latency and memory efficiency are critical. Existing methods truncate distant tokens or compress states in a lossy manner, sacrificing accuracy by discarding vital context or introducing bias. We propose ${MorphKV}$, an inference-time technique that maintains a constant-sized KV cache while preserving model accuracy. ${MorphKV}$ balances long-range dependencies and local coherence. It eliminates early-token bias while retaining high-fidelity context by dynamically prioritizing tokens through correlation-aware selection. Unlike heuristic retention (e.g., sliding windows) or lossy compression, ${MorphKV}$ iteratively refines the KV cache via lightweight updates guided by attention patterns of recent tokens. This is crucial for tasks like content creation and code generation. Our studies using Llama3.1-8B-Instruct show $\textbf{86.5}$% memory savings and $\textbf{13.5}$% higher accuracy than prior methods, enabling real-world efficient deployment.
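A constant-sized, score-guided KV cache of the general kind described above can be sketched as follows; this is a generic attention-guided retention policy, not MorphKV itself, and the window size, token scores, and eviction rule are all illustrative assumptions.

```python
# Toy sketch of a constant-size KV cache with score-based eviction
# (a generic attention-guided retention policy, not the MorphKV
# algorithm; window size, scores, and the eviction rule are
# illustrative assumptions).

def update_cache(cache, token, score, recent_window=2, max_size=4):
    """Keep the newest tokens plus the highest-scoring older ones."""
    cache.append((token, score))
    if len(cache) <= max_size:
        return cache
    recent = cache[-recent_window:]          # always retained
    older = cache[:-recent_window]
    # Evict the lowest-scoring token outside the recency window.
    older.remove(min(older, key=lambda ts: ts[1]))
    return older + recent

cache = []
stream = [("a", 0.9), ("b", 0.1), ("c", 0.8),
          ("d", 0.2), ("e", 0.5), ("f", 0.3)]
for tok, s in stream:
    cache = update_cache(cache, tok, s)
# Cache stays at 4 entries: high-scoring "a" and "c" survive,
# low-scoring "b" and "d" are evicted.
```

Unlike a pure sliding window, this keeps memory constant while still retaining distant tokens that the scores mark as important, which is the trade-off the abstract contrasts against heuristic retention.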
Paperid:1853
Authors:Viet Hoang Tran · Trang Pham · Tho Tran Huu · Minh-Khoi Nguyen-Nhat · Thanh Chu · Tam Le · Tan Nguyen
Abstract:
Many variants of Optimal Transport (OT) have been developed to address its heavy computation. Among them, notably, Sliced Wasserstein (SW) is widely used in applications, projecting the OT problem onto one-dimensional lines and leveraging the closed-form expression of the univariate OT to reduce the computational burden. However, projecting measures onto low-dimensional spaces can lead to a loss of topological information. To mitigate this issue, in this work, we propose to replace one-dimensional lines with a more intricate structure, called \emph{tree systems}. This structure is metrizable by a tree metric, which yields a closed-form expression for OT problems on tree systems. We provide an extensive theoretical analysis to formally define tree systems with their topological properties, introduce the concept of splitting maps, which operate as the projection mechanism onto these structures, and finally propose a novel variant of Radon transform for tree systems and verify its injectivity. This framework leads to an efficient metric between measures, termed Tree-Sliced Wasserstein distance on Systems of Lines (TSW-SL). By conducting a variety of experiments on gradient flows, image style transfer, and generative models, we illustrate that our proposed approach performs favorably compared to SW and its variants.
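The vanilla SW baseline that tree-sliced variants build on can be sketched directly: project both point sets onto random lines and apply the closed-form one-dimensional Wasserstein distance (sorted matching). The toy 2-D point sets and the number of projections below are illustrative assumptions.

```python
import math
import random

# Minimal sketch of the vanilla Sliced Wasserstein distance between
# two equal-size empirical measures in 2-D. Tree-sliced variants
# replace the random lines below with tree systems.

def sliced_wasserstein(X, Y, n_proj=100, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        d = (math.cos(theta), math.sin(theta))
        # 1-D projections of each measure onto direction d.
        px = sorted(x[0] * d[0] + x[1] * d[1] for x in X)
        py = sorted(y[0] * d[0] + y[1] * d[1] for y in Y)
        # Closed-form W1 between equal-size empirical 1-D measures:
        # match sorted samples and average the absolute differences.
        total += sum(abs(a - b) for a, b in zip(px, py)) / len(px)
    return total / n_proj

X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
shifted = [(x + 3.0, y) for x, y in X]
# Distance of a measure to itself is 0; to a shifted copy it is positive.
```

The sort inside each projection is exactly the closed-form univariate OT the abstract refers to; it is what keeps SW cheap relative to solving a full transport problem.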
Paperid:1854
Authors:Kenneth Li · Yida Chen · Fernanda Viégas · Martin Wattenberg
Abstract:
In large language model (LLM) pre-training, data quality is believed to determine model quality. In this paper, we challenge the notion of "quality" in the context of post-training. Specifically, we explore the possibility that pre-training on more toxic data can lead to better control in post-training, ultimately decreasing a model's output toxicity. First, we use a toy experiment to study how data composition affects the geometry of features in the representation space. Next, through controlled experiments with Olmo-1B models trained on varying ratios of clean and toxic data, we find that the concept of toxicity enjoys a less entangled linear representation as the proportion of toxic data increases. Furthermore, we show that although toxic data increases the generational toxicity of the base model, it also makes the toxicity easier to remove. Evaluations on Toxigen and Real Toxicity Prompts demonstrate that models trained on toxic data achieve a better trade-off between reducing generational toxicity and preserving general capabilities when detoxifying techniques such as inference-time intervention (ITI) are applied. Our findings suggest that, with post-training taken into account, bad data may lead to good models.
Paperid:1855
Authors:Manan Tayal · Aditya Singh · Shishir Nadubettu Yadukumar · Somil Bansal
Abstract:
As autonomous systems become more ubiquitous in daily life, ensuring high performance with guaranteed safety is crucial. However, safety and performance could be competing objectives, which makes their co-optimization difficult. Learning-based methods, such as Constrained Reinforcement Learning (CRL), achieve strong performance but lack formal safety guarantees due to safety being enforced as soft constraints, limiting their use in safety-critical settings. Conversely, formal methods such as Hamilton-Jacobi (HJ) Reachability Analysis and Control Barrier Functions (CBFs) provide rigorous safety assurances but often neglect performance, resulting in overly conservative controllers. To bridge this gap, we formulate the co-optimization of safety and performance as a state-constrained optimal control problem, where performance objectives are encoded via a cost function and safety requirements are imposed as state constraints. We demonstrate that the resultant value function satisfies a Hamilton-Jacobi-Bellman (HJB) equation, which we approximate efficiently using a novel physics-informed machine learning framework. In addition, we introduce a conformal prediction-based verification strategy to quantify the learning errors, recovering a high-confidence safety value function, along with a probabilistic error bound on performance degradation. Through several case studies, we demonstrate the efficacy of the proposed framework in enabling scalable learning of safe and performant controllers for complex, high-dimensional autonomous systems.
Paperid:1856
Authors:Xu Liu · Juncheng Liu · Gerald Woo · Taha Aksu · Yuxuan Liang · Roger Zimmermann · Chenghao Liu · Junnan Li · Silvio Savarese · Caiming Xiong · Doyen Sahoo
Abstract:
Achieving effective unified pretraining on large time series corpora remains an open challenge in developing time series foundation models. Existing foundation models, such as Moirai, introduce multiple projection layers for time series of different frequencies to account for high data heterogeneity. We identify major drawbacks to this human-imposed frequency-level model specialization. First, frequency is not a reliable indicator for grouping pretraining data. For example, time series with different frequencies can display similar patterns, while those with the same frequency may exhibit varied patterns. Furthermore, time series can display varied distributions even within a short window. Frequency-level specialization overlooks the diversity at this granularity. To address these issues, this paper introduces Moirai-MoE, excluding human-defined data groupings while delegating the modeling of diverse time series patterns to the sparse mixture of experts (MoE) within Transformers. With this design, Moirai-MoE eliminates reliance on human-defined heuristics and enables automatic token-level specialization. Extensive evaluations on 39 datasets demonstrate the superiority of Moirai-MoE over state-of-the-art baselines. This study also conducts comprehensive model analyses to explore the inner workings of time series MoE foundation models and provides insights for future research.
Paperid:1857
Authors:Simone Drago · Marco Mussi · Alberto Maria Metelli
Abstract:
The success of sequential decision-making approaches, such as *reinforcement learning* (RL), is closely tied to the availability of a reward feedback. However, designing a reward function that encodes the desired objective is a challenging task. In this work, we address a more realistic scenario: sequential decision-making with preference feedback provided, for instance, by a human expert. We aim to build a theoretical basis linking *preferences*, *utilities* (i.e., non-Markovian rewards), and Markovian *rewards*, and we study the connections between them. First, we model preference feedback using a partial (pre)order over trajectories, enabling the presence of incomparabilities that are common when preferences are provided by humans but are surprisingly overlooked in existing works. Second, to provide a theoretical justification for a common practice, we investigate how a preference relation can be approximated by a multi-objective utility. We introduce a notion of preference-utility compatibility and analyze the computational complexity of this transformation, showing that constructing the minimum-dimensional utility is $\textsf{NP}$-hard. Third, we propose a novel concept of preference-based policy dominance that does not rely on utilities or rewards and analyze the computational complexity of assessing it. Fourth, we develop a computationally efficient algorithm to approximate a utility using (Markovian) rewards and quantify the error in terms of the suboptimality of the optimal policy induced by the approximating reward. This work aims to lay the foundation for a principled approach to sequential decision-making from preference feedback, with promising potential applications in RL from human feedback.
Paperid:1858
Authors:R Majumdar · Seyed Mahmoud Salamati · Sadegh Soudjani
Abstract:
Learning to control an unknown dynamical system with respect to high-level temporal specifications is an important problem in control theory. We present the first regret-free online algorithm for learning a controller for linear temporal logic (LTL) specifications for systems with unknown dynamics. We assume that the underlying (unknown) dynamics is modeled by a finite-state, finite-action Markov decision process (MDP). Our core technical result is a regret-free learning algorithm for infinite-horizon reach-avoid problems on MDPs. For general LTL specifications, we show that the synthesis problem can be reduced to a reach-avoid problem once the graph structure is known. Additionally, we provide an algorithm for learning the graph structure, assuming knowledge of a minimum transition probability, which operates independently of the main regret-free algorithm. Our LTL controller synthesis algorithm provides sharp bounds on how close we are to achieving optimal behavior after a finite number of learning episodes. In contrast, previous algorithms for LTL synthesis only provide asymptotic guarantees, which offer no insight into the transient performance during the learning phase.
Paperid:1859
Authors:Rashid Mushkani · Perampalli Shravan Nayak · Hugo Berard · Allison Cohen · Shin Koseki · Hadrien Bertrand
Abstract:
We introduce the Local Intersectional Visual Spaces (LIVS) dataset, a benchmark for multi-criteria alignment of text-to-image (T2I) models in inclusive urban planning. Developed through a two-year participatory process with 30 community organizations, LIVS encodes diverse spatial preferences across 634 initial concepts, consolidated into six core criteria—Accessibility, Safety, Comfort, Invitingness, Inclusivity, and Diversity—through 37,710 pairwise comparisons. Using Direct Preference Optimization (DPO) to fine-tune Stable Diffusion XL, we observed a measurable increase in alignment with community preferences, though a significant proportion of neutral ratings highlights the complexity of modeling intersectional needs. Additionally, as annotation volume increases, accuracy shifts further toward the DPO-tuned model, suggesting that larger-scale preference data enhances fine-tuning effectiveness. LIVS underscores the necessity of integrating context-specific, stakeholder-driven criteria into generative modeling and provides a resource for evaluating AI alignment methodologies across diverse socio-spatial contexts.
Paperid:1860
Authors:JINHAO LIANG · Jacob Christopher · Sven Koenig · Ferdinando Fioretto
Abstract:
Recent advances in diffusion models hold significant potential in robotics, enabling the generation of diverse and smooth trajectories directly from raw representations of the environment. Despite this promise, applying diffusion models to motion planning remains challenging due to their difficulty in enforcing critical constraints, such as collision avoidance and kinematic feasibility. These limitations become even more pronounced in Multi-Robot Motion Planning (MRMP), where multiple robots must coordinate in shared spaces. To address this challenge, this work proposes Simultaneous MRMP Diffusion (SMD), a novel approach integrating constrained optimization into the diffusion sampling process to produce collision-free, kinematically feasible trajectories. Additionally, the paper introduces a comprehensive MRMP benchmark to evaluate trajectory planning algorithms across scenarios with varying robot densities, obstacle complexities, and motion constraints. Experimental results show SMD consistently outperforms classical and learning-based motion planners, achieving higher success rates and efficiency in complex multi-robot environments.
Paperid:1861
Authors:Han Wang · Yang Xu · Wenbin Lu · Rui Song
Abstract:
Off-Policy Evaluation (OPE) aims to estimate the value of a target policy using offline data collected from potentially different policies. In real-world applications, however, logged data often suffers from missingness. While OPE has been extensively studied in the literature, a theoretical understanding of how missing data affects OPE results remains unclear. In this paper, we investigate OPE in the presence of monotone missingness and theoretically demonstrate that the value estimates remain unbiased under ignorable missingness but can be biased under nonignorable (informative) missingness. To retain the consistency of value estimation, we propose an inverse probability weighting value estimator and conduct statistical inference to quantify the uncertainty of the estimates. Through a series of numerical experiments, we empirically demonstrate that our proposed estimator yields more reliable value inference under missing data.
Paperid:1862
Authors:Seth Karten · Andy Nguyen · Chi Jin
Abstract:
We introduce \texttt{Pok\'eChamp}, a Large Language Model (LLM) powered game-theoretic agent for two-player competitive Pok\'emon battles that demonstrates the potential for online self-improvement in AI systems. Inspired by the concept of AI takeoff, \texttt{Pok\'eChamp} uses an LLM prior and high-Elo human data to model minimax search without additional training, simulating a form of self-play that could lead to rapid capability gains. \texttt{Pok\'eChamp} uses a depth-limited minimax search online where the LLM replaces three key components: 1) action sampling from the LLM guided by prompts (including a one-step lookahead feature), 2) opponent modeling via the historical likelihood of actions from our dataset to model the effect of LLM-predicted opponent actions, and 3) state value calculation for the LLM to reflect on each intrinsic state. \texttt{Pok\'eChamp} outperforms all existing LLM-based (76\%) and rule-based bots (84\%) by an enormous margin, including winning consistently (64\%) against prior human-parity work run with a frontier model, GPT-4o, while ours uses an open-source 8 billion parameter Llama 3.1 model. \texttt{Pok\'eChamp} achieves expert performance in the top 10\% of players on the online ladder against competitive human players at an Elo of 1500. Finally, we collect the largest Pok\'emon battling dataset, including 1 million+ games with 150k+ high Elo games, prepare a series of battling benchmarks based on real player data and puzzles to analyze specific battling abilities, and provide crucial updates to the local game engine, providing a foundation for future research on AI self-play and capability amplification in complex, partially observable environments. Our code is available \href{https://sites.google.com/view/pokechamp-llm}{online}.
Paperid:1863
Authors:Lukas Fluri · Leon Lang · Alessandro Abate · Patrick Forré · David Krueger · Joar Skalse
Abstract:
In reinforcement learning, specifying reward functions that capture the intended task can be very challenging. Reward learning aims to address this issue by learning the reward function. However, a learned reward model may have a low error on the data distribution, and yet subsequently produce a policy with large regret. We say that such a reward model has an error-regret mismatch. The main source of an error-regret mismatch is the distributional shift that commonly occurs during policy optimization. In this paper, we mathematically show that a sufficiently low expected test error of the reward model guarantees low worst-case regret, but that for any fixed expected test error, there exist realistic data distributions that allow for error-regret mismatch to occur. We then show that similar problems persist even when using policy regularization techniques, commonly employed in methods such as RLHF. We hope our results stimulate the theoretical and empirical study of improved methods to learn reward models, and better ways to measure their quality reliably.
Paperid:1864
Authors:Qunzhong WANG · Xiangguo Sun · Hong Cheng
Abstract:
In recent years, graph prompting has emerged as a promising research direction, enabling the learning of additional tokens or subgraphs appended to original graphs without requiring retraining of pre-trained graph models across various applications. This novel paradigm, shifting from the traditional "pre-training and fine-tuning" to "pre-training and prompting," has shown significant empirical success in simulating graph data operations, with applications ranging from recommendation systems to biological networks and graph transfer. However, despite its potential, the theoretical underpinnings of graph prompting remain underexplored, raising critical questions about its fundamental effectiveness. The lack of rigorous theoretical proof of why and how much it works is more like a "dark cloud" over the graph prompting area for deeper research. To fill this gap, this paper introduces a theoretical framework that rigorously analyzes graph prompting from a data operation perspective. Our contributions are threefold: First, we provide a formal guarantee theorem, demonstrating graph prompts' capacity to approximate graph transformation operators, effectively linking upstream and downstream tasks. Second, we derive upper bounds on the error of these data operations for a single graph and extend this discussion to batches of graphs, which are common in graph model training. Third, we analyze the distribution of data operation errors, extending our theoretical findings from linear graph models (e.g., GCN) to non-linear graph models (e.g., GAT). Extensive experiments support our theoretical results and confirm the practical implications of these guarantees.
Paperid:1865
Authors:Atsushi Nitanda · Anzelle Lee · Damian Kai · Mizuki Sakaguchi · Taiji Suzuki
Abstract:
Mean-field Langevin dynamics (MFLD) is an optimization method derived by taking the mean-field limit of noisy gradient descent for two-layer neural networks in the mean-field regime. Recently, the propagation of chaos (PoC) for MFLD has gained attention as it provides a quantitative characterization of the optimization complexity in terms of the number of particles and iterations. A remarkable progress by Chen et al. (2022) showed that the approximation error due to finite particles remains uniform in time and diminishes as the number of particles increases. In this paper, by refining the defective log-Sobolev inequality---a key result from that earlier work---under the neural network training setting, we establish an improved PoC result for MFLD, which removes the exponential dependence on the regularization coefficient from the particle approximation term of the optimization complexity. As an application, we propose a PoC-based model ensemble strategy with theoretical guarantees.
Paperid:1866
Authors:Chen Wei · Chi Zhang · Jiachen Zou · Haotian Deng · Dietmar Heinke · Quanying Liu
Abstract:
Human decision-making in cognitive tasks and daily life exhibits considerable variability, shaped by factors such as task difficulty, individual preferences, and personal experiences. Understanding this variability across individuals is essential for uncovering the perceptual and decision-making mechanisms that humans rely on when faced with uncertainty and ambiguity. We present a computational framework that combines perceptual boundary sampling in ANNs and human behavioral experiments to systematically investigate this phenomenon. Our perceptual boundary sampling algorithm generates stimuli along ANN decision boundaries that intrinsically induce significant perceptual variability. The efficacy of these stimuli is empirically validated through large-scale behavioral experiments involving 246 participants across 116,715 trials, culminating in the varMNIST dataset containing 19,943 systematically annotated images. Through personalized model alignment and adversarial generation, we establish a reliable method for simultaneously predicting and manipulating the divergent perceptual decisions of pairs of participants. This work bridges the gap between computational models and human individual difference research, providing new tools for personalized perception analysis.
Paperid:1867
Authors:Xize Liang · Lin Yang · Jie Wang · Yiyang Lu · Runyu Wu · Hanzhu Chen · Jianye Hao
Abstract:
The multi-domain fine-tuning of large language models (LLMs) confronts a notorious trade-off among abilities across domains. Existing studies attribute this trade-off to the conflicts between samples rooted in inherent semantics. Recent approaches attempt to mitigate these conflicts through empirical investigation or heuristic strategies. However, without a fundamental understanding of interactions between samples, they yield only marginal improvements, while incurring substantial trial-and-error costs. To address this challenge, we move beyond empirical studies by modeling interactions between samples as their influence on each other's loss, estimated using gradients. Intriguingly, we find that these interactions evolve throughout training rather than being purely determined by inherent semantics. Building on this insight, we propose EVolving Interaction-guided Curriculum (EVIC), which iteratively selects samples that positively influence the overall dataset for training. By dynamically adapting the training curriculum to prioritize samples that contribute the most to the model training, EVIC effectively mitigates conflicts and improves the sample efficiency. Extensive experiments on a mixed dataset covering coding, math, and general tasks with several model architectures show that EVIC significantly outperforms all baselines across diverse capabilities.
Paperid:1868
Authors:Lucas Rosenblatt · Yuliia Lut · Ethan Turok · Marco Medina · Rachel Cummings
Abstract:
Imbalanced learning occurs in classification settings where the distribution of class labels is highly skewed in the training data, such as when predicting rare diseases or in fraud detection. This class imbalance presents a significant algorithmic challenge, which can be further exacerbated when privacy-preserving techniques such as differential privacy are applied to protect sensitive training data. Our work formalizes these challenges and provides a number of algorithmic solutions. We consider DP variants of pre-processing methods that privately augment the original dataset to reduce the class imbalance, alongside DP variants of in-processing techniques, which adjust the learning algorithm to account for the imbalance. For each method, we either adapt an existing imbalanced learning technique to the private setting or demonstrate its incompatibility with differential privacy. Finally, we empirically evaluate these privacy-preserving imbalanced learning methods under various data and distributional settings. We find that private synthetic data methods perform well as a data pre-processing step, while class-weighted ERMs are an alternative in higher-dimensional settings where private synthetic data suffers from the curse of dimensionality.
Paperid:1869
Authors:Tuan Truong · Chau Nguyen · Huy Nguyen · Minh Le · Trung Le · Nhat Ho
Abstract:
Low-rank Adaptation (LoRA) has emerged as a powerful and efficient method for fine-tuning large-scale foundation models. Despite its popularity, the theoretical understanding of LoRA has remained underexplored. In this paper, we present a theoretical analysis of LoRA by examining its connection to the Mixture of Experts models. Under this framework, we show that a simple technique, reparameterizing LoRA matrices, can notably accelerate the low-rank matrix estimation process. In particular, we prove that reparameterization can reduce the data needed to achieve a desired estimation error from an exponential to a polynomial scale. Motivated by this insight, we propose Reparameterized Low-Rank Adaptation (RepLoRA), incorporating a lightweight MLP to reparameterize the LoRA matrices. Extensive experiments across multiple domains demonstrate that RepLoRA consistently outperforms vanilla LoRA. With limited data, RepLoRA surpasses LoRA by a substantial margin of up to 40.0% and achieves LoRA's performance using only 30.0% of the training data, highlighting the theoretical and empirical robustness of our PEFT method.
Paperid:1870
Authors:Woohyeon Park · Woojin Kim · Jaeik Kim · Jaeyoung Do
Abstract:
Despite significant advancements in Large Vision-Language Models (LVLMs), the performance of existing LVLMs remains hindered by object hallucination, a critical challenge to achieving accurate visual understanding. To address this issue, we propose SECOND: Selective and Contrastive Decoding, a novel approach that enables LVLMs to effectively leverage multi-scale visual information in an object-centric manner, closely aligning with human visual perception. SECOND progressively selects and integrates multi-scale visual information, facilitating a more precise interpretation of images. By iteratively contrasting this visual information, SECOND significantly reduces perceptual hallucinations and outperforms existing methods across a wide range of benchmarks. Our theoretical analysis and experiments highlight the largely unexplored potential of multi-scale application in LVLMs, showing that prioritizing and contrasting across scales outperforms existing methods.
Paperid:1871
Authors:Thibaut Germain · Chrysoula Kosma · Laurent Oudre
Abstract:
Automatically extracting robust representations from large and complex time series data is becoming imperative for several real-world applications. Unfortunately, the potential of common neural network architectures in capturing invariant properties of time series remains relatively underexplored. For instance, convolutional layers often fail to capture underlying patterns in time series inputs that encompass strong deformations, such as trends. Indeed, invariances to some deformations may be critical for solving complex time series tasks, such as classification, while guaranteeing good generalization performance. To address these challenges, we mathematically formulate and technically design efficient and hard-coded invariant convolutions for specific group actions applicable to the case of time series. We construct these convolutions by considering specific sets of deformations commonly observed in time series, including scaling, offset shift, and trend. We further combine the proposed invariant convolutions with standard convolutions in single embedding layers, and we showcase the layer capacity to capture complex invariant time series properties in several scenarios.
Paperid:1872
Authors:Yuntao Wang · Yuxuan Li · Qingyuan Yang · Hu Ding
Abstract:
Wasserstein Barycenter (WB) is a fundamental geometric optimization problem in machine learning, whose objective is to find a representative probability measure that minimizes the sum of Wasserstein distances to given distributions. WB has a number of applications in various areas. However, WB may lead to unfair outcomes towards underrepresented groups in some applications (e.g., a "minority" distribution may be far away from the obtained WB under Wasserstein distance). To address this issue, we propose an alternative objective called "Wasserstein Ball Center (WBC)". Specifically, WBC is a distribution that encompasses all input distributions within the minimum Wasserstein distance, which can be formulated as a "min-max" optimization problem. We show that the WBC problem with fixed support is equivalent to solving a large-scale linear programming (LP) instance, which is quite different from the previously studied LP model for WB. By incorporating some novel observations on the induced normal equation, we propose an efficient algorithm that accelerates the interior point method by $O(\min(N^2m, Nm^2, m^4))$ times ("$N$" is the number of distributions and "$m$" is the support size). Finally, we conduct a set of experiments on both synthetic and real-world datasets, demonstrating the computational efficiency of our algorithm, and showing its ability to provide more fairness for input distributions.
Paperid:1873
Authors:Chenhao Sun · Yuhao Mao · Mark Müller · Martin Vechev
Abstract:
Randomized smoothing is a popular approach for providing certified robustness guarantees against adversarial attacks, and has become an active area of research. Over the past years, the average certified radius (ACR) has emerged as the most important metric for comparing methods and tracking progress in the field. However, in this work, for the first time we show that ACR is a poor metric for evaluating robustness guarantees provided by randomized smoothing. We theoretically prove not only that a trivial classifier can have arbitrarily large ACR, but also that ACR is much more sensitive to improvements on easy samples than on hard ones. Empirically, we confirm that existing training strategies, though improving ACR with different approaches, reduce the model's robustness on hard samples consistently. To strengthen our conclusion, we propose strategies, including explicitly discarding hard samples, reweighting the dataset with approximate certified radius, and extreme optimization for easy samples, to achieve state-of-the-art ACR, without training for robustness on the full data distribution. Overall, our results suggest that ACR has introduced a strong undesired bias to the field, and its application should be discontinued when evaluating randomized smoothing.
Paperid:1874
Authors:Yao Zhu · Zhenyuan LI · yangyang wei · Shouling Ji
Abstract:
Provenance-based intrusion detection systems (PIDSes) have emerged as a vital defense against complex and covert cyberattacks. Provenance graphs, notable for their scale, semantic depth, and dense connectivity, pose significant challenges for storage, representation, and analysis. These factors limit the effectiveness of machine learning models like Graph Neural Networks (GNNs) in processing and learning from these graphs. This paper introduces a learning-based method to efficiently embed and analyze large-scale provenance graphs. Our approach integrates dynamic graph processing with adaptive encoding, enabling compact embeddings that tackle out-of-vocabulary (OOV) elements and normality shifts in dynamic environments. We incorporate this baseline into a tag-propagation framework for real-time detection, demonstrating our method's accuracy and adaptability in anomaly path mining, and advancing the state-of-the-art in provenance graph handling and anomaly detection.
Paperid:1875
Authors:Xiaolong Xu · Yibo Zhou · Haolong Xiang · Xiaoyong Li · Xuyun Zhang · Lianyong Qi · Wanchun Dou
Abstract:
Document-level relation extraction (RE) aims to extract comprehensive correlations between entities and relations from documents. Most existing works conduct transfer learning on pre-trained language models (PLMs), which allows for richer contextual representation to improve the performance. However, such PLMs-based methods struggle to incorporate structural knowledge, such as entity-entity interactions. Moreover, current works struggle to infer the implicit relations between entities across different sentences, which results in poor prediction. To deal with the above issues, we propose a novel and effective framework, named DocKS-RAG, which introduces extra structural knowledge and semantic information to further enhance the performance of document-level RE. Specifically, we construct a Document-level Knowledge Graph from the observable documentation data to better capture the structural information between entities and relations. Then, a Sentence-level Semantic Retrieval-Augmented Generation mechanism is designed to consider the similarity in different sentences by retrieving the relevant contextual semantic information. Furthermore, we present a hybrid-prompt tuning method on large language models (LLMs) for specific document-level RE tasks. Finally, extensive experiments conducted on two benchmark datasets demonstrate that our proposed framework enhances all the metrics compared with state-of-the-art methods.
Paperid:1876
Authors:Chenghao Fan · zhenyi lu · Sichen Liu · Xiaoye Qu · Wei Wei · Chengfeng Gu · Yu Cheng
Abstract:
While Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning for Large Language Models (LLMs), its performance often falls short of Full Fine-Tuning (Full FT). Current methods optimize LoRA by initializing with static singular-value decomposition (SVD) subsets, leading to suboptimal leveraging of pre-trained knowledge. {Another path for improving LoRA is incorporating a Mixture-of-Experts (MoE) architecture.}{However, weight misalignment and complex gradient dynamics make it challenging to adopt SVD prior to the LoRA MoE architecture.} To mitigate these issues, we propose \underline{G}reat L\underline{o}R\underline{A} Mixture-of-Exper\underline{t} (GOAT), a framework that (1) adaptively integrates relevant priors using an SVD-structured MoE, and (2) aligns optimization with full fine-tuned MoE by deriving a theoretical scaling factor. We demonstrate that proper scaling, without modifying the architecture or training algorithms, boosts LoRA MoE's efficiency and performance. Experiments across 25 datasets, including natural language understanding, commonsense reasoning, image classification, and natural language generation, demonstrate GOAT's state-of-the-art performance, closing the gap with Full FT. \footnote{Our code is available at: \url{https://anonymous.4open.science/r/goat-827B}.}
Paperid:1877
Authors:Sidhanth Holalkere · David S Bindel · Silvia Sellán · Alexander Terenin
Abstract:
Poisson Surface Reconstruction is a widely used algorithm for reconstructing a surface from an oriented point cloud. To facilitate applications where only partial surface information is available, or scanning is performed sequentially, a recent line of work proposes to incorporate uncertainty into the reconstructed surface via Gaussian process models. The resulting algorithms first perform Gaussian process interpolation, then solve a set of volumetric partial differential equations globally in space, resulting in a computationally expensive two-stage procedure. In this work, we apply recently-developed techniques from geometric Gaussian processes to combine interpolation and surface reconstruction into a single stage, requiring only one linear solve per sample. The resulting reconstructed surface samples can be queried locally in space, without the use of problem-dependent volumetric meshes or grids. These capabilities enable one to (a) perform probabilistic collision detection locally around the region of interest, (b) perform ray casting without evaluating points not on the ray's trajectory, and (c) perform next-view planning on a per-slice basis. They also improve reconstruction quality, by not requiring one to approximate kernel matrix inverses with diagonal matrices as part of intermediate computations. Results show that our approach provides a cleaner, more-principled, and more-flexible stochastic surface reconstruction pipeline.
Paperid:1878
Authors:Vincent Plassier · Alexander Fishkov · Victor Dheur · Mohsen Guizani · Souhaib Ben Taieb · Maxim Panov · Eric Moulines
Abstract:
We present a new method for generating confidence sets within the split conformal prediction framework. Our method performs a trainable transformation of any given conformity score to improve conditional coverage while ensuring exact marginal coverage. The transformation is based on an estimate of the conditional quantile of conformity scores. The resulting method is particularly beneficial for constructing adaptive confidence sets in multi-output problems where standard conformal quantile regression approaches have limited applicability. We develop a theoretical bound that captures the influence of the accuracy of the quantile estimate on the approximate conditional validity, unlike classical bounds for conformal prediction methods that only offer marginal coverage. We experimentally show that our method is highly adaptive to the local data structure and outperforms existing methods in terms of conditional coverage, improving the reliability of statistical inference in various applications.
Paperid:1879
Authors:Zongmeng Zhang · Wengang Zhou · Jie Zhao · Houqiang Li
Abstract:
Despite the impressive capabilities of multimodal large language models (MLLMs) in vision-language tasks, they are prone to hallucinations in real-world scenarios. This paper investigates the hallucination phenomenon in MLLMs from the perspective of modality conflict. Unlike existing works focusing on the conflicts between model responses and inputs, we study the inherent conflicts in inputs from different modalities that place MLLMs in a dilemma and directly lead to hallucinations. We formally define the modality conflict and construct a dataset named Multimodal Modality Conflict (MMMC) to simulate this phenomenon in vision-language tasks. Three methods based on prompt engineering, supervised fine-tuning, and reinforcement learning are proposed to alleviate the hallucination caused by modality conflict. Extensive experiments are conducted on the MMMC dataset to analyze the merits and demerits of these methods. Our results show that the reinforcement learning method achieves the best performance in mitigating the hallucination under modality conflict, while the supervised fine-tuning method shows promising and stable performance. Our work sheds light on the unnoticed modality conflict that leads to hallucinations and provides more insights into the robustness of MLLMs.
Paperid:1880
Authors:Peter Sorrenson · Daniel Behrend-Uriarte · Christoph Schnörr · Ullrich Koethe
Abstract:
Density-based distances (DBDs) provide a principled approach to metric learning by defining distances in terms of the underlying data distribution. By employing a Riemannian metric that increases in regions of low probability density, shortest paths naturally follow the data manifold. Fermat distances, a specific type of DBD, have attractive properties, but existing estimators based on nearest neighbor graphs suffer from poor convergence due to inaccurate density estimates. Moreover, graph-based methods scale poorly to high dimensions, as the proposed geodesics are often insufficiently smooth. We address these challenges in two key ways. First, we learn densities using normalizing flows. Second, we refine geodesics through relaxation, guided by a learned score model. Additionally, we introduce a dimension-adapted Fermat distance that scales intuitively to high dimensions and improves numerical stability. Our work paves the way for the practical use of density-based distances, especially in high-dimensional spaces.
Paperid:1881
Authors:Xun Xian · Ganghua Wang · Xuan Bi · Rui Zhang · Jayanth Srinivasa · Ashish Kundu · Charles Fleming · Mingyi Hong · Jie Ding
Abstract:
Retrieval-Augmented Generation (RAG) has been empirically shown to enhance the performance of large language models (LLMs) in knowledge-intensive domains such as healthcare, finance, and legal contexts. Given a query, RAG retrieves relevant documents from a corpus and integrates them into the LLMs' generation process. In this study, we investigate the adversarial robustness of RAG, focusing specifically on examining the retrieval system. First, across 225 different setup combinations of corpus, retriever, query, and targeted information, we show that retrieval systems are vulnerable to universal poisoning attacks in medical Q&A. In such attacks, adversaries generate poisoned documents containing a broad spectrum of targeted information, such as personally identifiable information. When these poisoned documents are inserted into a corpus, they can be accurately retrieved by any user, as long as attacker-specified queries are used. To understand this vulnerability, we discovered that the deviation from the query's embedding to that of the poisoned document tends to follow a pattern in which the high similarity between the poisoned document and the query is retained, thereby enabling precise retrieval. Based on these findings, we develop a new detection-based defense to ensure the safe use of RAG. Through extensive experiments spanning various Q&A domains, we observed that our proposed method consistently achieves excellent detection rates in nearly all cases.
Paperid:1882
Authors:Yifan Wang · Hourun Li · Ling Yue · Zhiping Xiao · Jia Yang · Changling Zhou · Wei Ju · Ming Zhang · Xiao Luo
Abstract:
Graph neural networks (GNNs) have demonstrated superior performance in graph fairness learning, which aims to ensure that predictions are unbiased with respect to sensitive attributes. However, existing approaches usually assume that training and test data share the same distribution, which rarely holds in the real world. To address this challenge, we propose a novel approach named Dual Unbiased Expansion with Group-acquired Alignment (DANCE) for graph fairness learning under distribution shifts. The core idea of our DANCE is to generate challenging yet unbiased virtual graph data in both graph and hidden spaces, simulating distribution shifts from a data-centric view. Specifically, we introduce the unbiased Mixup in the hidden space, prioritizing minor groups to address the potential imbalance of sensitive attributes. Simultaneously, we conduct fairness-aware adversarial learning in the graph space to focus on challenging samples and improve model robustness. To further bridge the domain gap, we propose a group-acquired alignment objective that prioritizes negative pair groups with identical sensitive labels. Additionally, a representation disentanglement objective is adopted to decorrelate sensitive attributes and target representations for enhanced fairness. Extensive experiments on various benchmark datasets validate the effectiveness of the proposed DANCE compared to existing methods.
Paperid:1883
Authors:Maynara de Souza · Cleber Zanchettin
Abstract:
Data scarcity and quality issues remain significant barriers to developing robust predictive models in medical research. Traditional reliance on real-world data often leads to biased models with poor generalizability across diverse patient populations. Synthetic data generation has emerged as a promising solution, yet challenges related to these samples' representativeness and effective utilization persist. This paper introduces Profile2Gen, a novel data-centric framework designed to guide the generation and refinement of synthetic data, focusing on addressing hard-to-learn samples in regression tasks. We conducted approximately 18,000 experiments to validate its effectiveness across six medical datasets, utilizing seven state-of-the-art generative models. Results demonstrate that refined synthetic samples can reduce predictive errors and enhance model reliability. Additionally, we generalize the DataIQ framework to support regression tasks, enabling its application in broader contexts. Statistical analyses confirm that our approach achieves equal or superior performance compared to models trained exclusively on real data.
Paperid:1884
Authors:Yucheng Xie · Fu Feng · Ruixiao Shi · Jing Wang · Xin Geng · Yong Rui
Abstract:
Pre-trained models have become the preferred backbone due to the increasing complexity of model parameters. However, traditional pre-trained models often face deployment challenges due to their fixed sizes, and are prone to negative transfer when discrepancies arise between training tasks and target tasks. To address this, we propose \textbf{KIND}, a novel pre-training method designed to construct decomposable models. KIND integrates knowledge by incorporating Singular Value Decomposition (SVD) as a structural constraint, with each basic component represented as a combination of a column vector, singular value, and row vector from the $U$, $\Sigma$, and $V$ matrices. These components are categorized into \textbf{learngenes} for encapsulating class-agnostic knowledge and \textbf{tailors} for capturing class-specific knowledge, with knowledge diversion facilitated by a class gate mechanism during training. Extensive experiments demonstrate that models pre-trained with KIND can be decomposed into learngenes and tailors, which can be adaptively recombined for diverse resource-constrained deployments. Moreover, for tasks with large domain shifts, transferring only learngenes with task-agnostic knowledge, when combined with randomly initialized tailors, effectively mitigates domain shifts.
Paperid:1885
Authors:Jiale Fu · Yuchu Jiang · Junkai Chen · Jiaming Fan · Xin Geng · Xu Yang
Abstract:
Ensemble methods enhance Large Language Models (LLMs) by combining multiple models but suffer from high computational costs. In this paper, we introduce Speculative Ensemble (SE), a novel framework that accelerates LLM ensembles without sacrificing performance, inspired by Speculative Decoding—where a small proposal model generates tokens sequentially, and a larger target model verifies them in parallel. Our approach builds on two key insights: (1) the verification distribution can be the ensemble distribution of both the proposal and target models, and (2) alternating each model as the proposer and verifier can further enhance efficiency. We generalize this method to ensembles with n models and theoretically prove that SE is never slower than a standard ensemble, typically achieving faster speed. Extensive experiments demonstrate speed improvements of 1.11x–2.23x over standard ensemble techniques without compromising generation quality. Our code is available at https://anonymous.4open.science/r/SpeculativeEnsemble/.
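The verification rule this framework builds on is compact enough to sketch. Below is a minimal numpy illustration of the first insight, verifying a drafted token against the ensemble distribution; the function names, toy vocabulary, and distributions are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def ensemble_dist(p_a, p_b, w=0.5):
    """Token-level ensemble of two models' next-token distributions."""
    return w * np.asarray(p_a) + (1 - w) * np.asarray(p_b)

def accept_prob(p_proposal, p_verify, token):
    """Speculative-decoding acceptance rule: keep a drafted token with
    probability min(1, p_verify[token] / p_proposal[token])."""
    return min(1.0, p_verify[token] / p_proposal[token])

# Toy 4-token vocabulary: the small model drafts from p_small, and the
# ensemble of both models acts as the verification distribution.
p_small = np.array([0.7, 0.1, 0.1, 0.1])
p_large = np.array([0.4, 0.3, 0.2, 0.1])
p_ens = ensemble_dist(p_small, p_large)
print(accept_prob(p_small, p_ens, token=0))  # min(1, 0.55 / 0.7)
```

Tokens the proposal model over-weights relative to the ensemble are sometimes rejected and resampled, so the accepted stream follows the ensemble distribution exactly.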
Paperid:1886
Authors:Hengrui Lou · Zunlei Feng · Jinsong Geng · Erteng Liu · Jie Lei · Lechao Cheng · Jie Song · Mingli Song · Yijun Bei
Abstract:
With the rise of AIGC technologies, particularly diffusion models, highly realistic fake images that can deceive human visual perception have become feasible. Consequently, various forgery detection methods have emerged. However, existing methods treat the generation process of fake images as either a black-box or an auxiliary tool, offering limited insights into its underlying mechanisms. In this paper, we propose Spatio-Temporal Distribution Fitting Deviation (STD-FD) for AIGC forgery detection, which explores the generative process in detail. By decomposing and reconstructing data within generative diffusion models, initial experiments reveal temporal distribution fitting deviations during the image reconstruction process. These deviations are captured through reconstruction noise maps for each spatial semantic unit, derived via a super-resolution algorithm. Critical discriminative patterns, termed DFactors, are identified through statistical modeling of these deviations. Extensive experiments show that STD-FD effectively captures distribution patterns in AIGC-generated data, demonstrating strong robustness and generalizability while outperforming state-of-the-art (SOTA) methods on major datasets. The source code is available at this link.
Paperid:1887
Authors:Zhengpeng Xie · Qiang Zhang · Fan Yang · Marco Hutter · Renjing Xu
Abstract:
Model-free reinforcement learning algorithms have seen remarkable progress, but key challenges remain. Trust Region Policy Optimization (TRPO) is known for ensuring monotonic policy improvement through conservative updates within a trust region, backed by strong theoretical guarantees. However, its reliance on complex second-order optimization limits its practical efficiency. Proximal Policy Optimization (PPO) addresses this by simplifying TRPO's approach using ratio clipping, improving efficiency but sacrificing some theoretical robustness. This raises a natural question: Can we combine the strengths of both methods? In this paper, we introduce Simple Policy Optimization (SPO), a novel unconstrained first-order algorithm. By slightly modifying the policy loss used in PPO, SPO can achieve the best of both worlds. Our new objective improves upon ratio clipping, offering stronger theoretical properties and better constraining the probability ratio within the trust region. Empirical results demonstrate that SPO outperforms PPO with a simple implementation, particularly for training large, complex network architectures end-to-end.
Paperid:1888
Authors:Boaz Taitler · Omer Ben-Porat
Abstract:
The rise of Generative AI (GenAI) has significantly impacted human-based forums like Stack Overflow, which are essential for generating high-quality data. This creates a negative feedback loop, hindering the development of GenAI systems, which rely on such data to provide accurate responses. In this paper, we provide a possible remedy: a novel strategy we call selective response. Selective response implies that GenAI could strategically provide inaccurate (or conservative) responses to queries involving emerging topics and novel technologies, thereby driving users to use human-based forums like Stack Overflow. We show that selective response can potentially have a compounding effect on the data generation process, increasing both GenAI's revenue and user welfare in the long term. From an algorithmic perspective, we propose an approximately optimal approach to maximize GenAI's revenue under social welfare constraints. From a regulatory perspective, we derive necessary and sufficient conditions under which selective response improves welfare.
Paperid:1889
Authors:Blake Bordelon · Cengiz Pehlevan
Abstract:
We theoretically characterize gradient descent dynamics in deep linear networks trained at large width from random initialization and on large quantities of random data. Our theory captures the ``wider is better" effect of mean-field/maximum-update parameterized networks as well as hyperparameter transfer effects, which can be contrasted with the neural-tangent parameterization where optimal learning rates shift with model width. We provide asymptotic descriptions of both non-residual and residual neural networks, the latter of which enables an infinite depth limit when branches are scaled as $1/\sqrt{\text{depth}}$. We also compare training with one-pass stochastic gradient descent to the dynamics when training data are repeated at each iteration. Lastly, we show that this model recovers the accelerated power law training dynamics for power law structured data in the rich regime observed in recent works.
Paperid:1890
Authors:Daniil Laptev · Nikita Balagansky · Yaroslav Aksenov · Daniil Gavrilov
Abstract:
We introduce a new approach to systematically map features discovered by sparse autoencoders across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.
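A data-free cosine-similarity matching of SAE features, as described at a high level here, can be sketched by comparing decoder directions alone; the array shapes and the permutation sanity check below are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np

def match_features(dec_a, dec_b):
    """Match each feature (decoder column) of SAE A to its most
    cosine-similar feature of SAE B, with no activation data required."""
    a = dec_a / np.linalg.norm(dec_a, axis=0, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=0, keepdims=True)
    sim = a.T @ b                       # (n_features_a, n_features_b)
    return sim.argmax(axis=1), sim.max(axis=1)

# Sanity check: if layer B's dictionary is a permutation of layer A's,
# matching should recover the inverse permutation exactly.
rng = np.random.default_rng(0)
dec_a = rng.normal(size=(32, 6))        # model dim 32, 6 features
perm = np.array([2, 0, 1, 5, 4, 3])
dec_b = dec_a[:, perm]
matches, sims = match_features(dec_a, dec_b)
print(matches, sims.min())
```

Chaining such matches layer by layer yields the kind of feature flow graph the abstract describes.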
Paperid:1891
Authors:Dachuan Shi · Yonggan Fu · Xiangchi Yuan · Zhongzhi Yu · Haoran You · Sixu Li · Xin Dong · Jan Kautz · Pavlo Molchanov · Yingyan (Celine) Lin
Abstract:
Recent advancements in Large Language Models (LLMs) have spurred interest in numerous applications requiring robust long-range capabilities, essential for processing extensive input contexts and continuously generating extended outputs. As sequence lengths increase, the number of Key-Value (KV) pairs in LLMs escalates, creating a significant efficiency bottleneck. In this paper, we propose a new KV cache optimization paradigm called LaCache, a training-free method for efficient and accurate generative inference of LLMs. LaCache enables LLMs to simultaneously satisfy both critical requirements for long-range LLMs: robust long-range capabilities and continuous generation without running out of memory (OOM). Specifically, LaCache integrates two key innovations: (1) a ladder-shaped KV cache pattern that stores KV pairs not only sequentially (left-to-right within each layer) but also across layers (from shallow to deep), providing an extended span for capturing long-range dependencies under a fixed storage budget, thereby boosting long-range capabilities; and (2) an iterative compaction mechanism that progressively compresses older caches, freeing up space for new tokens within a fixed cache size. This token distance-based dynamic compression enables more effective continuous generation under constrained cache budgets, facilitating optimal cache utilization. Experiments across various tasks, benchmarks, and LLM models consistently validate LaCache's effectiveness in enhancing LLMs' long-range capabilities. The code will be released upon acceptance.
Paperid:1892
Authors:Giovanni Catania · Aurélien Decelle · Cyril Furtlehner · Beatriz Seoane
Abstract:
We investigate the impact of limited data on training pairwise energy-based models for inverse problems aimed at identifying interaction networks. Utilizing the Gaussian model as a testbed, we dissect training trajectories across the eigenbasis of the coupling matrix, exploiting the independent evolution of eigenmodes and revealing that the learning timescales are tied to the spectral decomposition of the empirical covariance matrix. We see that optimal points for early stopping arise from the interplay between these timescales and the initial conditions of training. Moreover, we show that finite data corrections can be accurately modeled through asymptotic random matrix theory calculations and provide the counterpart of generalized cross-validation in the energy-based model context. Our analytical framework extends to binary-variable maximum-entropy pairwise models with minimal variations. These findings offer strategies to control overfitting in discrete-variable models through empirical shrinkage corrections, improving the management of overfitting in energy-based generative models.
Paperid:1893
Authors:Xin Huang · Eric M. Wolff · Paul Vernaza · Tung Phan-Minh · Hongge Chen · David Hayden · Mark Edmonds · Brian Pierce · Xinxin Chen · Pratik Elias Jacob · Xiaobai Chen · Chingiz Tairbekov · Pratik Agarwal · Tianshi Gao · Yuning Chai · Siddhartha Srinivasa
Abstract:
We present DriveGPT, a scalable behavior model for autonomous driving. We model driving as a sequential decision-making task, and learn a transformer model to predict future agent states as tokens in an autoregressive fashion. We scale up our model parameters and training data by multiple orders of magnitude, enabling us to explore the scaling properties in terms of dataset size, model parameters, and compute. We evaluate DriveGPT across different scales in a planning task, through both quantitative metrics and qualitative examples, including closed-loop driving in complex real-world scenarios. In a separate prediction task, DriveGPT outperforms state-of-the-art baselines and exhibits improved performance by pretraining on a large-scale dataset, further validating the benefits of data scaling.
Paperid:1894
Authors:Yuqi Luo · Chenyang Song · Xu Han · Yingfa Chen · Chaojun Xiao · Xiaojun Meng · Liqun Deng · Jiansheng Wei · Zhiyuan Liu · Maosong Sun
Abstract:
Activation sparsity denotes the existence of substantial weakly contributing neurons within feed-forward networks of large language models (LLMs), providing wide potential benefits such as computation acceleration. However, existing works lack thorough quantitative studies on this useful property, in terms of both its measurement and influential factors. In this paper, we address three underexplored research questions: (1) How can activation sparsity be measured more accurately? (2) How is activation sparsity affected by the model architecture and training process? (3) How can we build a more sparsely activated and efficient LLM? Specifically, we develop a generalizable and performance-friendly metric, named CETT-PPL-1\%, to measure activation sparsity. Based on CETT-PPL-1\%, we quantitatively study the influence of various factors and observe several important phenomena, such as the convergent power-law relationship between sparsity and training data amount, the higher competence of ReLU activation than mainstream SiLU activation, the potential sparsity merit of a small width-depth ratio, and the scale insensitivity of activation sparsity. Finally, we provide implications for building sparse and effective LLMs, and demonstrate the reliability of our findings by training a 2.4B model with a sparsity ratio of 93.52\%, showing 4.1$\times$ speedup compared with its dense version.
Paperid:1895
Authors:Gen Li · Yuchen Jiao
Abstract:
Diffusion models have emerged as a powerful framework for generative modeling, with guidance techniques playing a crucial role in enhancing sample quality. Despite their empirical success, a comprehensive theoretical understanding of the guidance effect remains limited. Existing studies focus only on special cases, where the distribution conditioned on each class is either isotropic Gaussian or supported on a one-dimensional interval with some extra conditions. How to analyze the guidance effect beyond these case studies remains an open question. Towards closing this gap, we make an attempt to analyze diffusion guidance under general data distributions. Rather than demonstrating uniform sample quality improvement, which does not hold in some distributions, we prove that guidance can improve overall sample quality, in the sense that the ratio of bad samples (measured by the classifier probability) decreases in the presence of guidance. This aligns with the motivation of introducing guidance.
Paperid:1896
Authors:Qingyuan Wu · Simon Zhan · Yuhui Wang · Yixuan Wang · Chung-Wei Lin · Chen Lv · Qi Zhu · Jürgen Schmidhuber · Chao Huang
Abstract:
Reinforcement learning (RL) with delays is challenging as sensory perceptions lag behind the actual events: the RL agent needs to estimate the real state of its environment based on past observations. State-of-the-art (SOTA) methods typically employ recursive, step-by-step forecasting of states. This can cause the accumulation of compounding errors. To tackle this problem, our novel belief estimation method named Directly Forecasting Belief Transformer (DFBT) directly forecasts states from observations without incrementally estimating intermediate states step-by-step. We theoretically demonstrate that DFBT greatly reduces compounding errors of existing recursively forecasting methods, yielding stronger performance guarantees. In experiments with D4RL offline datasets, DFBT reduces compounding errors with remarkable prediction accuracy. DFBT's capability to forecast state sequences also facilitates multi-step bootstrapping, thus greatly improving learning efficiency. On the MuJoCo benchmark, our DFBT-based method substantially outperforms SOTA baselines.
Paperid:1897
Authors:Noah Golowich · Samy Jelassi · David Brandfonbrener · Sham Kakade · Eran Malach
Abstract:
Training large language models to predict beyond their training context lengths has drawn much attention in recent years, yet the principles driving such behavior of length generalization remain underexplored. We propose a new theoretical framework to study length generalization for the next-token prediction task, as performed by decoder-only transformers. Conceptually, we show that length generalization occurs as long as each predicted token depends on a small (fixed) number of previous tokens. We formalize such tasks via a notion we call k-sparse planted correlation distributions, and show that an idealized model of transformers with generalized attention heads successfully length-generalizes on such tasks. As a bonus, our theoretical model allows us to provide justifications for techniques to modify positional embeddings which have been introduced to improve length generalization, such as position coupling. We support our theoretical results with experiments on synthetic tasks and natural language, which confirm that a key factor driving length generalization is indeed a ``sparse'' dependency structure of each token on the previous ones. Further, inspired by our theory, we introduce Predictive Position Coupling, a generalization of position coupling which trains the transformer to predict the position IDs used in a positional coupling approach. Predictive Position Coupling thereby allows us to broaden the array of tasks to which Position Coupling can successfully be applied to achieve length generalization.
Paperid:1898
Authors:Pierre Boyeau · Anastasios Angelopoulos · Tianle Li · Nir Yosef · Jitendra Malik · Michael Jordan
Abstract:
The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can be used to decrease the number of human annotations required for this purpose in a process called autoevaluation. We suggest efficient and statistically principled algorithms for this purpose that improve sample efficiency while remaining unbiased.
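One standard instance of such a debiased autoevaluation estimator is the prediction-powered mean: average the AI labels on the large unlabeled pool, then add a rectifier measured on the human-labeled subset. The sketch below is a generic illustration of that idea, not the paper's exact algorithm.

```python
import numpy as np

def ppi_mean(ai_unlabeled, ai_labeled, human_labeled):
    """Prediction-powered mean estimate: the AI-label average on the big
    unlabeled pool, debiased by the gap observed where human labels exist."""
    rectifier = np.mean(human_labeled) - np.mean(ai_labeled)
    return np.mean(ai_unlabeled) + rectifier

# Tiny worked example: the AI grader systematically under-scores by ~0.10.
human = np.array([1.0, 0.0, 1.0, 1.0])          # human mean 0.75
ai_on_labeled = np.array([0.9, 0.2, 0.7, 0.8])  # AI mean 0.65 on same items
ai_on_pool = np.full(1000, 0.6)                 # AI labels on unlabeled pool
print(ppi_mean(ai_on_pool, ai_on_labeled, human))  # 0.6 + 0.10 = 0.70
```

The rectifier removes the AI grader's systematic bias while the large pool keeps the variance low, which is exactly the sample-efficiency-without-bias trade the abstract describes.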
Paperid:1899
Authors:Mathilde Papillon · Guillermo Bernardez · Claudio Battiloro · Nina Miolane
Abstract:
Graph Neural Networks (GNNs) excel in learning from relational datasets, processing node and edge features in a way that preserves the symmetries of the graph domain. However, many complex systems---such as biological or social networks---involve multiway complex interactions that are more naturally represented by higher-order topological domains. The emerging field of Topological Deep Learning (TDL) aims to accommodate and leverage these higher-order structures. Combinatorial Complex Neural Networks (CCNNs), fairly general TDL models, have been shown to be more expressive and better performing than GNNs. However, differently from the GNN ecosystem, TDL lacks a principled and standardized framework for easily defining new architectures, restricting its accessibility and applicability. To address this issue, we introduce Generalized CCNNs (GCCNs), a novel simple yet powerful family of TDL models that can be used to systematically transform any (graph) neural network into its TDL counterpart. We prove that GCCNs generalize and subsume CCNNs, while extensive experiments on a diverse class of GCCNs show that these architectures consistently match or outperform CCNNs, often with less model complexity. In an effort to accelerate and democratize TDL, we introduce TopoTune, a lightweight software for defining, building, and training GCCNs with unprecedented flexibility and ease.
Paperid:1900
Authors:Petar Veličković · Christos Perivolaropoulos · Federico Barbero · Razvan Pascanu
Abstract:
A key property of reasoning systems is the ability to make sharp decisions on their input data. For contemporary AI systems, a key carrier of sharp behaviour is the softmax function, with its capability to perform differentiable query-key lookups. It is a common belief that the predictive power of networks leveraging softmax arises from "circuits" which sharply perform certain kinds of computations consistently across many diverse inputs. However, for these circuits to be robust, they would need to generalise well to arbitrary valid inputs. In this paper, we dispel this myth: even for tasks as simple as finding the maximum key, any learned circuitry must disperse as the number of items grows at test time. We attribute this to a fundamental limitation of the softmax function to robustly approximate sharp functions with increasing problem size, prove this phenomenon theoretically, and propose adaptive temperature as an ad-hoc technique for improving the sharpness of softmax at inference time.
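The dispersion effect is easy to reproduce numerically: fix the logit gap between the maximum item and the rest, grow the number of items, and watch the winner's softmax weight decay; lowering the inference-time temperature (here an arbitrary fixed value standing in for the paper's adaptive scheme) re-sharpens it.

```python
import numpy as np

def softmax(z, temperature=1.0):
    z = np.asarray(z, dtype=float) / temperature
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def max_item_weight(n, gap=2.0, temperature=1.0):
    """Softmax weight on the single 'maximum' item among n items whose
    logit exceeds all the others by a fixed gap."""
    logits = np.zeros(n)
    logits[0] = gap
    return softmax(logits, temperature)[0]

for n in (8, 64, 512):
    print(n, max_item_weight(n))         # weight disperses as n grows
print(max_item_weight(512, temperature=0.25))  # temperature re-sharpens
```

With a fixed gap the winner's weight behaves like e^gap / (e^gap + n - 1), vanishing as n grows, which is the dispersion the paper formalizes.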
Paperid:1901
Authors:Hantao Lou · Changye Li · Jiaming Ji · Yaodong Yang
Abstract:
With the integration of image modality, the semantic space of multimodal large language models (MLLMs) is more complex than that of text-only models, making their interpretability more challenging and their alignment less stable, particularly susceptible to low-quality data, which can lead to inconsistencies between modalities, hallucinations, and biased outputs. As a result, developing interpretability methods for MLLMs is crucial for improving alignment quality and efficiency. In text-only LLMs, Sparse Autoencoders (SAEs) have gained attention for their ability to interpret latent representations. However, extending SAEs to multimodal settings presents new challenges due to modality fusion and the difficulty of isolating cross-modal representations. To address these challenges, we introduce SAE-V, a mechanistic interpretability framework that extends the SAE paradigm to MLLMs. By identifying and analyzing interpretable features along with their corresponding data, SAE-V enables fine-grained interpretation of both model behavior and data quality, facilitating a deeper understanding of cross-modal interactions and alignment dynamics. Moreover, by utilizing cross-modal feature weighting, SAE-V provides an intrinsic data filtering mechanism to enhance model alignment without requiring additional models. Specifically, when applied to the alignment process of MLLMs, SAE-V-based data filtering methods could achieve more than 110% performance with less than 50% of the data. Our results highlight SAE-V’s ability to enhance interpretability and alignment in MLLMs, providing insights into their internal mechanisms.
Paperid:1902
Authors:Bo-Han Lai · Pin-Han Huang · Bo-Han Kung · Shang-Tse Chen
Abstract:
Lipschitz neural networks are well-known for providing certified robustness in deep learning. In this paper, we present a novel, efficient Block Reflector Orthogonal (BRO) layer that enhances the capability of orthogonal layers for constructing more expressive Lipschitz neural architectures. In addition, by theoretically analyzing the nature of Lipschitz neural networks, we introduce a new loss function that employs an annealing mechanism to increase the margin for most data points. This enables Lipschitz models to provide better certified robustness. By employing our BRO layer and loss function, we design BRONet — a simple yet effective Lipschitz neural network that achieves state-of-the-art certified robustness. Extensive experiments and empirical analysis on CIFAR-10/100, Tiny-ImageNet, and ImageNet validate that our method outperforms existing baselines.
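For intuition on why reflector-style layers are 1-Lipschitz by construction, the classical block Householder reflector is a useful reference point; the sketch below is that textbook construction, offered as background rather than the paper's exact BRO parameterization.

```python
import numpy as np

def block_reflector(v):
    """Block Householder reflector I - 2 V (V^T V)^{-1} V^T, which is
    orthogonal (hence 1-Lipschitz) for any full-column-rank V."""
    n = v.shape[0]
    return np.eye(n) - 2.0 * v @ np.linalg.solve(v.T @ v, v.T)

rng = np.random.default_rng(1)
v = rng.normal(size=(16, 4))   # 4 reflection directions in 16 dimensions
w = block_reflector(v)
print(np.allclose(w.T @ w, np.eye(16)))  # orthogonal, so norm-preserving
```

Because W^T W = I exactly, a linear layer built this way preserves norms regardless of the parameter matrix V, which is what makes certified Lipschitz bounds compose through the network.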
Paperid:1903
Authors:Shreyas Kadekodi · Hayden McTavish · Berk Ustun
Abstract:
Many applications in learning and decision-making rely on procedures to aggregate ordinal preferences -- whether for voting, search, or alignment. In such tasks, individuals express preferences over a set of items by voting, rating, or making pairwise comparisons. We then summarize their collective preferences by constructing a ranking. Standard methods for preference aggregation are designed to arbitrate individual disagreements to rank items in a way that is faithful and fair. In this paper, we introduce a paradigm for *selective aggregation*, where we can either abstain from comparison or arbitrate dissent. Given a dataset of individual preferences, we summarize collective preferences as a *selective ranking*—a *partial order* that allows comparisons only between items on which at least $1 - \tau$% of individuals agree. We develop fast algorithms to construct selective rankings that achieve all possible trade-offs between comparability and disagreement, paired with practical guarantees to ensure safety and reliability. We conduct extensive experiments to benchmark our approach on real-world datasets for ranking and learning. Our results demonstrate how selective rankings promote transparency, robustness, and fairness by revealing disagreement and abstaining from arbitration.
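The abstention rule at the heart of selective aggregation can be sketched directly: keep a pairwise comparison only when enough voters agree on it. The input format and function below are illustrative assumptions, not the paper's algorithms (which also come with efficiency and safety guarantees).

```python
def selective_order(votes, tau):
    """Build a partial order keeping only pairs on which at least a
    (1 - tau) fraction of voters agree. votes[(a, b)] counts voters
    preferring a over b (an illustrative input format)."""
    items = {x for pair in votes for x in pair}
    order = set()
    for a in items:
        for b in items:
            if a == b:
                continue
            wins, losses = votes.get((a, b), 0), votes.get((b, a), 0)
            total = wins + losses
            if total and wins / total >= 1 - tau:
                order.add((a, b))        # a is ranked above b
    return order

# 10 voters compare three items; (x, y) -> number preferring x to y
votes = {("A", "B"): 9, ("B", "A"): 1,
         ("B", "C"): 6, ("C", "B"): 4}
print(selective_order(votes, tau=0.2))   # abstains on the contested B vs C pair
```

Sweeping tau from 0 to 1 traces exactly the comparability-versus-disagreement trade-off the abstract describes: small tau yields few but uncontroversial comparisons, large tau approaches a total order.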
Paperid:1904
Authors:Li Shen · Anke Tang · Yong Luo · Tao Sun · Han Hu · Xiaochun Cao
Abstract:
Pruning is a widely used technique for compressing large neural networks that eliminates weights that have minimal impact on the model's performance. Current pruning methods, exemplified by magnitude pruning, assign an importance score to each weight based on its magnitude and remove weights with scores below a certain threshold. Nonetheless, these methods often create a gap between the original dense and the pruned sparse model, potentially impairing performance. Especially when the sparsity ratio is high, the gap becomes more pronounced. To mitigate this issue, we introduce a method to bridge the gap left by pruning by utilizing a low-rank approximation of the difference between the dense and sparse matrices. Our method entails the iterative refinement of the sparse weight matrix augmented by a low-rank adjustment. This technique captures and retains the essential information often lost during pruning, thereby improving the performance of the pruned model. Furthermore, we offer a comprehensive theoretical analysis of our approach, emphasizing its convergence properties and establishing a solid basis for its efficacy. Experimental results on LLaMa models validate its effectiveness on large language models across various pruning techniques and sparsity levels. Our method shows significant improvements: at 50\% sparsity, it reduces perplexity by 53.9\% compared to conventional magnitude pruning on LLaMa-7B. Furthermore, to achieve a specific performance target, our approach enables an 8.6\% reduction in model parameters while maintaining a sparsity ratio of about 50\%.
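A single step of the core idea -- magnitude pruning followed by a low-rank SVD correction of the dense-sparse gap -- can be sketched in numpy. This is one non-iterative step under assumed shapes, not the paper's full iterative refinement.

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the given fraction of smallest-magnitude entries of w."""
    k = int(sparsity * w.size)
    thresh = np.sort(np.abs(w), axis=None)[k]
    return np.where(np.abs(w) >= thresh, w, 0.0)

def low_rank_correction(dense, sparse, rank):
    """Rank-r truncated-SVD approximation of the dense-sparse difference."""
    u, s, vt = np.linalg.svd(dense - sparse, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
w_sparse = magnitude_prune(w, sparsity=0.5)
corr = low_rank_correction(w, w_sparse, rank=8)

gap = np.linalg.norm(w - w_sparse)
gap_corrected = np.linalg.norm(w - (w_sparse + corr))
print(gap_corrected < gap)  # the low-rank term shrinks the pruning gap
```

The sparse matrix plus a rank-8 adjustment stays cheap to store and multiply, yet recovers part of the information magnitude pruning discards, which is the trade-off the method exploits.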
Paperid:1905
Authors:Brahma Pavse · Yudong Chen · Qiaomin Xie · Josiah Hanna
Abstract:
In reinforcement learning, offline value function learning is the procedure of using an offline dataset to estimate the expected discounted return from each state when taking actions according to a fixed target policy. The stability of this procedure, i.e., whether it converges to its fixed point, critically depends on the representations of the state-action pairs. Poorly learned representations can make value function learning unstable, or even divergent. Therefore, it is critical to stabilize value function learning by explicitly shaping the state-action representations. Recently, the class of bisimulation-based algorithms has shown promise in shaping representations for control. However, it is still unclear if this class of methods can \emph{stabilize} value function learning. In this work, we investigate this question and answer it affirmatively. We introduce a bisimulation-based algorithm called kernel representations for offline policy evaluation (\textsc{krope}). \textsc{krope} uses a kernel to shape state-action representations such that state-action pairs that have similar immediate rewards and lead to similar next state-action pairs under the target policy also have similar representations. We show that \textsc{krope}: 1) learns stable representations and 2) leads to lower value error than baselines. Our analysis provides new theoretical insight into the stability properties of bisimulation-based methods and suggests that practitioners can use these methods for stable and accurate evaluation of offline reinforcement learning agents.
Paperid:1906
Authors:Yefei He · Feng Chen · Yuanyu He · Shaoxuan He · Hong Zhou · Kaipeng Zhang · Bohan Zhuang
Abstract:
In this paper, we propose ZipAR, a training-free, plug-and-play parallel decoding framework for accelerating autoregressive (AR) visual generation. The motivation stems from the observation that images exhibit local structures, and spatially distant regions tend to have minimal interdependence. Given a partially decoded set of visual tokens, in addition to the original next-token prediction scheme in the row dimension, the tokens corresponding to spatially adjacent regions in the column dimension can be decoded in parallel. To ensure alignment with the contextual requirements of each token, we employ an adaptive local window assignment scheme with rejection sampling analogous to speculative decoding. By decoding multiple tokens in a single forward pass, the number of forward passes required to generate an image is significantly reduced, resulting in a substantial improvement in generation efficiency. Experiments demonstrate that ZipAR can reduce the number of model forward passes by up to 91% on the Emu3-Gen model without requiring any additional retraining.
Paperid:1907
Authors:Fabio Feser · Marina Evangelou
Abstract:
The sparse-group lasso performs both variable and group selection simultaneously, combining the strengths of the lasso and group lasso. It has found widespread use in genetics, a field that regularly involves the analysis of high-dimensional data, due to its sparse-group penalty, which allows it to utilize grouping information. However, the sparse-group lasso can be computationally expensive, due to the added shrinkage complexity, and its additional hyperparameter that needs tuning. This paper presents a novel feature reduction method, Dual Feature Reduction (DFR), that uses strong screening rules for the sparse-group lasso and the adaptive sparse-group lasso to reduce their input space before optimization, without affecting solution optimality. DFR applies two layers of screening through the application of dual norms and subdifferentials. Through synthetic and real data studies, it is shown that DFR drastically reduces the computational cost under many different scenarios.
Paperid:1908
Authors:Payman Behnam · Yaosheng Fu · Ritchie Zhao · Po-An Tsai · Zhiding Yu · Alexey Tumanov
Abstract:
Transformer-based Large Language Models rely critically on the KV cache to efficiently handle extended contexts during the decode phase. Yet, the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and capacity as decoding progresses. To address this challenge, we present RocketKV, a training-free KV cache compression strategy designed specifically to reduce both memory bandwidth and capacity demand of KV cache during the decode phase. RocketKV contains two consecutive stages. In the first stage, it performs coarse-grain KV cache eviction on the input sequence tokens with SnapKV++, a method that improves upon SnapKV by introducing adaptive pooling size and full compatibility with grouped-query attention. In the second stage, it adopts a hybrid attention method to conduct fine-grain top-k sparse attention, approximating the attention scores by leveraging both head and sequence dimensional reductions. Combining these two stages, RocketKV achieves significant KV cache fetching bandwidth and storage savings while maintaining comparable accuracy to full KV cache attention. We show that RocketKV provides end-to-end speedup by up to 3× as well as peak memory reduction by up to 31% in the decode phase on an NVIDIA H100 GPU compared to full KV cache baseline, while achieving negligible accuracy loss on a variety of long-context tasks.
Paperid:1909
Authors:Yi He · Yiming Yang · Xiaoyuan Cheng · Hai Wang · Xiao Xue · Boli Chen · Yukun Hu
Abstract:
Generating long-term trajectories of dissipative chaotic systems autoregressively is a highly challenging task. The inherent positive Lyapunov exponents amplify prediction errors over time. Many chaotic systems possess a crucial property, ergodicity on their attractors, which makes long-term prediction possible. State-of-the-art methods address ergodicity by preserving statistical properties using optimal transport techniques. However, these methods face scalability challenges due to the curse of dimensionality when matching distributions. To overcome this bottleneck, we propose a scalable transformer-based framework capable of stably generating long-term high-dimensional and high-resolution chaotic dynamics while preserving ergodicity. Our method is grounded in a physical perspective, revisiting the Von Neumann mean ergodic theorem to ensure the preservation of long-term statistics in the $\mathcal{L}^2$ space. We introduce novel modifications to the attention mechanism, making the transformer architecture well-suited for learning large-scale chaotic systems. Compared to operator-based and transformer-based methods, our model achieves better performance across five metrics, from short-term prediction accuracy to long-term statistics. In addition to our methodological contributions, we introduce a new chaotic system benchmark: a machine learning dataset of 140$k$ snapshots of turbulent channel flow along with various evaluation metrics for both short- and long-term performance, which is well-suited for machine learning research on chaotic systems.
Paperid:1910
Authors:Liangliang Shi · Shenhui Zhang · Xingbo Du · Nianzu Yang · Junchi Yan
Abstract:
Global routing (GR) is a fundamental yet NP-hard task in modern chip design and various learning techniques have been utilized to solve it. However, a persistent challenge is the inherent lack of a mechanism to guarantee the routing connectivity in the network's prediction results, necessitating post-processing search or reinforcement learning to enforce the connectivity. In this paper, we fulfill a neural GR solver called DSBRouter, which leverages the Diffusion Schr\"{o}dinger Bridge (DSB) model for GR. During training, unlike previous works that learn the mapping from noise to routes, we establish a bridge between the initial pins and the routing via DSB, which learns the forward and backward mapping between them. For inference, based on the evaluation metric (e.g. low overflow), we further introduce a sampling scheme with evaluation-based guidance to enhance the routing predictions. Note that DSBRouter is an end-to-end model that does not require a post-phase algorithm to ensure connectivity. Empirical results show that DSBRouter achieves SOTA performance on the overflow reduction in ISPD98 and part of ISPD07. In some cases, DSBRouter can even generate routes with an overflow of 0.
Paperid:1911
Authors:Erchi Wang · Yu-Xiang Wang · Yuqing Zhu
Abstract:
This paper studies the problem of differentially private empirical risk minimization (DP-ERM) for binary linear classification. We obtain an efficient $(\varepsilon,\delta)$-DP algorithm with an empirical zero-one risk bound of $\tilde{O}\left(\frac{1}{\gamma^2\varepsilon n} + \frac{|S_{\mathrm{out}}|}{\gamma n}\right)$ where $n$ is the number of data points, $S_{\mathrm{out}}$ is an arbitrary subset of data one can remove and $\gamma$ is the margin of linear separation of the remaining data points (after $S_{\mathrm{out}}$ is removed). Here, $\tilde{O}(\cdot)$ hides only logarithmic terms. In the agnostic case, we improve the existing results when the number of outliers is small. Our algorithm is highly adaptive because it does not require knowing the margin parameter $\gamma$ or outlier subset $S_{\mathrm{out}}$. We also derive a utility bound for the advanced private hyperparameter tuning algorithm.
Paperid:1912
Authors:Jiazheng Li · Lu Yu · Qing Cui · Zhiqiang Zhang · JUN ZHOU · Yanfang Ye · Chuxu Zhang
Abstract:
High-quality data plays a critical role in the pretraining and fine-tuning of large language models (LLMs), even determining their performance ceiling to some degree. Consequently, numerous data selection methods have been proposed to identify subsets of data that can effectively and efficiently enhance model performance. However, most of these methods focus on general data selection and tend to overlook the specific nuances of domain-related data. In this paper, we introduce MASS, a Mathematical data Selection framework using the Skill graph for pretraining LLMs in the mathematical reasoning domain. By taking into account the unique characteristics of mathematics and reasoning, we construct a skill graph that captures the mathematical skills and their interrelations from a reference dataset. This skill graph guides us in assigning quality scores to the target dataset, enabling us to select the top-ranked subset, which is then used to pretrain LLMs. Experimental results demonstrate the efficiency and effectiveness of MASS across different model sizes (1B and 7B) and pretraining datasets (web data and synthetic data). Specifically, in terms of efficiency, models trained on subsets selected by MASS can achieve similar performance to models trained on the original datasets, with 50\% to 70\% fewer training tokens. In terms of effectiveness, when trained on the same amount of tokens, models trained on the data selected by MASS outperform those trained on the original datasets by 3.3\% to 5.9\%. These results underscore the potential of MASS to improve both the efficiency and effectiveness of pretraining LLMs.
Paperid:1913
Authors:Antoine Gonon · Léon Zheng · Pascal Carrivain · TUNG LE
Abstract:
This paper \emph{benchmarks} and \emph{improves} existing GPU matrix multiplication algorithms specialized for Kronecker-sparse matrices, whose sparsity patterns are described by Kronecker products. These matrices have recently gained popularity as replacements for dense matrices in neural networks because they preserve accuracy while using fewer parameters. We present the first energy and time benchmarks for the multiplication with such matrices, helping users identify scenarios where Kronecker-sparse matrices are more time- and energy-efficient than their dense counterparts. Our benchmark also reveals that specialized implementations spend up to 50% of their total runtime on memory rewriting operations. To address the challenge of reducing memory transfers, we introduce a new tiling strategy for matrix multiplication with Kronecker-sparse matrices, which reduces reads and writes (I/O transfers) between levels of GPU memory. We implement this tiling strategy in a new CUDA kernel that achieves a median speed-up of 1.4x, while also cutting energy consumption by 15%. We further demonstrate the broader impact of our results by applying the new kernel to accelerate transformer inference.
Paperid:1914
Authors:Alexandre Verine · Florian Le Bronnec · Kunhao Zheng · Alexandre Allauzen · yann CHEVALEYRE · benjamin negrevergne
Abstract:
Increasing diversity in language models is a challenging yet essential objective. A common approach is raising the decoding temperature. In this work, we investigate this approach through a simplistic yet common case to provide insights into why decreasing temperature can improve quality (Precision) but increasing it often fails to boost coverage (Recall). Our analysis reveals that for a model to be effectively tunable through temperature adjustments, it must be trained toward coverage. To address this, we propose rethinking loss functions in language models, leveraging the Precision-Recall framework. Our results demonstrate that this approach helps to achieve a substantially better trade-off between Precision and Recall than merely combining negative log-likelihood training with temperature scaling. These findings offer a pathway toward more versatile and robust language modeling techniques.
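The decoding-temperature knob discussed here can be illustrated with a minimal sketch (standard temperature-scaled softmax over toy logits, not the paper's proposed loss):

```python
import numpy as np

def temperature_softmax(logits, temperature):
    """Temperature-scaled softmax: T < 1 sharpens the distribution (favoring
    Precision), T > 1 flattens it toward broader coverage (aiming at Recall)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                    # shift for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

logits = [2.0, 1.0, 0.1]                      # toy next-token logits
cold = temperature_softmax(logits, 0.5)       # mass concentrates on the top token
hot = temperature_softmax(logits, 2.0)        # mass spreads across tokens
```

Sampling from `hot` only yields real coverage gains if the model has placed mass on the diverse tokens in the first place, which is the training-side issue the abstract raises.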
Paperid:1915
Authors:Henrik Schopmans · Pascal Friederich
Abstract:
Efficient sampling of unnormalized probability densities such as the Boltzmann distribution of molecular systems is a long-standing challenge. Next to conventional approaches like molecular dynamics or Markov chain Monte Carlo, variational approaches, such as training normalizing flows with the reverse Kullback-Leibler divergence, have been introduced. However, such methods are prone to mode collapse and often do not learn to sample the full configurational space. Here, we present temperature-annealed Boltzmann generators (TA-BG) to address this challenge. First, we demonstrate that training a normalizing flow with the reverse Kullback-Leibler divergence at high temperatures is possible without mode collapse. Furthermore, we introduce a reweighting-based training objective to anneal the distribution to lower target temperatures. We apply this methodology to three molecular systems of increasing complexity and, compared to the baseline, achieve better results in almost all metrics while requiring up to three times fewer target energy evaluations. For the largest system, our approach is the only method that accurately resolves the metastable states of the system.
Paperid:1916
Authors:Yilun Kuang · Noah Amsel · Sanae Lotfi · Shikai Qiu · Andres Potapczynski · Andrew Wilson
Abstract:
The core component of attention is the scoring function, which transforms the inputs into low-dimensional queries and keys and takes the dot product of each pair. While the low-dimensional projection improves efficiency, it causes information loss for certain tasks that have intrinsically high-dimensional inputs. Additionally, attention uses the same scoring function for all input pairs, without imposing a locality bias for neighboring tokens in the sequence. In this work, we address these shortcomings by proposing new scoring functions based on computationally efficient structured matrices with high ranks, including Block Tensor-Train (BTT) and Multi-Level Low Rank (MLR) matrices. On in-context regression tasks with high-dimensional inputs, our proposed scoring functions outperform standard attention for any fixed compute budget. On language modeling, a task that exhibits locality, our MLR-based attention method achieves improved scaling laws compared to both standard attention and variants of sliding window attention. Additionally, we show that both BTT and MLR fall under a broader family of efficient structured matrices capable of encoding either full-rank or locality biases, thereby addressing significant shortcomings of standard attention.
Paperid:1917
Authors:Sayan Bhattacharya · Martín Costa · Ermiya Farokhnejad · Silvio Lattanzi · Nikos Parotsidis
Abstract:
In this paper, we consider the *metric $k$-center* problem in the fully dynamic setting, where we are given a metric space $(V,d)$ evolving via a sequence of point insertions and deletions and our task is to maintain a subset $S \subseteq V$ of at most $k$ points that minimizes the objective $\max_{x \in V} \min_{y \in S}d(x, y)$. We want to design our algorithm so that we minimize its *approximation ratio*, *recourse* (the number of changes it makes to the solution $S$) and *update time* (the time it takes to handle an update). We give a simple algorithm for dynamic $k$-center that maintains a $O(1)$-approximate solution with $O(1)$ amortized recourse and $\tilde O(k)$ amortized update time, *obtaining near-optimal approximation, recourse and update time simultaneously*. We obtain our result by combining a variant of the dynamic $k$-center algorithm of Bateni et al. [SODA'23] with the dynamic sparsifier of Bhattacharya et al. [NeurIPS'23].
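For readers unfamiliar with the objective, the classic static baseline is Gonzalez's greedy farthest-point algorithm, a 2-approximation (shown here as a plain sketch on toy 2D points; the paper's contribution is the much harder fully dynamic setting, not this procedure):

```python
import math

def gonzalez_k_center(points, k):
    """Greedy 2-approximation for static k-center: repeatedly pick the
    point farthest from the current set of centers."""
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(farthest)
    # objective value: max over points of the distance to the nearest center
    radius = max(min(math.dist(p, c) for c in centers) for p in points)
    return centers, radius

pts = [(0, 0), (0, 1), (10, 0), (10, 1), (20, 0)]
centers, radius = gonzalez_k_center(pts, 3)   # three well-spread centers
```

In the dynamic setting, rerunning such a procedure from scratch after every update would blow up both recourse and update time, which is exactly what the paper's algorithm avoids.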
Paperid:1918
Authors:Junhang Li · Yu Guo · Xian · Shengfeng He
Abstract:
Images are often obstructed by various obstacles due to capture limitations, hindering the observation of objects of interest. Most existing methods address occlusions from specific elements like fences or raindrops, but are constrained by the wide range of real-world obstructions, making comprehensive data collection impractical. To overcome these challenges, we propose Instruct2See, a novel zero-shot framework capable of handling both seen and unseen obstacles. The core idea of our approach is to unify obstruction removal by treating it as a soft-hard mask restoration problem, where any obstruction can be represented using multi-modal prompts, such as visual semantics and textual instructions, processed through a cross-attention unit to enhance contextual understanding and improve mode control. Additionally, a tunable mask adapter allows for dynamic soft masking, enabling real-time adjustment of inaccurate masks. Extensive experiments on both in-distribution and out-of-distribution obstacles show that Instruct2See consistently achieves strong performance and generalization in obstruction removal, regardless of whether the obstacles were present during the training phase.
Paperid:1919
Authors:Artavazd Maranjyan · Alexander Tyurin · Peter Richtarik
Abstract:
Asynchronous Stochastic Gradient Descent (Asynchronous SGD) is a cornerstone method for parallelizing learning in distributed machine learning. However, its performance suffers under arbitrarily heterogeneous computation times across workers, leading to suboptimal time complexity and inefficiency as the number of workers scales. While several Asynchronous SGD variants have been proposed, recent findings by Tyurin & Richtárik (NeurIPS 2023) reveal that none achieve optimal time complexity, leaving a significant gap in the literature. In this paper, we propose Ringmaster ASGD, a novel Asynchronous SGD method designed to address these limitations and tame the inherent challenges of Asynchronous SGD. We establish, through rigorous theoretical analysis, that Ringmaster ASGD achieves optimal time complexity under arbitrarily heterogeneous and dynamically fluctuating worker computation times. This makes it the first Asynchronous SGD method to meet the theoretical lower bounds for time complexity in such scenarios.
Paperid:1920
Authors:Pengsheng Guo · Alex Schwing
Abstract:
We study Variational Rectified Flow Matching, a framework that enhances classic rectified flow matching by modeling multimodal velocity vector-fields. At inference time, classic rectified flow matching 'moves' samples from a source distribution to the target distribution by solving an ordinary differential equation via integration along a velocity vector-field. At training time, the velocity vector-field is learnt by linearly interpolating between coupled samples, one drawn at random from the source distribution and one from the target distribution. This leads to ''ground-truth'' velocity vector-fields that point in different directions at the same location, i.e., the velocity vector-fields are multi-modal/ambiguous. However, since training uses a standard mean-squared-error loss, the learnt velocity vector-field averages ''ground-truth'' directions and isn't multi-modal. In contrast, variational rectified flow matching learns and samples from multi-modal flow directions. We show on synthetic data, MNIST, CIFAR-10, and ImageNet that variational rectified flow matching leads to compelling results.
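The classic training target described above can be sketched in a few lines (a toy illustration of standard rectified flow matching with a dummy zero predictor, not the paper's variational variant):

```python
import numpy as np

rng = np.random.default_rng(0)

def rectified_flow_targets(x0, x1, t):
    """Linear interpolation x_t = (1 - t) * x0 + t * x1 between coupled
    samples, with the constant velocity target x1 - x0 along the segment."""
    xt = (1 - t)[:, None] * x0 + t[:, None] * x1
    v = x1 - x0
    return xt, v

x0 = np.zeros((4, 2))                   # source samples
x1 = np.ones((4, 2))                    # target samples
t = rng.random(4)                       # one interpolation time per coupled pair
xt, v = rectified_flow_targets(x0, x1, t)

# A mean-squared-error fit against these targets averages ambiguous directions;
# here a dummy all-zero prediction incurs loss 1 because every target entry is 1.
loss = np.mean((0.0 - v) ** 2)
```

When different couplings place conflicting targets at the same `xt`, the MSE fit returns their average, which is exactly the uni-modality the variational formulation is designed to avoid.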
Paperid:1921
Authors:Shuo Yang · Bardh Prenkaj · Gjergji Kasneci
Abstract:
Shortcut learning undermines model generalization to out-of-distribution data. While the literature attributes shortcuts to biases in superficial features, we show that imbalances in the semantic distribution of sample embeddings induce spurious semantic correlations, compromising model robustness. To address this issue, we propose SCISSOR (Semantic Cluster Intervention for Suppressing ShORtcut), a Siamese network-based debiasing approach that remaps the semantic space by discouraging latent clusters exploited as shortcuts. Unlike prior data-debiasing approaches, SCISSOR eliminates the need for data augmentation and rewriting. We evaluate SCISSOR on 6 models across 4 benchmarks: Chest-XRay and Not-MNIST in computer vision, and GYAFC and Yelp in NLP tasks. Compared to several baselines, SCISSOR reports +5.3 absolute points in F1 score on GYAFC, +7.3 on Yelp, +7.7 on Chest-XRay, and +1 on Not-MNIST. SCISSOR is also highly advantageous for lightweight models with ~9.5% improvement on F1 for ViT on computer vision datasets and ~11.9% for BERT on NLP. Our study redefines the landscape of model generalization by addressing overlooked semantic biases, establishing SCISSOR as a foundational framework for mitigating shortcut learning and fostering more robust, bias-resistant AI systems.
Paperid:1922
Authors:Nikita Lagrange · Herve Isambert
Abstract:
We propose a greedy search-and-score algorithm for ancestral graphs, which include directed as well as bidirected edges, originating from unobserved latent variables. The normalized likelihood score of ancestral graphs is estimated in terms of multivariate information over relevant "$ac$-connected subset" of vertices, $\boldsymbol{C}$, that are connected through collider paths confined to the ancestor set of $\boldsymbol{C}$. For computational efficiency, the proposed two-step algorithm relies on local information scores limited to the close surrounding vertices of each node (step 1) and edge (step 2). This computational strategy, although restricted to information contributions from $ac$-connected subsets containing up to two-collider paths, is shown to outperform state-of-the-art causal discovery methods on challenging benchmark datasets.
Paperid:1923
Authors:Hongwei Zhang · Ziqi Ye · Xinyuan Wang · Xin Guo · Zenglin Xu · Yuan Cheng · Zixin Hu · Yuan Qi
Abstract:
We propose Network Automatic Relevance Determination (NARD), an extension of ARD for linear probabilistic models, to simultaneously model sparse relationships between inputs $X \in \mathbb R^{d \times N}$ and outputs $Y \in \mathbb R^{m \times N}$, while capturing the correlation structure among the $Y$. NARD employs a matrix normal prior which contains a sparsity-inducing parameter to identify and discard irrelevant features, thereby promoting sparsity in the model. Algorithmically, it iteratively updates both the precision matrix and the relationship between $Y$ and the refined inputs. To mitigate the computational inefficiencies of the $\mathcal O(m^3 + d^3)$ cost per iteration, we introduce Sequential NARD, which evaluates features sequentially, and a Surrogate Function Method, leveraging an efficient approximation of the marginal likelihood and simplifying the calculation of the determinant and inverse of an intermediate matrix. Combining the Sequential update with the Surrogate Function method further reduces computational costs. The computational complexity per iteration for these three methods is reduced to $\mathcal O(m^3+p^3)$, $\mathcal O(m^3 + d^2)$, $\mathcal O(m^3+p^2)$ respectively, where $p \ll d$ is the final number of features in the model. Our methods demonstrate significant improvements in computational efficiency with comparable performance on both synthetic and real-world datasets.
Paperid:1924
Authors:Yong Wu · Yanwei Fu · Shouyan Wang · Xinwei Sun
Abstract:
Bivariate causal discovery is challenging when unmeasured confounders exist. To adjust for the bias, previous methods employed the proxy variable (i.e., negative control outcome (NCO)) to test the treatment-outcome relationship through integral equations -- and assumed that violation of this equation indicates the causal relationship. Building on this, they could establish asymptotic properties for causal hypothesis testing. However, these methods either relied on parametric assumptions or suffered from the sample-efficiency issue. Moreover, it is unclear when this underlying integral-related assumption holds, making it difficult to justify the utility in practice. To address these problems, we first consider the scenario where only NCO is available. We propose a novel non-parametric procedure, which enjoys asymptotic properties and better sample efficiency. Moreover, we find that when NCO affects the outcome, the above integral-related assumption may not hold, rendering the causal relation unidentifiable. Informed by this, we further consider the scenario when the negative control exposure (NCE) is also available. In this scenario, we construct another integral restriction aided by this proxy, which can discover causation when NCO affects the outcome. We demonstrate these findings and the effectiveness of our proposals through comprehensive numerical studies.
Paperid:1925
Authors:Jie Gao · Rajesh Jayaram · Benedikt Kolbe · Shay Sapir · Chris Schwiegelshohn · Sandeep Silwal · Erik Waingarten
Abstract:
Randomized dimensionality reduction is a widely used algorithmic technique for speeding up large-scale Euclidean optimization problems. In this paper, we study dimension reduction for a variety of maximization problems, including max-matching, max-spanning tree, as well as various measures for dataset diversity. For these problems, we show that the effect of dimension reduction is intimately tied to the *doubling dimension* $\lambda_X$ of the underlying dataset $X$---a quantity measuring intrinsic dimensionality of point sets. Specifically, the dimension required is $O(\lambda_X)$, which we also show is necessary for some of these problems. This is in contrast to classical dimension reduction results, whose dependence grows with the dataset size $|X|$. We also provide empirical results validating the quality of solutions found in the projected space, as well as speedups due to dimensionality reduction.
Paperid:1926
Authors:Rui Xue · Tong Zhao · Neil Shah · Xiaorui Liu
Abstract:
Graph neural networks (GNNs) have demonstrated remarkable success in graph representation learning and various sampling approaches have been proposed to scale GNNs to applications with large-scale graphs. A class of promising GNN training algorithms takes advantage of historical embeddings to reduce the computation and memory cost while maintaining the model expressiveness of GNNs. However, they incur significant computation bias due to the stale feature history. In this paper, we provide a comprehensive analysis of their staleness and inferior performance on large-scale problems. Motivated by our discoveries, we propose a simple yet highly effective training algorithm (REST) to effectively reduce feature staleness, which leads to significantly improved performance and convergence across varying batch sizes, especially when staleness is predominant. The proposed algorithm seamlessly integrates with existing solutions, boasting easy implementation, while comprehensive experiments underscore its superior performance and efficiency on large-scale benchmarks. Specifically, our improvements to state-of-the-art historical embedding methods result in a 2.7\% and 3.6\% performance enhancement on the ogbn-papers100M and ogbn-products dataset respectively, accompanied by notably accelerated convergence. Our code will be made publicly available upon the acceptance of this work.
Paperid:1927
Authors:Alec Helbling · Tuna Han Salih Meral · Benjamin Hoover · Pinar Yanardag · Polo Chau
Abstract:
Do the rich representations of multi-modal diffusion transformers (DiTs) exhibit unique properties that enhance their interpretability? We introduce ConceptAttention, a novel method that leverages the expressive power of DiT attention layers to generate high-quality saliency maps that precisely locate textual concepts within images. Without requiring additional training, ConceptAttention repurposes the parameters of DiT attention layers to produce highly contextualized concept embeddings, contributing the major discovery that performing linear projections in the output space of DiT attention layers yields significantly sharper saliency maps compared to commonly used cross-attention mechanisms. Remarkably, ConceptAttention even achieves state-of-the-art performance on zero-shot image segmentation benchmarks, outperforming 11 other zero-shot interpretability methods on the ImageNet-Segmentation dataset and on a single-class subset of PascalVOC. Our work contributes the first evidence that the representations of multi-modal DiT models like Flux are highly transferable to vision tasks like segmentation, even outperforming multi-modal foundation models like CLIP.
Paperid:1928
Authors:Jie Wen · Yadong Liu · Chengliang Liu · Zhanyan Tang · Yuting He · Yulong Chen · Mu Li
Abstract:
Multi-view data involves various data forms, such as multi-feature, multi-sequence and multimodal data, providing rich semantic information for downstream tasks. The inherent challenge of incomplete multi-view missing multi-label learning lies in how to effectively utilize limited supervision and insufficient data to learn discriminative representation. Starting from the sufficiency of multi-view shared information for downstream tasks, we argue that the existing contrastive learning paradigms on missing multi-view data show limited consistency representation learning ability, leading to the bottleneck in extracting multi-view shared information. In response, we propose to minimize task-independent redundant information by pursuing the maximization of cross-view mutual information. Additionally, to alleviate the hindrance caused by missing labels, we develop a dual-branch soft pseudo-label cross-imputation strategy to improve classification performance. Extensive experiments on multiple benchmarks validate our advantages and demonstrate strong compatibility with both missing and complete data.
Paperid:1929
Authors:Junyu Luo · Yuhao Tang · Yiwei Fu · Xiao Luo · Zhizhuo KOU · Zhiping Xiao · Wei Ju · Wentao Zhang · Ming Zhang
Abstract:
Unsupervised graph domain adaptation aims to transfer graph data knowledge from a labeled source domain to an unlabeled target domain while addressing distribution shift. However, existing methods often yield suboptimal results due to the entanglement of causal-spurious features and the failure of global alignment strategies. We propose \method{}~(Sparse causal discovery with Generative intervention), a novel approach that achieves stable graph representation transfer through sparse causal modeling and dynamic intervention mechanisms. Specifically, \method{} first constructs a sparse causal graph structure, leveraging mutual information bottleneck constraints to disentangle sparse, stable causal features while compressing domain-dependent spurious correlations through variational inference. To address residual spurious correlations, we innovatively design a generative intervention mechanism that breaks local spurious couplings through cross-domain feature recombination while maintaining causal feature semantic consistency via covariance constraints. Furthermore, to mitigate error accumulation in target domain pseudo-labels, we introduce a category-adaptive dynamic calibration strategy, ensuring stable discriminative learning. Extensive experiments on multiple real-world datasets demonstrate that \method{} significantly outperforms existing baselines.
Paperid:1930
Authors:Akhiad Bercovich · Tomer Ronen · Talor Abramovich · Nir Ailon · Nave Assaf · Mohammed Dabbah · Ido Galil · Amnon Geifman · Yonatan Geifman · Izhak Golan · Netanel Haber · Ehud Karpas · Roi Koren · Itay Levy · Pavlo Molchanov · Shahar Mor · Zach Moshe · Najeeb Nabwani · Omri Puny · Ran Rubin · Itamar Schen · Ido Shahaf · Oren Tropp · Omer Argov · Ran Zilberstein · Ran El-Yaniv
Abstract:
Large language models (LLMs) offer remarkable capabilities, yet their high inference costs restrict wider adoption. While increasing parameter counts improves accuracy, it also broadens the gap between state-of-the-art capabilities and practical deployability. We present Puzzle, a hardware-aware framework that accelerates the inference of LLMs while preserving their capabilities. Using neural architecture search (NAS) at large scale, Puzzle optimizes models with tens of billions of parameters. Our approach utilizes blockwise local knowledge distillation (BLD) for parallel architecture exploration and employs mixed-integer programming for precise constraint optimization. We showcase our framework’s impact via Llama-3.1-Puzzle-51B-Instruct (Puzzle-51B), a model derived from Llama-3.1-70B-Instruct that will be released publicly. Puzzle-51B achieves a 2.17x inference throughput speedup, fitting on a single NVIDIA H100 GPU while retaining 98.4% of the original model's benchmark accuracies. Notably, it is the most accurate model supporting single H100 GPU inference with large batch sizes, despite training on only 45B tokens, far fewer than the 15T used to train Llama-70B. Lastly, we show that lightweight alignment on these derived models allows them to surpass the parent model in specific capabilities. Our work establishes that powerful LLMs can be optimized for efficient deployment with only negligible loss in quality, underscoring that inference performance, not parameter count alone, should guide model selection.
Paperid:1931
Authors:Sihui Dai · Christian Cianfarani · Arjun Bhagoji · Vikash Sehwag · Prateek Mittal
Abstract:
Abstract:Robust training methods typically defend against specific attack types, such as $\ell_p$ attacks with fixed budgets, and rarely account for the fact that defenders may encounter new attacks over time. A natural solution is to adapt the defended model to new adversaries as they arise via fine-tuning, a method which we call continual robust training (CRT). However, when implemented naively, fine-tuning on new attacks degrades robustness on previous attacks. This raises the question: \textit{how can we improve the initial training and fine-tuning of the model to simultaneously achieve robustness against previous and new attacks?} We present theoretical results which show that the gap in a model's robustness against different attacks is bounded by how far each attack perturbs a sample in the model's logit space, suggesting that regularizing with respect to this logit space distance can help maintain robustness against previous attacks. Extensive experiments on 3 datasets (CIFAR-10, CIFAR-100, and ImageNette) and over 100 attack combinations demonstrate that the proposed regularization improves robust accuracy with little overhead in training time. Our findings and open-source code lay the groundwork for the deployment of models robust to evolving attacks.
Paperid:1932
Authors:Bo Xue · Dake Bu · Ji Cheng · Yuanyu Wan · Qingfu Zhang
Abstract:
Reinforcement Learning (RL) with linear transition kernels and reward functions has recently attracted growing attention due to its computational efficiency and theoretical advancements. However, prior theoretical research in RL has primarily focused on single-objective problems, resulting in limited theoretical development for multi-objective reinforcement learning (MORL). To bridge this gap, we examine MORL under lexicographic reward structures, where rewards comprise $m$ hierarchically ordered objectives. In this framework, the agent maximizes objectives sequentially, prioritizing the highest-priority objective before considering subsequent ones. We introduce the first MORL algorithm with provable regret guarantees. For any objective $i \in \{1, 2, \cdots, m\}$, our algorithm achieves a regret bound of $\widetilde{O}(\Lambda^i(\lambda) \cdot \sqrt{d^2H^4 K})$, where $\Lambda^i(\lambda) = 1 + \lambda + \cdots + \lambda^{i-1}$, $\lambda$ quantifies the trade-off between conflicting objectives, $d$ is the feature dimension, $H$ is the episode length, and $K$ is the number of episodes. Furthermore, our algorithm can be applied in the misspecified setting, where the regret bound for the $i$-th objective becomes $\widetilde{O}(\Lambda^i(\lambda)\cdot(\sqrt{d^2H^4K}+\epsilon dH^2K))$, with $\epsilon$ denoting the degree of misspecification.
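The trade-off factor $\Lambda^i(\lambda)$ above is a plain geometric sum, which makes the scaling of the bound easy to tabulate. A minimal illustrative sketch (helper names are our own, not code from the paper; logarithmic factors and constants in $\widetilde{O}(\cdot)$ are omitted):

```python
import math

def trade_off_factor(i: int, lam: float) -> float:
    """Lambda^i(lambda) = 1 + lam + ... + lam^(i-1), a geometric sum."""
    return sum(lam ** j for j in range(i))

def regret_scale(i: int, lam: float, d: int, H: int, K: int) -> float:
    """Leading term Lambda^i(lambda) * sqrt(d^2 H^4 K) of the regret bound
    for the i-th objective (constants and log factors dropped)."""
    return trade_off_factor(i, lam) * math.sqrt(d ** 2 * H ** 4 * K)
```

With $\lambda > 1$ the factor grows geometrically in the objective index $i$, so lower-priority objectives pay an exponentially larger price for the lexicographic ordering.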
Paperid:1933
Authors:Brian Mak · Jeffrey Flanigan
Abstract:
The residual stream acts as a memory bus where transformer layers both store and access features (Elhage et al., 2021). We consider changing the mechanism for retrieving and storing information in the residual stream, and replace the residual stream of the transformer with an outer product memory matrix (Kohonen, 1972; Anderson, 1972). We call this model the Residual Matrix Transformer (RMT). We find that the RMT enjoys a number of attractive properties: 1) the size of the residual stream can be scaled independently of compute and model size, improving performance, 2) the RMT can achieve the same loss as the transformer with 58% fewer FLOPS, 25% fewer parameters, and 41% fewer training tokens, and 3) the RMT outperforms the transformer on downstream evaluations. We theoretically analyze the transformer and the RMT, and show that the RMT allows for more efficient scaling of the residual stream, as well as improved variance propagation properties.
Paperid:1934
Authors:Yi Xie · Zhanke Zhou · Chentao Cao · Qiyu Niu · Tongliang Liu · Bo Han
Abstract:
Large Language Models (LLMs) exhibit strong reasoning capabilities, which can be further enhanced through multi-agent frameworks. However, existing multi-agent methods often suffer from high computational costs and lack theoretical convergence guarantees. To address these limitations, we introduce an incomplete-information perspective to enhance the scalability of multiple LLMs by modeling them with Bayesian Nash Equilibrium (BNE) and propose \textbf{E}fficient \textbf{Co}ordination via \textbf{N}ash Equilibrium (ECON), a hierarchical reinforcement learning framework. ECON guides multiple LLMs to achieve BNE by integrating distributed reasoning and centralized commitment, ensuring that each LLM independently generates optimal answers based on its own beliefs without the need for extensive inter-agent communication. Theoretically, we prove that our framework achieves a regret bound of $O\left( \nicefrac{N\sqrt{T}}{1 - \gamma} \right)$, which grows sublinearly with $T$, while multi-agent frameworks that do not attain BNE can at best achieve $O\left( \nicefrac{\delta_{\max} T}{1 - \gamma} \right)$. Empirically, our method outperforms single-LLM approaches by 10.9\% and surpasses existing multi-LLM methods by 11.2\% on average across six benchmarks covering complex reasoning and planning tasks. Additionally, scalability experiments show that our approach efficiently integrates more models, confirming its flexibility and scalability, potentially enabling larger multi-LLM ensembles.
Paperid:1935
Authors:Ohad Einav · Nir Rosenfeld
Abstract:
Machine learning models play a key role for service providers looking to gain market share in consumer markets. However, traditional learning approaches do not take into account the existence of additional providers, who compete with each other for consumers. Our work aims to study learning in this market setting, as it affects providers, consumers, and the market itself. We begin by analyzing such markets through the lens of the learning objective, and show that accuracy cannot be the only consideration. We then propose a method for classification under competition, so that a learner can maximize market share in the presence of competitors. We show that our approach benefits the providers as well as the consumers, and find that the timing of market entry and model updates can be crucial. We demonstrate the effectiveness of our approach across a range of domains, from simple distributions to noisy datasets, and show that the market as a whole remains stable by converging quickly to an equilibrium.
Paperid:1936
Authors:Yu Zhu · Chunfeng Song · Wanli Ouyang · Shan Yu · Tiejun Huang
Abstract:
Individual brains exhibit striking structural and physiological heterogeneity, yet neural circuits can generate remarkably consistent functional properties across individuals, an apparent paradox in neuroscience. While recent studies have observed preserved neural representations in motor cortex through manual alignment across subjects, the zero-shot validation of such preservation and its generalization to more cortices remain unexplored. Here we present PNBA (Probabilistic Neural-Behavioral Representation Alignment), a new framework that leverages probabilistic modeling to address hierarchical variability across trials, sessions, and subjects, with generative constraints preventing representation degeneration. By establishing reliable cross-modal representational alignment, PNBA reveals robust preserved neural representations in monkey primary motor cortex (M1) and dorsal premotor cortex (PMd) through zero-shot validation. We further establish similar representational preservation in mouse primary visual cortex (V1), reflecting a general neural basis. These findings resolve the paradox of neural heterogeneity by establishing zero-shot preserved neural representations across cortices and species, enriching neural coding insights and enabling zero-shot behavior decoding.
Paperid:1937
Authors:Qingpo Wuwu · Chonghan Gao · Tianyu Chen · Yihang Huang · Yuekai Zhang · Jianing Wang · Jianxin Li · Haoyi Zhou · Shanghang Zhang
Abstract:
Solving partial differential equations (PDEs) using neural methods has been a longstanding scientific and engineering research pursuit. Physics-Informed Neural Networks (PINNs) have emerged as a promising alternative to traditional numerical methods for solving PDEs. However, the gap between domain-specific knowledge and deep learning expertise often limits the practical application of PINNs. Previous works typically involve manually conducting extensive PINNs experiments and summarizing heuristic rules for hyperparameter tuning. In this work, we introduce \textit{PINNsAgent}, a novel surrogation framework that leverages large language models (LLMs) and utilizes PINNs as a foundation to bridge the gap between domain-specific knowledge and deep learning. Specifically, \textit{PINNsAgent} integrates (1) Physics-Guided Knowledge Replay (PGKR), which encodes the essential characteristics of PDEs and their associated best-performing PINNs configurations into a structured format, enabling efficient knowledge transfer from solved PDEs to similar problems and (2) Memory Tree Reasoning, a strategy that effectively explores the search space for optimal PINNs architectures. By leveraging LLMs and exploration strategies, \textit{PINNsAgent} enhances the automation and efficiency of PINNs-based solutions. We evaluate \textit{PINNsAgent} on 12 benchmark PDEs, demonstrating its effectiveness in automating the surrogation process and significantly improving the efficiency and accuracy of PINNs-based solutions.
Paperid:1938
Authors:Kaiying Hou · Eran Malach · Samy Jelassi · David Brandfonbrener · Sham Kakade
Abstract:
Length generalization refers to the ability to extrapolate from short training sequences to long test sequences and is a challenge for current large language models. While prior work has proposed some architecture or data format changes to achieve length generalization, these proposals typically apply to a limited set of tasks. Building on prior scratchpad and Chain-of-Thought (CoT) techniques, we propose Turing Programs, a novel CoT strategy that decomposes an algorithmic task into steps mimicking the computation of a Turing Machine. This framework is both universal, as it can accommodate any algorithmic task, and simple, requiring only copying text from the context with small modifications. We show that by using Turing Programs, we obtain robust length generalization on a range of algorithmic tasks: addition, multiplication and in-context SGD. We then demonstrate that transformers achieve length generalization on random Turing Programs, suggesting that length generalization is possible for any algorithmic task. Finally, we theoretically prove that transformers can implement Turing Programs, constructing a simple RASP (Weiss et al.) program that simulates an arbitrary Turing machine.
Paperid:1939
Authors:Rushuang Zhou · Yuanting Zhang · Yining Dong
Abstract:
Fine-tuning large-scale pre-trained models provides an effective solution to alleviate the label scarcity problem in the detection of cardiovascular diseases (CVDs) using the electrocardiogram (ECG). However, as pre-trained models scale up, the computational costs of fine-tuning and inference become unaffordable on the low-level devices deployed for clinical applications. Moreover, maintaining model performance under tight computational budgets remains a significant challenge, and a comprehensive study that addresses both issues in a joint framework is still lacking. Here, we propose a holistic method (H-Tuning) for low-cost and efficient fine-tuning of pre-trained models on downstream datasets. The inference costs of the models fine-tuned by H-Tuning are then further reduced significantly using a knowledge distillation technique. Experiments on four ECG datasets demonstrate that the proposed framework reduces GPU memory consumption during fine-tuning by 6.34 times while achieving CVDs detection performance comparable to standard fine-tuning. With the knowledge distillation technique, model inference latency and memory consumption are reduced by 19.8 times and 4.5 times, respectively. As such, the proposed joint framework allows pre-trained models to be utilized with high computational efficiency and robust performance, exploring a path toward low-cost and efficient CVDs detection.
Paperid:1940
Authors:Divya Shyamal · Jiaqi Zhang · Caroline Uhler
Abstract:
A _combinatorial intervention_, consisting of multiple treatments applied to a single unit with potential interactive effects, has substantial applications in fields such as biomedicine, engineering, and beyond. Given $p$ possible treatments, conducting all possible $2^p$ combinatorial interventions can be laborious and quickly becomes infeasible as $p$ increases. Here we introduce the _probabilistic factorial experimental design_, formalized from how scientists perform lab experiments. In this framework, the experimenter selects a dosage for each possible treatment and applies it to a group of units. Each unit independently receives a random combination of treatments, sampled from a product Bernoulli distribution determined by the dosages. Additionally, the experimenter can carry out such experiments over multiple rounds, adapting the design in an active manner. We address the optimal experimental design problem within a novel intervention model that imposes bounded-degree interactions between treatments. In the passive setting, we provide a closed-form solution for the near-optimal design. Our results prove that a dosage of $\frac{1}{2}$ for each treatment is optimal up to a factor of $1+O(\frac{\ln(n)}{n})$ for estimating any $k$-way interaction model, regardless of $k$, and imply that $O\big(kp^{3k}\ln(p)\big)$ observations are required to accurately estimate this model. For the multi-round setting, we provide a near-optimal acquisition function that can be numerically optimized. We also explore several extensions of the design problem and finally validate our findings through simulations.
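One round of the probabilistic factorial design described above is straightforward to simulate: each unit independently draws its treatment combination from a product Bernoulli distribution given by the dosages. A minimal sketch (function and variable names are our own, not from the paper):

```python
import random

def sample_round(dosages, n_units, seed=0):
    """Return an n_units x p binary matrix of treatment assignments,
    where unit u receives treatment j with probability dosages[j]."""
    rng = random.Random(seed)
    return [[1 if rng.random() < d else 0 for d in dosages]
            for _ in range(n_units)]

# Near-optimal passive design per the abstract: dosage 1/2 per treatment.
assignments = sample_round([0.5, 0.5, 0.5], n_units=100)
```

With dosage $\frac{1}{2}$ for every treatment, all $2^p$ combinations are equally likely, which is the design shown to be near-optimal in the passive setting.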
Paperid:1941
Authors:Yu Wang · Mazdak Abulnaga · Yaël Balbastre · Bruce Fischl
Abstract:
Solving sparse linear systems and/or partial differential equations (PDEs) is fundamental to computing and learning problems in science and engineering. We substantially improve the speed of \emph{direct} sparse linear solvers by orders of magnitude, violating the common assumption that direct solvers are too expensive. The runtime on images is reduced to tens of milliseconds, suitable for interactive applications in vision, graphics, and scientific computing. We ``condense'' a sparse Laplacian matrix into a dense tensor that batch-wise stores the Dirichlet-to-Neumann matrices, reducing the sparse solve to sequentially merging pairs of dense matrices that are much smaller. These small dense systems are batched and inverted in parallel, taking advantage of dense GPU BLAS kernels, highly optimized in the era of deep learning. Methods involving linear solvers can immediately benefit from our approach, including solvers for PDEs and FEM (finite element methods). In addition, our method is differentiable and hence enables efficient integration into machine learning pipelines, and serves as an efficient zero-shot baseline for learning-based PDE solvers.
Paperid:1942
Authors:Renze Lou · Hanzi Xu · Sijia Wang · Jiangshu Du · Ryo Kamoi · Xiaoxin Lu · Jian Xie · Yuxuan Sun · Yusen Zhang · Jihyun Ahn · Hongchao Fang · Zhuoyang Zou · Wenchao Ma · Xi Li · Kai Zhang · Congying Xia · Lifu Huang · Wenpeng Yin
Abstract:
Numerous studies have assessed the proficiency of AI systems, particularly large language models (LLMs), in facilitating everyday tasks such as email writing, question answering, and creative content generation. However, researchers face unique challenges and opportunities in leveraging LLMs for their own work, such as brainstorming research ideas, designing experiments, and writing or reviewing papers. In this study, we introduce AAAR-1.0, a benchmark dataset designed to evaluate LLM performance in three fundamental, expertise-intensive research tasks: (i) EquationInference, assessing the correctness of equations based on the contextual information in paper submissions; (ii) ExperimentDesign, designing experiments to validate research ideas and solutions; and (iii) PaperWeakness, identifying weaknesses in paper submissions. AAAR-1.0 differs from prior benchmarks in two key ways: first, it is explicitly research-oriented, with tasks requiring deep domain expertise; second, it is researcher-oriented, mirroring the primary activities that researchers engage in on a daily basis. An evaluation of both open-source and proprietary LLMs reveals their potential as well as limitations in conducting sophisticated research tasks. We will release AAAR-1.0 and keep iterating it into new versions.
Paperid:1943
Authors:Chenyu Li · Oscar Michel · Xichen Pan · Sainan Liu · Mike Roberts · Saining Xie
Abstract:
Large-scale pre-trained video generation models excel in content creation but are not reliable as physically accurate world simulators out of the box. This work studies the process of post-training these models for accurate world modeling through the lens of the simple, yet fundamental, physics task of modeling objects in free fall. We show that state-of-the-art video generation models struggle with this basic task, despite their visually impressive outputs. To remedy this problem, we find that fine-tuning on a relatively small number of simulated videos is effective in inducing the dropping behavior in the model, and we can further improve results through a novel reward modeling procedure we introduce. Our study also reveals key limitations of post-training in generalization and distribution modeling. Additionally, we release a benchmark for this task that may serve as a useful diagnostic tool for measuring the emergence of accurate physics modeling in video generative models.
Paperid:1944
Authors:Saehyung Lee · Seunghyun Yoon · Trung Bui · Jing Shi · Sungroh Yoon
Abstract:
Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations. Our analysis reveals that existing hallucination detection methods struggle with detailed captions. We attribute this to the increasing reliance of MLLMs on their generated text, rather than the input image, as the sequence length grows. To address this issue, we propose a multi-agent approach that leverages LLM-MLLM collaboration to correct given captions. Additionally, we introduce an evaluation framework and a benchmark dataset to facilitate the systematic analysis of detailed captions. Our experiments demonstrate that the proposed evaluation method aligns better with human judgments of factuality than existing metrics. Moreover, we show that current approaches for enhancing MLLM factuality often fail in hyper-detailed image captioning tasks. In contrast, our approach significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V. Finally, we highlight a limitation of VQA-centric benchmarking by demonstrating that an MLLM's performance on VQA benchmarks may not correlate with its ability to generate detailed image captions.
Paperid:1945
Authors:Grigoriy Ksenofontov · Aleksandr Korotin
Abstract:
The Schrödinger Bridge (SB) is a powerful framework for solving generative modeling tasks such as unpaired domain translation. Most SB-related research focuses on continuous data space $\mathbb{R}^{D}$ and leaves open theoretical and algorithmic questions about applying SB methods to discrete data, e.g., on finite spaces $\mathbb{S}^{D}$. Notable examples of such sets $\mathbb{S}$ are codebooks of vector-quantized (VQ) representations of modern autoencoders, tokens in texts, categories of atoms in molecules, etc. In this paper, we provide a theoretical and algorithmic foundation for solving SB in discrete spaces using the recently introduced Iterative Markovian Fitting (IMF) procedure. Specifically, we theoretically justify the convergence of discrete-time IMF (D-IMF) to SB in discrete spaces. This enables us to develop a practical computational algorithm for SB, which we call Categorical Schrödinger Bridge Matching (CSBM). We demonstrate the performance of CSBM via a series of experiments with synthetic data and VQ representations of images.
Paperid:1946
Authors:Daniel Marczak · Simone Magistri · Sebastian Cygert · Bartłomiej Twardowski · Andrew Bagdanov · Joost van de Weijer
Abstract:
Model merging integrates the weights of multiple task-specific models into a single multi-task model. Despite recent interest in the problem, a significant performance gap between the combined and single-task models remains. In this paper, we investigate the key characteristics of task matrices -- weight update matrices applied to a pre-trained model -- that enable effective merging. We show that alignment between singular components of task-specific and merged matrices strongly correlates with performance improvement over the pre-trained model. Based on this, we propose an isotropic merging framework that flattens the singular value spectrum of task matrices, enhances alignment, and reduces the performance gap. Additionally, we incorporate both common and task-specific subspaces to further improve alignment and performance. Our proposed approach achieves state-of-the-art performance across multiple scenarios, including various sets of tasks and model scales. This work advances the understanding of model merging dynamics, offering an effective methodology to merge models without requiring additional training.
Paperid:1947
Authors:Yaoxiang Wang · Haoling Li · Xin Zhang · Jie Wu · Xiao Liu · Wenxiang Hu · Zhongxin Guo · Yangyu Huang · Ying Xin · Yujiu Yang · Jinsong Su · Qi Chen · Scarlett Li
Abstract:
Existing methods for code generation use code snippets as seed data, restricting the complexity and diversity of the synthesized data. In this paper, we introduce a novel feature tree-based synthesis framework, which revolves around hierarchical code features derived from high-level abstractions of code. The feature tree is constructed from raw data and refined iteratively to increase the quantity and diversity of the extracted features, allowing it to capture and recognize more complex patterns and relationships within the code. By adjusting the depth and breadth of the sampled subtrees, our framework provides precise control over the complexity of the generated code, enabling functionalities that range from function-level operations to multi-file scenarios. We fine-tuned widely used base models to obtain the EpiCoder series, achieving state-of-the-art performance on multiple benchmarks at both the function and file levels. In particular, empirical evidence indicates that our approach shows significant potential in synthesizing repository-level code data.
Paperid:1948
Authors:Alberto Sinigaglia · Davide Sartor · Gian Antonio Susto
Abstract:
Imposing input-output constraints in multi-layer perceptrons (MLPs) plays a pivotal role in many real-world applications. Monotonicity in particular is a common requirement in applications that need transparent and robust machine learning models. Conventional techniques for imposing monotonicity in MLPs by construction involve the use of non-negative weight constraints and bounded activation functions, which poses well-known optimization challenges. In this work, we generalize previous theoretical results, showing that MLPs with non-negative weight constraints and activations that saturate on alternating sides are universal approximators for monotonic functions. Additionally, we show an equivalence between the saturation side of the activations and the sign of the weight constraint. This connection allows us to prove that MLPs with convex monotone activations and non-positive constrained weights also qualify as universal approximators, in contrast to their non-negative constrained counterparts. These results provide theoretical grounding for the empirical effectiveness observed in previous works, while leading to possible architectural simplifications. Moreover, to further alleviate the optimization difficulties, we propose an alternative formulation that allows the network to adjust its activations according to the sign of the weights. This eliminates the requirement for weight reparameterization, easing initialization and improving training stability. Experimental evaluation reinforces the validity of the theoretical results, showing that our novel approach compares favorably to traditional monotonic architectures.
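The basic construction behind the abstract, non-negative weights composed with monotone activations yield a non-decreasing network, can be checked numerically. A toy sketch with our own names, not the authors' architecture (non-negativity is imposed here by taking absolute values of random weights):

```python
import math
import random

def monotone_mlp(x, layers):
    """Forward pass of a scalar-input MLP whose weights are non-negative
    and whose activation (tanh) is monotone, hence output is non-decreasing
    in x: a composition of non-decreasing maps."""
    h = [x]
    for W, b in layers:
        h = [math.tanh(sum(w * v for w, v in zip(row, h)) + bi)
             for row, bi in zip(W, b)]
    return sum(h)  # non-negative (unit) output weights preserve monotonicity

random.seed(0)
# Two layers (1 -> 4 -> 3) with weights forced non-negative via abs().
layers = [([[abs(random.gauss(0, 1))] for _ in range(4)],
           [random.gauss(0, 1) for _ in range(4)]),
          ([[abs(random.gauss(0, 1)) for _ in range(4)] for _ in range(3)],
           [random.gauss(0, 1) for _ in range(3)])]
ys = [monotone_mlp(x / 10, layers) for x in range(-50, 50)]
```

Sorting the inputs and checking that the outputs never decrease confirms the construction; the paper's contribution concerns which activation/sign pairings of this family are universal approximators and how to train them stably.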
Paperid:1949
Authors:Yangyi Shen · Jincheng Zhou · Beatrice Bevilacqua · Joshua Robinson · Charilaos Kanatsoulis · Jure Leskovec · Bruno Ribeiro
Abstract:
Traditional Graph Neural Networks (GNNs) cannot generalize to new graphs with node attributes different from the training ones, making zero-shot generalization across different node attribute domains an open challenge in graph machine learning. In this paper, we propose STAGE, which encodes statistical dependencies between attributes rather than individual attribute values, which may differ in test graphs. By assuming these dependencies remain invariant under changes in node attributes, STAGE achieves provable generalization guarantees for a family of domain shifts. Empirically, STAGE demonstrates strong zero-shot performance on medium-sized datasets: when trained on multiple graph datasets with different attribute spaces (varying in types and number) and evaluated on graphs with entirely new attributes, STAGE achieves a relative improvement in Hits@1 of 40% to 103% in link prediction and a 10% improvement in node classification compared to state-of-the-art baselines.
Paperid:1950
Authors:Haifeng Wen · Hong XING · Osvaldo Simeone
Abstract:
Post-hoc calibration of pre-trained models is critical for ensuring reliable inference, especially in safety-critical domains such as healthcare. Conformal Prediction (CP) offers a robust post-hoc calibration framework, providing distribution-free statistical coverage guarantees for prediction sets by leveraging held-out datasets. In this work, we address a decentralized setting where each device has limited calibration data and can communicate only with its neighbors over an arbitrary graph topology. We propose two message-passing-based approaches for achieving reliable inference via CP: quantile-based distributed conformal prediction (Q-DCP) and histogram-based distributed conformal prediction (H-DCP). Q-DCP employs distributed quantile regression enhanced with tailored smoothing and regularization terms to accelerate convergence, while H-DCP uses a consensus-based histogram estimation approach. Through extensive experiments, we investigate the trade-offs between hyperparameter tuning requirements, communication overhead, coverage guarantees, and prediction set sizes across different network topologies.
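For context, the centralized split-conformal calibration step that distributed CP methods must recover from scattered calibration data has a standard closed form. A minimal sketch (our own helper names; this is the textbook centralized baseline, not the paper's Q-DCP or H-DCP algorithm):

```python
import math

def conformal_quantile(scores, alpha):
    """The ceil((n+1)(1-alpha))-th smallest calibration nonconformity
    score; thresholding at this value yields marginal coverage >= 1-alpha."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(scores)[min(k, n) - 1]

def prediction_set(candidate_scores, qhat):
    """Labels whose nonconformity score does not exceed the threshold."""
    return [y for y, s in candidate_scores.items() if s <= qhat]
```

In the decentralized setting of the paper, no single device holds all the scores, so this quantile must be approximated by message passing, via distributed quantile regression (Q-DCP) or consensus over histograms (H-DCP).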
Paperid:1951
Authors:Quan Nguyen · Minh Vu · Truc Nguyen · My T. Thai
Abstract:
Federated Learning (FL) enables collaborative learning among clients via a coordinating server while avoiding direct data sharing, offering a perceived solution to preserve privacy. However, recent studies on Membership Inference Attacks (MIAs) have challenged this notion, showing high success rates against unprotected training data. While local differential privacy (LDP) is widely regarded as a gold standard for privacy protection in data analysis, most studies on MIAs either neglect LDP or fail to provide theoretical guarantees for attack success against LDP-protected data. To address this gap, we derive theoretical lower bounds for the success rates of low-polynomial-time MIAs that exploit vulnerabilities in fully connected or self-attention layers, regardless of the LDP mechanism used. We establish that even when data are protected by LDP, privacy risks persist, depending on the privacy budget. Practical evaluations on models like ResNet and Vision Transformer confirm considerable privacy risks, revealing that the noise required to mitigate these attacks significantly degrades models' utility.
Paperid:1952
Authors:Subhodh Kotekal
Abstract:
Given independent and identically distributed data from a compactly supported, $\alpha$-Hölder density $f$, we study estimation of the Fisher information of the Gaussian-smoothed density $f*\varphi_t$, where $\varphi_t$ is the density of $N(0, t)$. We derive the minimax rate, including the sharp dependence on $t$, and show that some simple, plug-in type estimators are optimal for $t > 0$, even though extra debiasing steps are widely employed in the literature to achieve the sharp rate in the unsmoothed ($t = 0$) case. Due to our result's sharp characterization of the scaling in $t$, plug-in estimators of the mutual information and entropy are shown to achieve the parametric rate by way of the I-MMSE and de Bruijn's identities.
Paperid:1953
Authors:Xue Zhao · Qinying Gu · Xinbing Wang · Chenghu Zhou · Nanyang Ye
Abstract:
Improving the generalization of multi-camera 3D object detection is essential for safe autonomous driving in the real world. In this paper, we consider a realistic yet more challenging scenario, which aims to improve generalization when only a single source domain is available for training, since gathering data from diverse domains and collecting annotations is time-consuming and labor-intensive. To this end, we propose the Fourier Cross-View Learning (FCVL) framework, including Fourier Hierarchical Augmentation (FHiAug), an augmentation strategy in the frequency domain to boost domain diversity, and a Fourier Cross-View Semantic Consistency Loss to facilitate the model's learning of more domain-invariant features from adjacent perspectives. Furthermore, we provide theoretical guarantees via augmentation graph theory. To the best of our knowledge, this is the first study to explore generalizable multi-camera 3D object detection with a single source. Extensive experiments on various testing domains demonstrate that our approach achieves the best performance among various domain generalization methods.
Paperid:1954
Authors:Antonia Wüst · Tim Tobiasch · Lukas Helff · Inga Ibs · Wolfgang Stammer · Devendra Dhami · Constantin Rothkopf · Kristian Kersting
Abstract:
Recently, newly developed Vision-Language Models (VLMs), such as OpenAI's o1, have emerged, seemingly demonstrating advanced reasoning capabilities across text and image modalities. However, the depth of these advances in language-guided perception and abstract reasoning remains underexplored, and it is unclear whether these models can truly live up to their ambitious promises. To assess the progress and identify shortcomings, we enter the wonderland of Bongard problems, a set of classic visual reasoning puzzles that require human-like abilities of pattern recognition and abstract reasoning. With our extensive evaluation setup, we show that while VLMs occasionally succeed in identifying discriminative concepts and solving some of the problems, they frequently falter. Surprisingly, even elementary concepts that may seem trivial to humans, such as simple spirals, pose significant challenges. Moreover, when explicitly asked to recognize ground truth concepts, they continue to falter, suggesting not only a lack of understanding of these elementary visual concepts but also an inability to generalize to unseen concepts. We compare the results of VLMs to human performance and observe that a significant gap remains between human visual reasoning capabilities and machine cognition.
Paperid:1955
Authors:Jonas Möller · Lukas Pirch · Felix Weissberg · Sebastian Baunsgaard · Thorsten Eisenhofer · Konrad Rieck
Abstract:
Linear algebra is a cornerstone of neural network inference. The efficiency of popular frameworks, such as TensorFlow and PyTorch, critically depends on backend libraries providing highly optimized matrix multiplications and convolutions. A diverse range of these backends exists across platforms, including Intel MKL, Nvidia CUDA, and Apple Accelerate. Although these backends provide equivalent functionality, subtle variations in their implementations can lead to seemingly negligible differences during inference. In this paper, we investigate these minor discrepancies and demonstrate how they can be selectively amplified by adversaries. Specifically, we introduce Chimera examples, inputs to models that elicit conflicting predictions depending on the employed backend library. These inputs can even be constructed with integer values, creating a vulnerability exploitable from real-world input domains. We analyze the prevalence and extent of the underlying attack surface and propose corresponding defenses to mitigate this threat.
Paperid:1956
Authors:Lijie Hu · Chenyang Ren · Zhengyu Hu · Hongbin Lin · Chenglong Wang · Zhen Tan · Weimin Lyu · Jingfeng ZHANG · Hui Xiong · Di Wang
Abstract:
Concept Bottleneck Models (CBMs) have garnered much attention for their ability to elucidate the prediction process through a human-understandable concept layer. However, most previous studies focused on cases where the data, including concepts, are clean. In many scenarios, we often need to remove or insert training data or new concepts in trained CBMs for various reasons, such as privacy concerns, data mislabelling, spurious concepts, and concept annotation errors. Thus, the challenge of deriving efficient editable CBMs without retraining from scratch persists, particularly in large-scale applications. To address these challenges, we propose Editable Concept Bottleneck Models (ECBMs). Specifically, ECBMs support three different levels of data removal: concept-label-level, concept-level, and data-level. ECBMs enjoy mathematically rigorous closed-form approximations derived from influence functions that obviate the need for retraining. Experimental results demonstrate the efficiency and effectiveness of our ECBMs, affirming their adaptability within the realm of CBMs.
Paperid:1957
Authors:Xiang Fu · Brandon Wood · Luis Barroso-Luque · Daniel S. Levine · Meng Gao · Misko Dzamba · Larry Zitnick
Abstract:
Machine learning interatomic potentials (MLIPs) have become increasingly effective at approximating quantum mechanical calculations at a fraction of the computational cost. However, lower errors on held-out test sets do not always translate to improved results on downstream physical property prediction tasks. In this paper, we propose testing MLIPs on their practical ability to conserve energy during molecular dynamics simulations. For models that pass this test, improved correlations are found between test errors and their performance on physical property prediction tasks. We identify choices which may lead to models failing this test, and use these observations to improve upon highly expressive models. The resulting model, eSEN, provides state-of-the-art results on a range of physical property prediction tasks, including materials stability prediction, thermal conductivity prediction, and phonon calculations.
Paperid:1958
Authors:Haotian Ni · Yake Wei · Hang Liu · Gong Chen · Chong Peng · Hao Lin · Di Hu
Abstract:
Multimodal learning integrates data from multiple modalities, yet effectively fusing them remains a significant challenge, particularly in scenarios where modality quality varies across samples. Dynamic fusion strategies, such as the attention mechanism in Transformers, aim to address this challenge by adaptively assigning proper weights to modalities based on input data. However, in this work, we observe that the dynamic property of attention-based fusion diminishes when tokens from one modality become biased. This bias triggers a self-reinforcing cycle during training, where the model progressively overemphasizes the biased modality, leading to a widening distribution gap in attention keys across modalities. This gap, in turn, deactivates the dynamic adaptability of the attention mechanism. To address this issue, we propose the Query Rebalance Rotation (QRR) algorithm, which rebalances the query to effectively disrupt the self-reinforcing cycle and reduce the key distribution gap. By restoring balance in attention allocation, QRR revives the cooperative dynamics in the multimodal Transformer, allowing it to better handle modality quality variations. Extensive experiments validate the effectiveness of QRR and show that restoring dynamic adaptability is pivotal for advancing the broader capabilities of multimodal Transformers.
Paperid:1959
Authors:Yi Cai · Thibaud Ardoin · Gerhard Wunder
Abstract:
Feature attribution explains machine decisions by quantifying each feature's contribution. While numerous approaches rely on exact gradient measurements, recent work has adopted gradient estimation to derive explanatory information under query-level access, a restrictive yet more practical accessibility assumption known as the black-box setting. Following this direction, this paper introduces GEFA (Gradient-estimation-based Explanation For All), a general feature attribution framework leveraging proxy gradient estimation. Unlike the previous attempt that focused on explaining image classifiers, the proposed explainer derives feature attributions in a proxy space, making it generally applicable to arbitrary black-box models, regardless of input type. In addition to its close relationship with Integrated Gradients, our approach, a path method built upon estimated gradients, surprisingly produces unbiased estimates of Shapley Values. Compared to traditional sampling-based Shapley Value estimators, GEFA avoids potential information waste sourced from computing marginal contributions, thereby improving explanation quality, as demonstrated in quantitative evaluations across various settings.
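For context, the sampling-based Shapley Value estimators that GEFA is compared against typically average marginal contributions over random feature orderings. A minimal sketch of such a generic baseline (not GEFA itself), using a hypothetical additive value function for which the permutation estimator is exact:

```python
import random

def shapley_permutation(value, features, n_perms=200, seed=0):
    """Estimate Shapley values by averaging each feature's marginal
    contribution value(S + {i}) - value(S) over random permutations."""
    rng = random.Random(seed)
    est = {i: 0.0 for i in features}
    for _ in range(n_perms):
        perm = features[:]
        rng.shuffle(perm)
        coalition, prev = set(), value(set())
        for i in perm:
            coalition.add(i)
            cur = value(coalition)
            est[i] += cur - prev   # marginal contribution of i
            prev = cur
    return {i: v / n_perms for i, v in est.items()}

# Hypothetical additive game: v(S) = sum of fixed per-feature weights.
# For additive games the marginal contribution of i is always w[i],
# so the permutation estimator recovers the weights exactly.
w = {"x1": 2.0, "x2": -1.0, "x3": 0.5}
phi = shapley_permutation(lambda S: sum(w[i] for i in S), list(w))
print(phi)  # {'x1': 2.0, 'x2': -1.0, 'x3': 0.5}
```

For non-additive value functions this estimator is unbiased but noisy, which is the information-waste issue the abstract contrasts GEFA against.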
Paperid:1960
Authors:Ron Tsibulsky · Daniel Nevo · Uri Shalit
Abstract:
Despite the impressive advancements in modern machine learning, achieving robustness in Domain Generalization (DG) tasks remains a significant challenge. In DG, models are expected to perform well on samples from unseen test distributions (also called domains), by learning from multiple related training distributions. Most existing approaches to this problem rely on single-valued predictions, which inherently limit their robustness. We argue that set-valued predictors could be leveraged to enhance robustness across unseen domains, while also taking into account that these sets should be as small as possible. We introduce a theoretical framework defining successful set prediction in the DG setting, focusing on meeting a predefined performance criterion across as many domains as possible, and provide theoretical insights into the conditions under which such domain generalization is achievable. We further propose a practical optimization method compatible with modern learning architectures, that balances robust performance on unseen domains with small prediction set sizes. We evaluate our approach on several real-world datasets from the WILDS benchmark, demonstrating its potential as a promising direction for robust domain generalization.
Paperid:1961
Authors:David Geissbühler · Hatef Otroshi Shahreza · Sébastien Marcel
Abstract:
Face recognition models are trained on large-scale datasets, which raise privacy and ethical concerns. Lately, the use of synthetic data to complement or replace genuine data for the training of face recognition models has been proposed. While promising results have been obtained, it remains unclear whether generative models can yield diverse enough data for such tasks. In this work, we introduce a new method, inspired by the physical motion of soft particles subjected to stochastic Brownian forces, allowing us to sample identity distributions in a latent space under various constraints. We introduce three complementary algorithms, called Langevin, Dispersion, and DisCo, aimed at generating large synthetic face datasets. With these in hand, we generate several face datasets and benchmark them by training face recognition models, showing that data generated with our method exceeds the performance of previous GAN-based datasets and achieves competitive performance with state-of-the-art diffusion-based synthetic datasets. While diffusion models are known to memorize training data, we prevent leakage in our new synthetic datasets, paving the way for more responsible synthetic datasets.
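The Langevin dynamics behind the first of the three algorithms can be illustrated in one dimension. A generic overdamped Langevin sampler targeting a standard Gaussian (an illustration of the dynamics only, not the paper's identity-space algorithm; all parameters below are arbitrary choices):

```python
import math
import random

def langevin_sample(grad_U, x0=0.0, steps=5000, dt=0.01, seed=0):
    """Overdamped Langevin dynamics: x <- x - grad_U(x)*dt + sqrt(2*dt)*noise.
    Its stationary distribution is proportional to exp(-U(x))."""
    rng = random.Random(seed)
    x, traj = x0, []
    for _ in range(steps):
        x = x - grad_U(x) * dt + math.sqrt(2 * dt) * rng.gauss(0.0, 1.0)
        traj.append(x)
    return traj

# U(x) = x^2 / 2 targets a standard Gaussian, so grad_U(x) = x.
traj = langevin_sample(lambda x: x)
burn = traj[1000:]                       # discard burn-in
mean = sum(burn) / len(burn)
var = sum((v - mean) ** 2 for v in burn) / len(burn)
print(round(mean, 2), round(var, 2))     # roughly 0 and 1
```

The paper's setting replaces the scalar `x` with identity vectors in a generator's latent space and adds inter-particle forces and constraints, but the update rule has this same drift-plus-noise structure.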
Paperid:1962
Authors:Yaowenhu · Wenxuan Tu · Yue Liu · Xinhang Wan · Junyi Yan · Taichun Zhou · Xinwang Liu
Abstract:
Deep graph clustering (DGC), which aims to separate the nodes in an attribute graph into different clusters without supervision, has seen substantial potential in various industrial scenarios like community detection and recommendation. However, real-world attribute graphs, e.g., social networks and user-item interactions, are usually large-scale and attribute-missing. To solve these two problems, we propose a novel DGC method termed Complementary Multi-View Neighborhood Differentiation (CMV-ND), which pre-processes graph structural information into multiple views in a complete but non-redundant manner. Firstly, to ensure the completeness of the structural information, we propose a Recursive Neighborhood Search that recursively explores the graph's local structure by completely expanding node neighborhoods across different hop distances. Secondly, to eliminate the redundancy between neighborhoods at different hops, we introduce a Neighborhood Differential Strategy that ensures no overlapping nodes between the differential hop representations. Then, we construct $K+1$ complementary views from the $K$ differential hop representations and the features of the target node. Last, we apply existing multi-view clustering or DGC methods to the views. Experimental results on six widely used graph datasets demonstrate that CMV-ND significantly improves the performance of various methods.
Paperid:1963
Authors:Zixuan Hu · Yichun Hu · Xiaotong Li · SHIXIANG TANG · LINGYU DUAN
Abstract:
Wild Test-Time Adaptation (WTTA) is proposed to adapt a source model to unseen domains under extreme data scarcity and multiple shifts. Previous approaches mainly focused on sample selection strategies, while overlooking the fundamental problem of the underlying optimization. We first critically analyze the widely-adopted entropy minimization framework in WTTA and uncover its significant limitations in noisy optimization dynamics that substantially hinder adaptation efficiency. Through our analysis, we identify region confidence as a superior alternative to traditional entropy; however, its direct optimization remains computationally prohibitive for real-time applications. In this paper, we introduce a novel region-integrated method, ReCAP, that bypasses the lengthy process. Specifically, we propose a dynamic region modification scheme that adapts to evolving model space changes. Subsequently, we develop a finite-to-infinite asymptotic approximation that transforms the intractable region confidence into a tractable and upper-bounded proxy. These innovations significantly unlock the overlooked potential dynamics in local regions in a concise solution. Our extensive experiments demonstrate the consistent superiority of ReCAP over existing methods across various datasets and wild scenarios. The code will be publicly available.
Paperid:1964
Authors:Sanjeev Raja · Martin Šípka · Michael Psenka · Tobias Kreiman · Michal Pavelka · Aditi Krishnapriyan
Abstract:
Transition path sampling (TPS), which involves finding probable paths connecting two points on an energy landscape, remains a challenge due to the complexity of real-world atomistic systems. Current machine learning approaches rely on expensive training procedures and under-utilize growing quantities of atomistic data, limiting scalability and generalization. Generative models of atomistic conformational ensembles sample temporally independent states from energy landscapes, but their application to TPS remains mostly unexplored. In this work, we address TPS by interpreting candidate paths as trajectories sampled from stochastic dynamics induced by the learned score function of generative models, namely denoising diffusion and flow matching. Under these dynamics, finding high-likelihood transition paths becomes equivalent to minimizing the Onsager-Machlup (OM) action functional, enabling us to repurpose pre-trained generative models for TPS in a zero-shot fashion. We demonstrate our approach on a Müller-Brown potential and several fast-folding proteins, where we obtain diverse, physically realistic transition pathways, as well as tetrapeptides, where we demonstrate successful TPS on systems not seen by the generative model during training. Our method can be easily incorporated into new generative models, making it practically relevant as models continue to scale and improve.
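For reference, for overdamped dynamics $dx = -\nabla U(x)\,dt + \sqrt{2\varepsilon}\,dW$, the Onsager-Machlup action takes, up to additive constants and a curvature (Laplacian) correction term, the familiar form

```latex
S_{\mathrm{OM}}[x] \;=\; \frac{1}{4\varepsilon} \int_0^T \left\lVert \dot{x}(t) + \nabla U\bigl(x(t)\bigr) \right\rVert^2 \, dt ,
```

so minimizing it favors paths that follow the drift $-\nabla U$ except where the transition forces them uphill.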
Paperid:1965
Authors:Ganyu Wang · Jinjie Fang · Maxwell (Juncheng) Yin · Bin Gu · Xi Chen · Boyu Wang · Yi Chang · Charles X. Ling
Abstract:
Black-Box Discrete Prompt Learning (BDPL) is a prompt-tuning method that optimizes discrete prompts without accessing model parameters or gradients, making prompt tuning on a cloud-based Large Language Model (LLM) feasible. Adapting Federated Learning (FL) to BDPL could further enhance prompt tuning performance by leveraging data from diverse sources. However, all previous research on federated black-box prompt tuning has neglected the substantial query cost associated with the cloud-based LLM service. To address this gap, we conducted a theoretical analysis of query efficiency within the context of federated black-box prompt tuning. Our findings revealed that degrading FedAvg to activate only one client per round, a strategy we call \textit{FedOne}, enables optimal query efficiency in federated black-box prompt learning. Building on this insight, we proposed the FedOne framework, a federated black-box discrete prompt learning method designed to maximize query efficiency when interacting with cloud-based LLMs. We conducted numerical experiments on various aspects of our framework, demonstrating a significant improvement in query efficiency, which aligns with our theoretical results.
Paperid:1966
Authors:Haebin Shin · ji · Xiao Liu · Yeyun Gong
Abstract:
Using large teacher models to guide the training of smaller student models has become the prevailing paradigm for efficient and effective learning. However, vocabulary mismatches between teacher and student language models pose significant challenges in language modeling, resulting in divergent token sequences and output distributions. To overcome these limitations, we propose Vocabulary-agnostic Teacher Guided Language Modeling (VocAgnoLM), a novel approach that bridges the gap caused by vocabulary mismatch through two key methods: (1) Token-level Lexical Alignment, which aligns token sequences across mismatched vocabularies, and (2) Teacher Guided Loss, which leverages the teacher model's loss to guide effective student training. We demonstrate its effectiveness in language modeling with a 1B student model using various 7B teacher models with different vocabularies. Notably, with Qwen2.5-Math-Instruct, a teacher model sharing only about 6% of its vocabulary with TinyLlama, VocAgnoLM achieves a 46% performance improvement compared to naive continual pretraining. Furthermore, we demonstrate that VocAgnoLM consistently benefits from stronger teacher models, providing a robust solution to vocabulary mismatches in language modeling.
Paperid:1967
Authors:Marvin Li · Aayush Karan · Sitan Chen
Abstract:
Large language models can exhibit undesirable and unexpected behavior in the blink of an eye. In a recent Anthropic demo, Claude switched from coding to Googling pictures of Yellowstone, and these sudden shifts in behavior have also been observed in reasoning patterns and jailbreaks. This phenomenon is not unique to autoregressive models: in diffusion models, key features of the final output are decided in narrow “critical windows” of the generation process. In this work we develop a simple, unifying theory to explain this phenomenon. We show that it emerges generically as the generation process localizes to a subpopulation of the distribution it models. While critical windows have been studied at length in diffusion models, existing theory heavily relies on strong distributional assumptions and the particulars of Gaussian diffusion. In contrast to existing work, our theory (1) applies to autoregressive and diffusion models; (2) makes no distributional assumptions; (3) quantitatively improves previous bounds even when specialized to diffusions; and (4) requires basic tools and no Girsanov or statistical-physics-based machinery.
Paperid:1968
Authors:Haoran Luo · Haihong E · Yikai Guo · Qika Lin · Xiaobao Wu · Xinyu Mu · Wenhao Liu · Meina Song · Yifan Zhu · Anh Tuan Luu
Abstract:
Knowledge Base Question Answering (KBQA) aims to answer natural language questions with a large-scale structured knowledge base (KB). Despite advancements with large language models (LLMs), KBQA still faces challenges in weak KB awareness, imbalance between effectiveness and efficiency, and high reliance on annotated data. To address these challenges, we propose KBQA-o1, a novel agentic KBQA method with Monte Carlo Tree Search (MCTS). It introduces a ReAct-based agent process for stepwise logical form generation with KB environment exploration. Moreover, it employs MCTS, a heuristic search method driven by policy and reward models, to balance agentic exploration's performance and search space. With heuristic exploration, KBQA-o1 generates high-quality annotations for further improvement by incremental fine-tuning. Experimental results show that KBQA-o1 outperforms previous low-resource KBQA methods with limited annotated data, boosting the Llama-3.1-8B model's GrailQA F1 performance to 78.5%, compared to 48.5% for the previous state-of-the-art method with GPT-3.5-turbo.
Paperid:1969
Authors:Rohan Ghuge · Vidya Muthukumar · Sahil Singla
Abstract:
We study *online multicalibration*, a framework for ensuring calibrated predictions across multiple groups in adversarial settings, across $T$ rounds. Although online calibration is typically studied in the $\ell_1$ norm, prior approaches to online multicalibration have taken the indirect approach of obtaining rates in other norms (such as $\ell_2$ and $\ell_{\infty}$) and then transferred these guarantees to $\ell_1$ at additional loss. In contrast, we propose a direct method that achieves improved and oracle-efficient rates of $\widetilde{\mathcal{O}}(T^{-1/3})$ and $\widetilde{\mathcal{O}}(T^{-1/4})$ respectively, for online $\ell_1$-multicalibration. Our key insight is a novel reduction of online $\ell_1$-multicalibration to an online learning problem with product-based rewards, which we refer to as *online linear-product optimization* ($\mathtt{OLPO}$). To obtain the improved rate of $\widetilde{\mathcal{O}}(T^{-1/3})$, we introduce a linearization of $\mathtt{OLPO}$ and design a no-regret algorithm for this linearized problem. Although this method guarantees the desired sublinear rate (nearly matching the best rate for online calibration), it becomes computationally expensive when the group family $\mathcal{H}$ is large or infinite, since it enumerates all possible groups. To address scalability, we propose a second approach to $\mathtt{OLPO}$ that makes only a polynomial number of calls to an offline optimization (*multicalibration evaluation*) oracle, resulting in *oracle-efficient* online $\ell_1$-multicalibration with a corresponding rate of $\widetilde{\mathcal{O}}(T^{-1/4})$. Our framework also extends to certain infinite families of groups (e.g., all linear functions on the context space) by exploiting a $1$-Lipschitz property of the $\ell_1$-multicalibration error with respect to $\mathcal{H}$.
Paperid:1970
Authors:Harvineet Singh · Fan Xia · Alexej Gossmann · Andrew Chuang · Julian Hong · Jean Feng
Abstract:
Machine learning (ML) models frequently experience performance degradation when deployed in new contexts. Such degradation is rarely uniform: some subgroups may suffer large performance decay while others may not. Understanding where and how large differences in performance arise is critical for designing targeted corrective actions that mitigate decay for the most affected subgroups while minimizing any unintended effects. Current approaches do not provide such detailed insight, as they either (i) explain how average performance shifts arise or (ii) identify adversely affected subgroups without insight into how this occurred. To this end, we introduce a Subgroup-scanning Hierarchical Inference Framework for performance drifT (SHIFT). SHIFT first asks "Is there any subgroup with unacceptably large performance decay due to covariate/outcome shifts?" (Where?) and, if so, dives deeper to ask "Can we explain this using more detailed variable(subset)-specific shifts?" (How?). In real-world experiments, we find that SHIFT identifies interpretable subgroups affected by performance decay, and suggests targeted actions that effectively mitigate the decay.
Paperid:1971
Authors:Dongchao Yang · Songxiang Liu · Haohan Guo · Jiankun Zhao · Yuanyuan Wang · Helin Wang · Zeqian Ju · Xubo Liu · Xueyuan Chen · Xu Tan · Xixin Wu · Helen M Meng
Abstract:
Recent advancements in audio language models have underscored the pivotal role of audio tokenization, which converts audio signals into discrete tokens, thereby facilitating the application of language model architectures to the audio domain. In this study, we introduce ALMTokenizer, a novel low-bitrate and semantically rich audio codec tokenizer for audio language models. Prior methods, such as Encodec, typically encode individual audio frames into discrete tokens without considering the use of context information across frames. Unlike these methods, we introduce a novel query-based compression strategy to capture holistic information with a set of learnable query tokens by explicitly modeling the context information across frames. This design not only enables the codec model to capture more semantic information but also encodes the audio signal with fewer token sequences. Additionally, to enhance the semantic information in audio codec models, we introduce the following: (1) a masked autoencoder (MAE) loss, (2) vector quantization based on semantic priors, and (3) an autoregressive (AR) prediction loss. As a result, ALMTokenizer achieves competitive reconstruction performance relative to state-of-the-art approaches while operating at a lower bitrate. Within the same audio language model framework, ALMTokenizer outperforms previous tokenizers in audio understanding and generation tasks.\footnote{https://almtokenizer.github.io/demo/}
Paperid:1972
Authors:Ning LU · Shengcai Liu · Jiahao Wu · Weiyu CHEN · Zhirui Zhang · Yew Soon ONG · Qi Wang · Ke Tang
Abstract:
Large language models (LLMs) have shown great potential as general-purpose AI assistants across various domains. To fully leverage this potential in specific applications, many companies provide fine-tuning API services, enabling users to upload their own data for LLM customization. However, fine-tuning services introduce a new safety threat: user-uploaded data, whether harmful or benign, can break the model’s alignment, leading to unsafe outputs. Moreover, existing defense methods struggle to address the diversity of fine-tuning datasets (e.g., varying sizes, tasks), often sacrificing utility for safety or vice versa. To address this issue, we propose Safe Delta, a safety-aware post-training defense method that adjusts the delta parameters (i.e., the parameter change before and after fine-tuning). Specifically, Safe Delta estimates the safety degradation, selects delta parameters to maximize utility while limiting overall safety loss, and applies a safety compensation vector to mitigate residual safety loss. Through extensive experiments on four diverse datasets with varying settings, our approach consistently preserves safety while ensuring that the utility gain from benign datasets remains unaffected.
Paperid:1973
Authors:Jiecheng Lu · Xu Han · Yan Sun · Shihao Yang
Abstract:
We propose a Weighted Autoregressive Varying gatE (WAVE) attention mechanism equipped with both Autoregressive (AR) and Moving-average (MA) components. It can adapt to various attention mechanisms, enhancing and decoupling their ability to capture long-range and local temporal patterns in time series data. In this paper, we first demonstrate that, for the time series forecasting (TSF) task, the previously overlooked decoder-only autoregressive Transformer model can achieve results comparable to the best baselines when appropriate tokenization and training methods are applied. Moreover, inspired by the ARMA model from statistics and recent advances in linear attention, we introduce the full ARMA structure into existing autoregressive attention mechanisms. By using an indirect MA weight generation method, we incorporate the MA term while maintaining the time complexity and parameter size of the underlying efficient attention models. We further explore how indirect parameter generation can produce implicit MA weights that align with the modeling requirements for local temporal impacts. Experimental results show that WAVE attention incorporating the ARMA structure consistently improves the performance of various AR attentions on TSF tasks, achieving state-of-the-art results.
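The ARMA structure borrowed from statistics combines an autoregressive term on past values with a moving-average term on past residuals; a minimal ARMA(1,1) one-step forecaster makes the decoupling concrete (a statistical illustration only, not the WAVE attention mechanism; the coefficients and series are hypothetical):

```python
def arma_forecast(series, phi=0.6, theta=0.3):
    """One-step-ahead ARMA(1,1) forecasts:
        y_hat[t] = phi * y[t-1] + theta * eps[t-1],
    where eps[t-1] = y[t-1] - y_hat[t-1] is the previous residual.
    The AR term carries long-range structure; the MA term corrects
    for recent local shocks."""
    preds = [0.0]                     # no forecast for the first point
    eps_prev = series[0] - preds[0]   # residual at t = 0
    for t in range(1, len(series)):
        pred = phi * series[t - 1] + theta * eps_prev
        preds.append(pred)
        eps_prev = series[t] - pred   # residual observed at time t
    return preds

series = [1.0, 1.2, 0.9, 1.1, 1.0]
print([round(p, 3) for p in arma_forecast(series)])
# [0.0, 0.9, 0.81, 0.567, 0.82]
```

In WAVE, analogous AR and MA roles are played by attention terms with indirectly generated MA weights rather than fixed scalar coefficients.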
Paperid:1974
Authors:Rylan Schaeffer · Joshua Kazdan · John Hughes · Jordan Juravsky · Sara Price · Aengus Lynch · Erik Jones · Robert Kirk · Azalia Mirhoseini · Sanmi Koyejo
Abstract:
Recent research across mathematical problem solving, proof assistant programming and multimodal jailbreaking documents a striking finding: when (multimodal) language models tackle a suite of tasks with multiple attempts per task -- succeeding if any attempt is correct -- the negative log of the average success rate scales as a power law in the number of attempts. In this work, we identify an apparent puzzle: a simple mathematical calculation predicts that on each problem, the failure rate should fall exponentially with the number of attempts. We confirm this prediction empirically, raising a question: from where does aggregate polynomial scaling emerge? We then answer this question by demonstrating that per-problem exponential scaling can be made consistent with aggregate polynomial scaling if the distribution of single-attempt success probabilities is heavy tailed, such that a small fraction of tasks with extremely low success probabilities collectively warp the aggregate success trend into a power law -- even as each problem scales exponentially on its own. We further demonstrate that this distributional perspective explains previously observed deviations from power law scaling, and provides a simple method for forecasting the power law exponent with an order of magnitude lower relative error, or equivalently, ${\sim}2-4$ orders of magnitude less inference compute. Overall, our work contributes to a better understanding of how neural language model performance improves with scaling inference compute and the development of scaling-predictable evaluations of (multimodal) language models.
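The distributional mechanism described above is easy to reproduce numerically: give every problem an exactly exponential per-problem failure rate, but draw single-attempt success probabilities from a heavy-tailed family, and the aggregate -log(average success rate) decays only polynomially in the number of attempts. A self-contained sketch (the probability distribution below is an arbitrary heavy-tailed choice, not the paper's fitted one):

```python
import math

# Heavy-tailed single-attempt success probabilities: p = u^4 with u
# uniform on (0, 1) puts a small fraction of problems at extremely
# low p. (i + 0.5)/n keeps every p strictly inside (0, 1).
n = 10_000
probs = [((i + 0.5) / n) ** 4 for i in range(n)]

def aggregate_neg_log_success(k):
    """-log of the average success rate across problems after k
    attempts; per problem, failure is (1 - p)^k, exactly exponential."""
    avg = sum(1 - (1 - p) ** k for p in probs) / n
    return -math.log(avg)

for k in (1, 10, 100, 1000):
    print(k, round(aggregate_neg_log_success(k), 3))
# Each problem's failure rate falls exponentially in k, yet for this
# p-distribution the aggregate behaves like -log(avg success) ~ k^(-1/4):
# a power law, exactly the warping effect the abstract describes.
```

The k^(-1/4) exponent is specific to the u^4 choice; heavier or lighter tails give different power-law exponents, which is what makes the exponent forecastable from the tail.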
Paperid:1975
Authors:Michael Albergo · Eric Vanden-Eijnden
Abstract:
We introduce the Non-Equilibrium Transport Sampler (NETS), an algorithm for sampling from unnormalized probability distributions. NETS builds on non-equilibrium sampling strategies that transport a simple base distribution into the target distribution in finite time, as pioneered in Neal's annealed importance sampling (AIS). In the continuous-time setting, this transport is accomplished by evolving walkers using Langevin dynamics with a time-dependent potential, while simultaneously evolving importance weights to debias their solutions following Jarzynski's equality. The key innovation of NETS is to add to the dynamics a learned drift term that offsets the need for these corrective weights by minimizing their variance through an objective that can be estimated without backpropagation and provably bounds the Kullback-Leibler divergence between the estimated and target distributions. NETS provides unbiased samples and features a tunable diffusion coefficient that can be adjusted after training to maximize the effective sample size. In experiments on standard benchmarks, high-dimensional Gaussian mixtures, and statistical lattice field theory models, NETS shows compelling performance.
Paperid:1976
Authors:Shira Vansover-Hager · Tomer Koren · Roi Livni
Abstract:
We study the out-of-sample performance of multi-pass stochastic gradient descent (SGD) in the fundamental stochastic convex optimization (SCO) model. While one-pass SGD is known to achieve an optimal $\Theta(1/\sqrt{n})$ excess population loss given a sample of size $n$, much less is understood about the multi-pass version of the algorithm which is widely used in practice. Somewhat surprisingly, we show that in the general non-smooth case of SCO, just a few epochs of SGD can already hurt its out-of-sample performance significantly and lead to overfitting. In particular, using a step size $\eta = \Theta(1/\sqrt{n})$, which gives the optimal rate after one pass, can lead to population loss as large as $\Omega(1)$ after just one additional pass. More generally, we show that the population loss from the second pass onward is of the order $\Theta(1/(\eta T) + \eta \sqrt{T})$, where $T$ is the total number of steps. These results reveal a certain phase transition in the out-of-sample behavior of SGD after the first epoch, as well as a sharp separation between the rates of overfitting in the smooth and non-smooth cases of SCO. Additionally, we extend our results to with-replacement SGD, proving that the same asymptotic bounds hold after $O(n \log n)$ steps. Finally, we also prove a lower bound of $\Omega(\eta \sqrt{n})$ on the generalization gap of one-pass SGD in dimension $d = \smash{\widetilde O}(n)$, improving on recent results of Koren et al. (2022) and Schliserman et al. (2024).
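To make the stated $\Theta(1/(\eta T) + \eta \sqrt{T})$ rate concrete, one can balance its two terms over the step size (an algebraic aside on the displayed rate, not an additional claim from the paper):

```latex
\frac{d}{d\eta}\left(\frac{1}{\eta T} + \eta\sqrt{T}\right)
= -\frac{1}{\eta^2 T} + \sqrt{T} = 0
\;\Longrightarrow\;
\eta_\star = T^{-3/4},
\qquad
\frac{1}{\eta_\star T} + \eta_\star\sqrt{T} = 2\,T^{-1/4},
```

so no fixed step size drives the multi-pass population loss below order $T^{-1/4}$ under this bound.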
Paperid:1977
Authors:Audrey Huang · Adam Block · Qinghua Liu · Nan Jiang · Akshay Krishnamurthy · Dylan Foster
Abstract:
Recent work on inference-time alignment has established benefits of increasing inference-time computation in language models, but naively scaling compute through techniques like Best-of-N sampling can cause performance to degrade due to reward hacking. Toward a theoretical understanding of how to best leverage additional computation, we formalize inference-time alignment as improving a pre-trained policy’s responses for a prompt of interest, given access to an imperfect reward model. We analyze the performance of inference-time alignment algorithms in terms of (i) response quality, and (ii) compute, and provide new results that highlight the importance of the pre-trained policy’s coverage over high-quality responses for performance and compute scaling: (1) We show that Best-of-N alignment with an ideal N can achieve optimal performance under stringent notions of coverage, but provably suffers from reward hacking when N is large, and fails to achieve tight guarantees under more realistic coverage conditions; (2) We introduce InferenceTimePessimism, a new algorithm which mitigates reward hacking through deliberate use of inference-time compute, implementing pessimism in the face of uncertainty; we prove that its performance is optimal and scaling-monotonic, i.e., does not degrade as N increases. We complement our theoretical results with experiments that demonstrate the practicality of our algorithm across a variety of tasks and models.
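Best-of-N sampling as analyzed above is simple to state in code: draw N responses from the base policy and return the one an imperfect reward model scores highest. A minimal sketch with hypothetical Gaussian true qualities and a noisy proxy reward (the reward hacking the paper proves requires more structure than this toy; here the noisy proxy merely dilutes the gain from larger N):

```python
import random

def best_of_n(n, rng):
    """Sample n candidates; each has a true quality and a proxy score
    (true quality + reward-model error). Select by the proxy, as an
    imperfect reward model would, and report the selected true quality."""
    cands = []
    for _ in range(n):
        true_q = rng.gauss(0.0, 1.0)
        proxy = true_q + rng.gauss(0.0, 2.0)   # noisy reward model
        cands.append((proxy, true_q))
    proxy_best, true_of_best = max(cands)       # argmax over proxy score
    return true_of_best

rng = random.Random(0)
trials = 2000
avgs = {}
for n in (1, 4, 64):
    avgs[n] = sum(best_of_n(n, rng) for _ in range(trials)) / trials
    print(n, round(avgs[n], 2))
# With noise std twice the quality std, the top proxy score at large n
# is mostly reward-model error, so average true quality grows only
# slowly in n despite the 64x increase in compute.
```

Capping the true quality or correlating the error with it would turn this dilution into the outright degradation that motivates InferenceTimePessimism.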
Paperid:1978
Authors:Benjamin Leblanc · Mathieu Bazinet · Nathaniel D'Amours · Alexandre Drouin · Pascal Germain
Abstract:
Both the PAC-Bayesian and Sample Compress learning frameworks have been shown instrumental for deriving tight (non-vacuous) generalization bounds for neural networks. We leverage these results in a meta-learning scheme, relying on a hypernetwork that outputs the parameters of a downstream predictor from a dataset input. The originality of our approach lies in the investigated hypernetwork architectures that encode the dataset before decoding the parameters: (1) a PAC-Bayesian encoder that expresses a posterior distribution over a latent space, (2) a Sample Compress encoder that selects a small sample of the dataset input along with a message from a discrete set, and (3) a hybrid between both approaches motivated by a new Sample Compress theorem handling continuous messages. The latter theorem exploits the pivotal information transiting at the encoder-decoder junction in order to compute generalization guarantees for each downstream predictor obtained by our meta-learning scheme.
Paperid:1979
Authors:Yuhan Ye · Ying Cui · Jingyi Wang
Abstract:
We propose an online adaptive sampling algorithm for solving stochastic nonsmooth difference-of-convex (DC) problems under time-varying distributions. At each iteration, the algorithm relies solely on data generated from the current distribution and employs distinct adaptive sampling rates for the convex and concave components of the DC function, a novel design guided by our theoretical analysis. We show that, under proper conditions on the convergence of distributions, the algorithm converges subsequentially to DC critical points almost surely. Furthermore, the sample size requirement of our proposed algorithm matches the results achieved in the smooth case or when a measurable subgradient selector is available, both under static distributions. A key element of this analysis is the derivation of a novel $O(\sqrt{p/n})$ pointwise convergence rate (modulo logarithmic factors) for the sample average approximation of subdifferential mappings, where $p$ is the dimension of the variable and $n$ is the sample size -- a result of independent interest. Numerical experiments confirm that the proposed algorithm is both efficient and effective for addressing stochastic nonsmooth problems.
Paperid:1980
Authors:Yunxia Wang · Fuyuan CAO · Kui Yu · Jiye Liang
Abstract:
Federated causal structure learning aims to discover causal relationships between variables from data stored on individual clients, with privacy concerns. Most existing methods assume identical variable sets across clients and present federated strategies for aggregating local updates on the central server. However, in practice, variables observed by clients are often not identical but overlapping, and non-overlapping variables may introduce spurious dependencies. Moreover, aggregating local updates typically reflects only the overall quality of a local graph, ignoring the varying importance of relationships between variables within the graph. In this paper, we consider federated causal structure learning with non-identical variable sets, aiming to design an effective strategy for aggregating “correct” and “good” relationships between variables during collaborative learning. Specifically, we first construct theories to detect spurious dependencies, extracting “correct” relationships for each local graph. Furthermore, we define stable relationships as those that are both “correct” and “good” across multiple graphs, and finally, we propose a two-level priority selection strategy to handle spurious dependencies and aggregate local updates, obtaining a global causal graph over the integrated variables. Experimental results on synthetic, benchmark and real-world data demonstrate the effectiveness of our method.
Paperid:1981
Authors:Zongzhao Li · Jiacheng Cen · Bing Su · Wenbing Huang · Tingyang Xu · Yu Rong · Deli Zhao
Abstract:
Accurately predicting 3D structures and dynamics of physical systems remains a critical challenge in scientific applications, due not only to the need for $\mathrm{E}(3)$-equivariance but also to the requirement to integrate broader external knowledge. Although existing approaches that rely on geometric Graph Neural Networks (GNNs) effectively enforce $\mathrm{E}(3)$-equivariance, they often fall short in leveraging broader external information. Conversely, while the direct application of Large Language Models (LLMs) can incorporate such knowledge, they lack spatial reasoning and equivariance properties. In our proposed EquiLLM, we adopt an architecture separation strategy that employs a pretrained LLM as a sophisticated invariant feature processor and a geometric GNN as an equivariant encoder to handle equivariant features. An equivariant adapter then produces the final invariant \& equivariant output and ensures $\mathrm{E}(3)$-equivariance while leveraging rich domain knowledge from LLMs. Experimental results demonstrate that EquiLLM delivers significant improvements across molecular dynamics simulation, human motion simulation, and antibody design, highlighting its promising generalizability.
Paperid:1982
Authors:Yongjian Zhong · Liao Zhu · Hieu Vu · Bijaya Adhikari
Abstract:
Subgraph neural networks have recently gained prominence for various subgraph-level predictive tasks. However, existing methods either \emph{1)} apply simple standard pooling over graph convolutional networks, failing to capture essential subgraph properties, or \emph{2)} rely on rigid subgraph definitions, leading to suboptimal performance. Moreover, these approaches fail to model long-range dependencies both between and within subgraphs—a critical limitation, as many real-world networks contain subgraphs of varying sizes and connectivity patterns. In this paper, we propose a novel implicit subgraph neural network, the first of its kind, designed to capture dependencies across subgraphs. Our approach also integrates label-aware subgraph-level information. We formulate implicit subgraph learning as a bilevel optimization problem and develop a provably convergent algorithm that requires fewer gradient estimations than standard bilevel optimization methods. We evaluate our approach on real-world networks against state-of-the-art baselines, demonstrating its effectiveness and superiority.
Paperid:1983
Authors:Carlota Parés Morlans · Michelle Yi · Claire Chen · Sarah A Wu · Rika Antonova · Tobias Gerstenberg · Jeannette Bohg
Abstract:
Tasks that involve complex interactions between objects with unknown dynamics make planning before execution difficult. Instead, these tasks require agents to iteratively improve their actions after actively exploring causes and effects in the environment. For these types of tasks, we propose a method that leverages Bayesian optimization to reason about causality via a Physics-Informed Kernel, helping to guide an efficient search for the best next action. Experimental results on the Virtual Tools and Phyre physical reasoning benchmarks show that our method surpasses state-of-the-art results, requiring fewer actions to reach the goal.
Paperid:1984
Authors:Daeho Um · Sunoh Kim · Jiwoong Park · Jongin Lim · Seong Jin Ahn · Seulki Park
Abstract:
In this paper, we address learning tasks on graphs with missing features, enhancing the applicability of graph neural networks to real-world graph-structured data. We identify a critical limitation of existing imputation methods based on feature propagation: they produce channels in which values are nearly identical across nodes, and these low-variance channels contribute very little to performance in graph learning tasks. To overcome this issue, we introduce synthetic features that target the root cause of low-variance channel production, thereby increasing variance in these channels. By preventing propagation-based imputation methods from generating meaningless feature values shared across all nodes, our synthetic feature propagation scheme mitigates significant performance degradation, even under extreme missing rates. Extensive experiments demonstrate the effectiveness of our approach across various graph learning tasks with missing features, ranging from low to extremely high missing rates. Additionally, we provide both empirical evidence and theoretical proof to validate the low-variance problem.
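As a toy illustration of the propagation-based imputation that this abstract critiques (not the proposed synthetic-feature scheme), the sketch below iteratively replaces each missing entry by the average of its neighbors while clamping observed entries; the data layout is an illustrative assumption:

```python
def feature_propagation(adj, x, known, iters=50):
    """Impute missing scalar node features by neighbor averaging.

    adj:   adjacency list (adj[i] = neighbors of node i)
    x:     per-node feature, None where missing
    known: set of node ids whose features are observed (kept fixed)
    """
    vals = [xi if xi is not None else 0.0 for xi in x]
    for _ in range(iters):
        new = []
        for i, nbrs in enumerate(adj):
            if i in known:
                new.append(vals[i])  # clamp observed features
            else:
                new.append(sum(vals[j] for j in nbrs) / len(nbrs))
        vals = new
    return vals
```

On a path graph 0-1-2 with the middle value missing, the imputed value converges to the neighbor average 0.5; with many missing nodes, imputed entries drift toward a shared value, which is exactly the low-variance effect described above.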
Paperid:1985
Authors:Arka Pal · Rahul Thomas · Louai Zahran · Erica Choi · Akilesh Potti · Micah Goldblum
Abstract:
Recent advances in Large Language Models (LLMs) have led to widespread adoption of third-party inference services, raising critical privacy concerns. In this work, we introduce a novel reconstruction technique that can recover original prompts from hidden states with nearly perfect accuracy across multiple state-of-the-art LLMs in the increasingly important open-weights setting. Although the attack is conceptually simple, it has not -- to the best of our knowledge -- previously been described nor shown to work practically. Furthermore, our attack remains effective against various permutation and noise-based defenses, challenging assumptions about the security of previously proposed schemes. To address these vulnerabilities, we propose Cascade, a multi-party inference scheme that leverages sharding in the sequence dimension to retain privacy of the user input. Through theoretical analysis and empirical evaluation, we demonstrate that Cascade is secure against both our attack as well as previous methods, while maintaining computational and communication efficiency. Our findings highlight the importance of rigorous security analysis in privacy-preserving LLM inference and offer practical solutions for secure deployment.
Paperid:1986
Authors:Yuhang Li · Ruokai Yin · Donghyun Lee · Shiting Xiao · Priyadarshini Panda
Abstract:
We introduce GPTQv2, a novel fine-tuning-free quantization method for compressing large-scale transformer architectures. Unlike the previous GPTQ method, which independently calibrates each layer, we always match the quantized layer's output to the exact output in the full-precision model, resulting in a scheme that we call asymmetric calibration. Such a scheme can effectively reduce the quantization error accumulated in previous layers. We analyze this problem using optimal brain compression to derive a closed-form solution. The new solution explicitly minimizes the quantization error as well as the asymmetry alignment error. Furthermore, we utilize various techniques to parallelize the solution calculation, including channel parallelization, neuron decomposition, and Cholesky reformulation for matrix fusion. As a result, GPTQv2 is easy to implement, using only 20 more lines of code than GPTQ while improving its performance under low-bit quantization. Remarkably, on a single GPU, we quantize a 405B language transformer as well as EVA-02, the top-ranked vision transformer that achieves 90\% pretraining ImageNet accuracy. Code is provided in the Supplementary Material.
Paperid:1987
Authors:Pengkang Guo · Bruno Correia · Pierre Vandergheynst · Daniel Probst
Abstract:
Machine learning for protein modeling faces significant challenges due to proteins' inherently dynamic nature, yet most graph-based machine learning methods rely solely on static structural information. Recently, the growing availability of molecular dynamics trajectories provides new opportunities for understanding the dynamic behavior of proteins; however, computational methods for utilizing this dynamic information remain limited. We propose a novel graph representation that integrates both static structural information and dynamic correlations from molecular dynamics trajectories, enabling more comprehensive modeling of proteins. By applying relational graph neural networks (RGNNs) to process this heterogeneous representation, we demonstrate significant improvements over structure-based approaches across three distinct tasks: atomic adaptability prediction, binding site detection, and binding affinity prediction. Our results validate that combining static and dynamic information provides complementary signals for understanding protein-ligand interactions, offering new possibilities for drug design and structural biology applications.
Paperid:1988
Authors:Toon Vanderschueren · Tim Verdonck · M van der Schaar · Wouter Verbeke
Abstract:
Estimating causal effects is crucial in domains like healthcare, economics, and education. Despite advances in machine learning (ML) for estimating conditional average treatment effects (CATE), the practical adoption of these methods remains limited, due to the complexities of implementing, tuning, and validating them. To address these challenges, we formalize the search for an optimal ML pipeline for CATE estimation as a counterfactual Combined Algorithm Selection and Hyperparameter (CASH) optimization. We introduce AutoCATE, the first end-to-end, automated solution for CATE estimation. Unlike prior approaches that address only parts of this problem, AutoCATE integrates evaluation, estimation, and ensembling in a unified framework. AutoCATE enables comprehensive comparisons of different protocols, yielding novel insights into CATE estimation and a final configuration that outperforms commonly used strategies. To facilitate broad adoption and further research, we release AutoCATE as an open-source software package.
Paperid:1989
Authors:Yunzhen Feng · Ariel Kwiatkowski · Kunhao Zheng · Yaqi Duan · Julia Kempe
Abstract:
As large language models increasingly drive real-world applications, aligning them with human values becomes paramount. Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique, translating preference data into reward models when oracle human values remain inaccessible. In practice, RLHF mostly relies on approximate reward models, which may not consistently guide the policy toward maximizing the underlying human values. We propose Policy-Interpolated Learning for Aligned Feedback (PILAF), a novel response sampling strategy for preference labeling that explicitly aligns preference learning with maximizing the underlying oracle reward. PILAF is theoretically grounded, demonstrating optimality from both an optimization and a statistical perspective. The method is straightforward to implement and demonstrates strong performance in iterative and online RLHF settings where feedback curation is critical.
Paperid:1990
Authors:Shantanu Acharya · Fei Jia · Boris Ginsburg
Abstract:
Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 95-100% of accuracy.
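The two phases described above can be sketched with a minimal single-head toy (a hypothetical simplification, not the paper's implementation: tokens double as their own keys and values, and phase 1 runs one local attention round per block):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(q, keys, vals):
    # single-query scaled dot-product attention
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    w = softmax(scores)
    return [sum(wi * v[j] for wi, v in zip(w, vals)) for j in range(d)]

def star_attention(blocks, query):
    # Phase 1: each "host" encodes its context block with blockwise-local
    # attention only -- no cross-block communication is needed.
    caches = [[attend(tok, blk, blk) for tok in blk] for blk in blocks]
    # Phase 2: the query attends globally to all cached tokens.
    cached = [kv for cache in caches for kv in cache]
    return attend(query, cached, cached)
```

Phase 1 is embarrassingly parallel across hosts, which is where the efficiency gain comes from; only phase 2 touches the full cache.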
Paperid:1991
Authors:Simon Markus Geisler · Tom Wollschläger · M. Hesham Abdalla · Vincent Cohen-Addad · Johannes Gasteiger · Stephan Günnemann
Abstract:
To circumvent the alignment of large language models (LLMs), current optimization-based adversarial attacks usually craft adversarial prompts by maximizing the likelihood of a so-called affirmative response. An affirmative response is a manually designed start of a harmful answer to an inappropriate request. While it is often easy to craft prompts that yield a substantial likelihood for the affirmative response, the attacked model actually frequently does not complete the response in a harmful manner. Such objectives do not adapt to the attacked model and essentially ignore the fact that LLMs output a distribution over responses. If low attack success under these objectives is taken as a measure of robustness, the true robustness might be grossly overestimated. To alleviate these flaws, we propose an adaptive and semantic optimization objective over a population of responses that builds on the REINFORCE policy-gradient formalism. We utilize the state-of-the-art Greedy Coordinate Gradient (GCG) and Projected Gradient Descent (PGD) jailbreak algorithms for our novel adversarial optimization procedure and reveal that current LLMs are even more fragile than previously anticipated. For example, our objective doubles the attack success rate (ASR) on Llama3, and on its circuit-breaker defended version, it achieves an ASR of 61%.
Paperid:1992
Authors:Kunlun Xu · Xu Zou · Gang Hua · Jiahuan Zhou
Abstract:
Domain Incremental Learning (DIL) tackles the critical challenge of learning from non-stationary data streams across domains while retaining and utilizing past knowledge. Although prompt-based methods effectively store multi-domain knowledge in prompt parameters and obtain advanced performance through cross-domain prompt fusion, we reveal an intrinsic limitation: component-wise misalignment between domain-specific prompts leads to conflicting knowledge integration and degraded predictions. This arises from the random positioning of knowledge components within prompts, where irrelevant component fusion introduces interference. To address this, we propose Componential Prompt-Knowledge Alignment (KA-Prompt), a novel prompt-based DIL method that introduces component-aware prompt-knowledge alignment during training, significantly improving both the learning and inference capacity of the model. KA-Prompt operates in two phases: (1) Initial Componential Structure Configuring, where a set of old prompts containing knowledge relevant to the new domain are mined via greedy search, which is then exploited to initialize new prompts to achieve reusable knowledge transfer and establish intrinsic alignment between new and old prompts. (2) Online Alignment Preservation, which dynamically identifies the target old prompts and applies adaptive componential consistency constraints as new prompts evolve. Extensive experiments on DIL benchmarks show KA-Prompt obtains consistent improvements over state-of-the-art methods. Our source code will be publicly available.
Paperid:1993
Authors:KAIJUN LIU · Sijie Ruan · Liang Zhang · Cheng Long · Shuliang Wang · Liang Yu
Abstract:
Recovering human trajectories from incomplete or missing data is crucial for many mobility-based urban applications, e.g., urban planning, transportation, and location-based services. Existing methods mainly rely on recurrent neural networks or attention mechanisms. Though promising, they encounter limitations in capturing complex spatial-temporal dependencies in low-sampling trajectories. Recently, diffusion models have shown potential in content generation. However, most of the proposed methods generate content in continuous numerical representations, which cannot be directly adapted to human location trajectory recovery. In this paper, we introduce a conditional diffusion-based trajectory recovery method, namely DiffMove. It first transforms locations in trajectories into an embedding space, in which embedding denoising is performed, and then missing locations are recovered by an embedding decoder. DiffMove not only improves accuracy by introducing high-quality generative methods into trajectory recovery, but also carefully models the transition, periodicity, and temporal patterns in human mobility. Extensive experiments based on two representative real-world mobility datasets are conducted, and the results show significant improvements (an average of 11% in recall) over the best baselines.
Paperid:1994
Authors:Arwen Bradley · Preetum Nakkiran · David Berthelot · James Thornton · Joshua M Susskind
Abstract:
We study the theoretical foundations of composition in diffusion models, with a particular focus on out-of-distribution extrapolation and length-generalization. Prior work has shown that composing distributions via linear score combination can achieve promising results, including length-generalization in some cases (Du et al., 2023; Liu et al., 2022). However, our theoretical understanding of how and why such compositions work remains incomplete. In fact, it is not even entirely clear what it means for composition to "work". This paper starts to address these fundamental gaps. We begin by precisely defining one possible desired result of composition, which we call projective composition. Then, we investigate: (1) when linear score combinations provably achieve projective composition, (2) whether reverse-diffusion sampling can generate the desired composition, and (3) the conditions under which composition fails. Finally, we connect our theoretical analysis to prior empirical observations where composition has either worked or failed, for reasons that were unclear at the time.
Paperid:1995
Authors:Tiange Liu · Nikola Surjanovic · Miguel Biron-Lattes · Alexandre Bouchard-Côté · Trevor Campbell
Abstract:
Many common Markov chain Monte Carlo (MCMC) kernels can be formulated using a deterministic involutive proposal with a step size parameter. Selecting an appropriate step size is often a challenging task in practice; and for complex multiscale targets, there may not be one choice of step size that works well globally. In this work, we address this problem with a novel class of involutive MCMC methods---AutoStep MCMC---that selects an appropriate step size at each iteration, adapted to the local geometry of the target distribution. We prove that AutoStep MCMC is $\pi$-invariant and has other desirable properties under mild assumptions on the target distribution $\pi$ and involutive proposal. Empirical results examine the effect of various step size selection design choices, and show that AutoStep MCMC is competitive with state-of-the-art methods in terms of effective sample size per unit cost on a range of challenging target distributions.
Paperid:1996
Authors:Chao Yang · Shuting Cui · Yang Yang · Shuang Li
Abstract:
Understanding human mental states—such as intentions and desires—is crucial for natural AI-human collaboration. However, this is challenging because human actions occur irregularly over time, and the underlying mental states that drive these actions are unobserved. To tackle this, we propose a novel framework that combines a logic-informed temporal point process (TPP) with amortized variational Expectation-Maximization (EM). Our key innovation is integrating logic rules as priors to guide the TPP’s intensity function, allowing the model to capture the interplay between actions and mental events while reducing dependence on large datasets. To handle the intractability of mental state inference, we introduce a discrete-time renewal process to approximate the posterior. By jointly optimizing model parameters, logic rules, and inference networks, our approach infers entire mental event sequences and adaptively predicts future actions. Experiments on both synthetic and real-world datasets show that our method outperforms existing approaches in accurately inferring mental states and predicting actions, demonstrating its effectiveness in modeling human cognitive processes.
Paperid:1997
Authors:Filip Ekström Kelvinius · Zheng Zhao · Fredrik Lindsten
Abstract:
A recent line of research has exploited pretrained generative diffusion models as priors for solving Bayesian inverse problems. We contribute to this research direction by designing a sequential Monte Carlo method for linear-Gaussian inverse problems which builds on ``decoupled diffusion'', where the generative process is designed such that larger updates to the sample are possible. The method is asymptotically exact and we demonstrate the effectiveness of our Decoupled Diffusion Sequential Monte Carlo (DDSMC) algorithm on both synthetic data and image reconstruction tasks. Further, we demonstrate how the approach can be extended to discrete data.
Paperid:1998
Authors:Shenyuan Gao · Siyuan Zhou · Yilun Du · Jun Zhang · Chuang Gan
Abstract:
World models aim to learn action-controlled prediction models and have proven essential for the development of intelligent agents. However, most existing world models rely heavily on substantial action-labeled data and costly training, making it challenging to adapt to novel environments with various actions through limited interactions. This limitation hinders their applicability to broader domains. To overcome this challenge, we propose AdaWorld, an innovative world model learning approach that enables efficient adaptation. The key idea is to incorporate action information during the pretraining of world models. This is achieved by extracting latent actions from videos in a self-supervised manner, capturing the most critical transitions between frames. We then develop an autoregressive world model that conditions on these latent actions. This new learning paradigm enables highly adaptable world models, facilitating efficient transfer and learning of new actions even with limited demonstrations and interactions. We conduct comprehensive experiments across multiple environments to verify the efficacy of AdaWorld. Our model achieves superior performance in both simulation quality and visual planning. The code and model will be made public.
Paperid:1999
Authors:Xinnuo Xu · Rachel Lawrence · Kshitij Dubey · Atharva Pandey · Fabian Falck · Risa Ueno · Aditya Nori · Rahul Sharma · Amit Sharma · Javier Gonzalez
Abstract:
Recent Large Language Models (LLMs) have reported high accuracy on reasoning benchmarks. However, it is still unclear whether the observed results arise from true “reasoning” or from statistical recall of the training set. Inspired by the ladder of causation (Pearl, 2009) and its three levels (associations, interventions and counterfactuals), this paper introduces RE-IMAGINE: a framework to characterize a hierarchy of reasoning ability in LLMs, alongside an automated pipeline to generate problem variations at different levels of the hierarchy. By altering problems in an intermediate symbolic representation, RE-IMAGINE generates arbitrarily many problems that are not solvable using memorization alone. Moreover, the framework is general and can work across reasoning domains, including math, code, and logic. We demonstrate our framework on four widely-used benchmarks to evaluate several families of LLMs, and observe reductions in performance when the models are queried with problem variations. These assessments indicate a degree of reliance on statistical recall for past performance, and open the door to further research targeting skills across the reasoning hierarchy.
Paperid:2000
Authors:Shahaf E. Finder · Ron Shapira Weber · Moshe Eliasof · Oren Freifeld · Eran Treister
Abstract:
Message-Passing Neural Networks (MPNNs) have become a cornerstone for processing and analyzing graph-structured data. However, their effectiveness is often hindered by phenomena such as over-squashing, where long-range dependencies or interactions are inadequately captured and expressed in the MPNN output. This limitation mirrors the challenges of the Effective Receptive Field (ERF) in Convolutional Neural Networks (CNNs), where the theoretical receptive field is underutilized in practice. In this work, we show and theoretically explain the limited ERF problem in MPNNs. Furthermore, inspired by recent advances in ERF augmentation for CNNs, we propose an Interleaved Multiscale Message-Passing Neural Networks (IM-MPNN) architecture to address these problems in MPNNs. Our method incorporates a hierarchical coarsening of the graph, enabling message-passing across multiscale representations and facilitating long-range interactions without excessive depth or parameterization. Through extensive evaluations on benchmarks such as the Long-Range Graph Benchmark (LRGB), we demonstrate substantial improvements over baseline MPNNs in capturing long-range dependencies while maintaining computational efficiency.
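To make the coarsening idea concrete, here is a toy sketch (not the IM-MPNN architecture; the fixed cluster partition and global mixing rule are illustrative assumptions): one fine-level mean-aggregation pass has a one-hop receptive field, while a single coarse-level pass already lets information reach distant nodes.

```python
def message_pass(adj, x):
    # one round of mean-aggregation message passing (one-hop receptive field)
    return [(x[i] + sum(x[j] for j in adj[i])) / (1 + len(adj[i]))
            for i in range(len(x))]

def coarse_pass(adj, x, clusters):
    # clusters: node lists partitioning the graph (a fixed toy coarsening)
    # pool node features into cluster features, mix at the coarse level
    # (here: a single global average for simplicity), then broadcast back
    pooled = [sum(x[v] for v in c) / len(c) for c in clusters]
    mixed = sum(pooled) / len(pooled)
    out = list(x)
    for c in clusters:
        for v in c:
            out[v] = 0.5 * (x[v] + mixed)  # interleave fine and coarse signals
    return out
```

On a 4-node path with a feature spike at node 0, a single fine pass leaves node 3 untouched, whereas the coarse pass delivers a nonzero signal there, which is the long-range effect the abstract targets.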
Paperid:2001
Authors:Weiwei Liu
Abstract:
High-dimensional inference addresses scenarios where the dimension of the data approaches, or even surpasses, the sample size. In these settings, the regularized $M$-estimator is a common technique for inferring parameters. Negahban et al. (2009) establish a unified framework for deriving convergence rates in the context of high-dimensional scaling, demonstrating that estimation errors are confined within a restricted set, and revealing fast convergence rates. The key assumption underlying their work is the convexity of the loss function. However, many loss functions in high-dimensional contexts are nonconvex. This leads to the question: if the loss function is nonconvex, do estimation errors still fall within a restricted set? If so, can we recover convergence rates of the estimation error in nonconvex situations? This paper provides affirmative answers to these critical questions.
Paperid:2002
Authors:Yuchen Wang · Xuefeng Bai · Xinyang Chen · Xiucheng Li · Weili Guan · Liqiang Nie
Abstract:
Adapting vision-language models (VLMs) to downstream tasks with pseudolabels has gained increasing attention. A major obstacle is that the pseudolabels generated by VLMs tend to be imbalanced, leading to inferior performance. While existing methods have explored various strategies to address this, the underlying causes of imbalance remain insufficiently investigated. To fill this gap, we delve into imbalanced pseudolabels and identify two primary contributing factors: concept mismatch and concept confusion. To mitigate these two issues, we propose a novel framework incorporating concept alignment and confusion-aware calibrated margin mechanisms. The core of our approach lies in enhancing underperforming classes and promoting balanced predictions across categories, thus mitigating imbalance. Extensive experiments on six benchmark datasets with three learning paradigms demonstrate that the proposed method effectively enhances the accuracy and balance of pseudolabels, achieving a relative improvement of 6.29\% over the SoTA method. Our code is available at https://anonymous.4open.science/r/CAP-C642/
Paperid:2003
Authors:Meir Yossef Levi · Guy Gilboa
Abstract:
Contrastive Language-Image Pre-Training (CLIP) is highly instrumental in machine learning applications within a large variety of domains. We investigate the geometry of this embedding, which is still not well understood, and show that text and image reside on linearly separable ellipsoid shells, not centered at the origin. We explain the benefits of having this structure, which allows instances to be better embedded according to their uncertainty during contrastive training. Frequent concepts in the dataset yield more false negatives, inducing greater uncertainty. A new notion of conformity is introduced, which measures the average cosine similarity of an instance to any other instance within a representative data set. We show this measure can be accurately estimated by simply computing the cosine similarity to the modality mean vector. Furthermore, we find that CLIP's modality gap optimizes the matching of the conformity distributions of image and text.
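The conformity measure and its mean-vector shortcut can be sketched directly from their definitions (a minimal illustration with 2-D unit vectors; the paper's finding is that the shortcut closely tracks the exact average):

```python
import math

def cos(u, v):
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def conformity_exact(x, data):
    # definition: average cosine similarity of x to every instance in the set
    return sum(cos(x, y) for y in data) / len(data)

def conformity_approx(x, data):
    # shortcut: cosine similarity to the modality mean vector
    d = len(x)
    mean = [sum(y[j] for y in data) / len(data) for j in range(d)]
    return cos(x, mean)
```

For unit-norm embeddings the two quantities differ only by the norm of the mean vector, so they rank instances identically; the exact version costs O(n) per query while the shortcut is O(1) after a single mean computation.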
Paperid:2004
Authors:Xiyuan Wei · Ming Lin · Fanjiang Ye · Fengguang Song · Liangliang Cao · My T. Thai · Tianbao Yang
Abstract:
Learning with a reference model (LEAR) leverages a pretrained model to guide and enhance the training of a target model, often through strategic data selection or weighting. While effective in various contexts, including the training of large foundation models, its underlying principles remain insufficiently understood, leading to heuristic approaches that may not achieve optimal performance. In this work, we propose a theory-driven framework for learning with a reference model, rooted in distributionally robust optimization (DRO). Through a generalization analysis of DRO, we provide theoretical insights into why this approach improves generalization compared to training without a reference model. To the best of our knowledge, this is the first time such theoretical insights have been provided for LEAR, which significantly enhances our understanding and practice of LEAR. Building on these insights and the connection between contrastive learning and DRO, we introduce a novel method for contrastive language-image pretraining (CLIP) with a reference model, termed DRRho-CLIP. Our extensive experiments verify the theoretical insights and also show better performance of our method than existing heuristic approaches.
Paperid:2005
Authors:Alex Crane · Thomas Stanley · Blair D. Sullivan · Nate Veldt
Abstract:
We consider a framework for clustering edge-colored hypergraphs, where the goal is to cluster (equivalently, to color) objects based on the primary type of multiway interactions they participate in. One well-studied objective is to color nodes to minimize the number of unsatisfied hyperedges---hyperedges containing a node whose color does not match the hyperedge color. We motivate and present advances for two variants of this simple minimization problem. The first is new algorithms for maximizing satisfied edges, which is the same at optimality but much more challenging to approximate, and was previously studied only in graphs. We develop the first approximation algorithm for hypergraphs, and then refine it to improve the best-known approximation factor for graphs. The second direction considers new objective functions that incorporate notions of balance and fairness in terms of colors, providing new hardness results, approximations, and fixed-parameter tractability results.
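The baseline objective is easy to state in code; below is a minimal evaluator for the number of unsatisfied hyperedges under a given node coloring (the data layout is an illustrative assumption, not the paper's formulation):

```python
def unsatisfied(hyperedges, coloring):
    """Count hyperedges containing a node whose color differs from the edge color.

    hyperedges: list of (edge_color, node_list) pairs
    coloring:   dict mapping node id -> assigned color
    """
    return sum(1 for color, nodes in hyperedges
               if any(coloring[v] != color for v in nodes))
```

A hyperedge is satisfied only when every node in it carries the hyperedge's color, which is what makes the maximization variant mentioned above so much harder to approximate than the minimization one.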
Paperid:2006
Authors:Chaochen Gao · Xing W · Zijia Lin · Debing Zhang · Songlin Hu
Abstract:
Large language models (LLMs) with extended context windows have made significant strides, yet training them remains a challenge due to the scarcity of long documents. Existing methods tend to synthesize long-context data but lack a clear mechanism to reinforce long-range dependency modeling. To address this limitation, we propose NExtLong, a novel framework for synthesizing long-context data through Negative document Extension. NExtLong decomposes a document into multiple meta-chunks and extends the context by interleaving hard negative distractors retrieved from pretraining corpora. This approach compels the model to discriminate long-range dependent context from distracting content, enhancing its ability to model long-range dependencies. Extensive experiments demonstrate that NExtLong achieves significant performance improvements on the HELMET and RULER benchmarks compared to existing long-context synthesis approaches and leading models trained on non-synthetic long documents. These findings highlight NExtLong's ability to reduce reliance on non-synthetic long documents, making it an effective framework for developing advanced long-context LLMs.
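The interleaving step can be sketched as follows (a toy version: the real framework retrieves hard negatives from pretraining corpora, which is replaced here by a supplied distractor list; the function name and padding token are illustrative):

```python
def nextlong_extend(meta_chunks, distractors, per_chunk=2):
    """Interleave each meta-chunk with distractor chunks.

    The resulting sequence forces a model to track the true dependency
    chain (the meta-chunks) across intervening distracting content.
    """
    out, pool = [], iter(distractors)
    for chunk in meta_chunks:
        out.append(chunk)
        for _ in range(per_chunk):
            out.append(next(pool, "<pad>"))
    return out
```

The distance between consecutive meta-chunks grows with `per_chunk`, so the synthesized sequence length, and hence the long-range dependency span, is directly controllable.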
Paperid:2007
Authors:Juan Perdomo
Abstract:
Social predictions do not passively describe the future; they actively shape it. They inform actions and change individual expectations in ways that influence the likelihood of the predicted outcome and the distribution of observed features. Given this impact, to what extent is accurate social prediction possible? This question was discussed throughout the 20th century by authors like Merton, Morgenstern, Simon, and others who considered it a central issue in social science methodology. In this work, we provide a modern answer to this old problem. Using recent ideas from performative prediction and outcome indistinguishability, we establish that one can always efficiently predict (binary) social events accurately, regardless of how predictions influence data. While achievable, we also show that these predictions are often undesirable, highlighting the limitations of previous desiderata.
Paperid:2008
Authors:Marino Kühne · Panagiotis D. Grontas · Giulia De Pasquale · Giuseppe Belgioioso · Florian Dörfler · John Lygeros
Abstract:
Although social networks have expanded the range of ideas and information accessible to users, they are also criticized for amplifying the polarization of user opinions. Given the inherent complexity of these phenomena, existing approaches to counteract these effects typically rely on handcrafted algorithms and heuristics. We propose an elegant solution: we act on the network weights that model user interactions on social networks (e.g., frequency of communication), to optimize a performance metric (e.g., polarization reduction), while users’ opinions follow the classical Friedkin-Johnsen model. Our formulation gives rise to a challenging large-scale optimization problem with non-convex constraints, for which we develop a gradient-based algorithm. Our scheme is simple, scalable, and versatile, as it can readily integrate different, potentially non-convex, objectives. We demonstrate its merit by: (i) rapidly solving complex social network intervention problems with 3 million variables based on the Reddit and DBLP datasets; (ii) significantly outperforming competing approaches in terms of both computation time and disagreement reduction.
Paperid:2009
Authors:jiayue Liu · Zhengyang Zhou · Zhongchao Yi · Qihe Huang · Kuo Yang · Xu Wang · Yang Wang
Abstract:
Discovering regularities in spatiotemporal systems can benefit various scientific and social planning tasks. Current spatiotemporal learners usually train an independent model on specific source data, which leads to limited transferability across sources, where even correlated tasks require new designs and training. The key to increasing cross-domain knowledge is to enable collective intelligence and model evolution. In this paper, inspired by neuroscience theories, we theoretically derive the increased information boundary achieved by learning cross-domain collective intelligence and propose a Synaptic EVOlutional spatiotemporal network, SynEVO, which breaks model independence and enables cross-domain knowledge to be shared and aggregated. Specifically, we first re-order the sample groups to imitate human curriculum learning, and devise two complementary learners, an elastic common container and a task-independent extractor, to allow model growth and task-wise commonality and personality disentanglement. Then an adaptive dynamic coupler with a new difference metric determines whether the extracted personal representation should be incorporated into the common container to achieve model evolution under various domains. Experiments show that SynEVO improves generalization capacity by up to 42\% under cross-domain scenarios, and SynEVO provides a paradigm of NeuroAI for knowledge transfer and adaptation.
Paperid:2010
Authors:Mengzhu Wang · houcheng su · Jiao Li · Chuan Li · Nan Yin · Li Shen · Jingcai Guo
Abstract:
Semi-supervised learning (SSL) has made notable advancements in medical image segmentation (MIS), particularly in scenarios with limited labeled data, significantly enhancing data utilization efficiency. Previous methods primarily focus on complex training strategies to utilize unlabeled data but neglect the importance of graph structural information. Different from existing methods, we propose a graph-based clustering for semi-supervised medical image segmentation (GraphCL) by jointly modeling graph data structure in a unified deep model. The proposed GraphCL model enjoys several advantages. Firstly, to the best of our knowledge, this is the first work to model the data structure information for semi-supervised medical image segmentation (SSMIS). Secondly, to obtain clustered features across different graphs, we integrate both pairwise affinities between local image features and raw features as inputs. Extensive experimental results on three standard benchmarks show that the proposed GraphCL algorithm outperforms state-of-the-art semi-supervised medical image segmentation methods.
Paperid:2011
Authors:Zhipeng Wei · Yuqi Liu · N. Benjamin Erichson
Abstract:
Jailbreaking techniques trick Large Language Models (LLMs) into producing restricted outputs, posing a serious threat. One line of defense is to use another LLM as a Judge to evaluate the harmfulness of generated text. However, we reveal that these Judge LLMs are vulnerable to token segmentation bias, an issue that arises when delimiters alter the tokenization process, splitting words into smaller subtokens. This disrupts the embeddings of the entire sequence, reducing detection accuracy and allowing harmful content to be misclassified as safe. In this paper, we introduce Emoji Attack, a novel strategy that amplifies existing jailbreak prompts by exploiting token segmentation bias. Our method leverages in-context learning to systematically insert emojis into text before it is evaluated by a Judge LLM, inducing embedding distortions that significantly lower the likelihood of detecting unsafe content. Unlike traditional delimiters, emojis also introduce semantic ambiguity, making them particularly effective in this attack. Through experiments on state-of-the-art Judge LLMs, we demonstrate that Emoji Attack substantially reduces the "unsafe" prediction rate, bypassing existing safeguards.
Paperid:2012
Authors:Dapeng Jiang · Xiangzhe Kong · Jiaqi Han · Mingyu Li · Rui Jiao · Wenbing Huang · Stefano Ermon · Jianzhu Ma · Yang Liu
Abstract:
Cyclic peptides, characterized by geometric constraints absent in linear peptides, offer enhanced biochemical properties, presenting new opportunities to address unmet medical needs. However, designing target-specific cyclic peptides remains underexplored due to limited training data. To bridge this gap, we propose CP-Composer, a novel generative framework that enables zero-shot cyclic peptide generation via composable geometric constraints. Our approach decomposes complex cyclization patterns into unit constraints, which are incorporated into a diffusion model through geometric conditioning on nodes and edges. During training, the model learns from unit constraints and their random combinations in linear peptides, while at inference, novel constraint combinations required for cyclization are imposed as input. Experiments show that our model, despite being trained on linear peptides, is capable of generating diverse target-binding cyclic peptides, reaching success rates from 38\% to 84\% on different cyclization strategies.
Paperid:2013
Authors:Jonas Gehring · Kunhao Zheng · Jade Copet · Vegard Mella · Taco Cohen · Gabriel Synnaeve
Abstract:
Large language models (LLMs) deployed as agents solve user-specified tasks over multiple steps while keeping the required manual engagement to a minimum. Crucially, such LLMs need to ground their generations in any feedback obtained to reliably achieve desired outcomes. We propose an end-to-end reinforcement learning method for teaching models to leverage execution feedback in the realm of code synthesis, where state-of-the-art LLMs struggle to improve code iteratively compared to independent sampling. We benchmark on competitive programming tasks, where we achieve new state-of-the-art results with both small (8B parameters) and large (70B) models while reducing the number of samples required by an order of magnitude. Our analysis of inference-time behavior demonstrates that our method produces LLMs that effectively leverage automatic feedback over multiple steps.
Paperid:2014
Authors:Shaofeng Yin · Jialong Wu · Siqiao Huang · Xingjian Su · he · Jianye Hao · Mingsheng Long
Abstract:
Heterogeneity in sensors and actuators across environments poses a significant challenge to building large-scale pre-trained world models on top of such low-dimensional sensor information. In this work, we explore pre-training world models for heterogeneous environments by addressing key transfer barriers in both data diversity and model flexibility. We introduce UniTraj, a unified dataset comprising over one million trajectories from 80 environments, designed to scale data while preserving critical diversity. Additionally, we propose TrajWorld, a novel architecture capable of flexibly handling varying sensor and actuator information and capturing environment dynamics in-context. Pre-training TrajWorld on UniTraj demonstrates significant improvements in transition prediction and achieves a new state-of-the-art for off-policy evaluation. To the best of our knowledge, this work, for the first time, demonstrates the transfer benefits of world models across heterogeneous and complex control environments. The dataset and pre-trained model will be released to support future research.
Paperid:2015
Authors:Tao Zhang · Zhenhai Liu · Feipeng Qi · Yongjun Jiao · Tailin Wu
Abstract:
Multiphysics simulation, which models the interactions between multiple physical processes, and multi-component simulation of complex structures are critical in fields like nuclear and aerospace engineering. Previous studies often rely on numerical solvers or machine learning-based surrogate models to solve or accelerate these simulations. However, multiphysics simulations typically require integrating multiple specialized solvers, each responsible for evolving a specific physical process, into a coupled program, which introduces significant development challenges. Furthermore, existing numerical algorithms face difficulties when dealing with highly complex large-scale structures in multi-component simulations. Here we propose compositional Multiphysics and Multi-component Simulation with Diffusion models (MultiSimDiff) to overcome these challenges. During diffusion-based training, MultiSimDiff learns energy functions modeling the conditional probability of one physical process/component conditioned on other processes/components. At inference, MultiSimDiff generates coupled multiphysics solutions and multi-component structures by sampling from the joint probability distribution, achieved by composing the learned energy functions in a structured way. We test our method on three tasks. In the reaction-diffusion and nuclear thermal coupling problems, MultiSimDiff successfully predicts the coupling solution using decoupled data, while the surrogate model fails in the more complex second problem. For the thermal and mechanical analysis of the prismatic fuel element, MultiSimDiff trained for single-component prediction accurately predicts a larger structure with 64 components, reducing the relative error by 40.3% compared to the surrogate model.
Paperid:2016
Authors:Samuel Holt · Max Ruiz Luyten · Antonin Berthon · M van der Schaar
Abstract:
Constructing robust and flexible simulators is essential for safely asking “what if?” questions and assessing policy decisions in domains such as healthcare, logistics, and epidemic planning. Yet existing approaches often rely either on purely data-driven models, which cannot extrapolate beyond the historical distributions observed in the available sparse or partially observed datasets, or attempt to generate simulators solely via Large Language Models (LLMs), risking the compound effect of subtle inaccuracies and misalignment with empirical evidence. We propose a hybrid framework for automatic environment creation that integrates explicit domain knowledge, encoded and refined through an LLM, with a gradient-free optimization step to estimate fixed parameters aligned with empirical evidence. Concretely, an LLM-guided search loop identifies the simulator's structural components, such as submodules capturing different processes, causal relationships, and inter-component couplings. A gradient-free optimization procedure then fits these submodules to real-world data, accommodating non-differentiable, discrete, or stochastic elements. This design harnesses common-sense and domain-specific priors to mitigate data inefficiency and out-of-distribution brittleness, while remaining grounded in reality, being flexible to diverse requirements, and admitting system-level interventions. Our method provides a blueprint for building more reliable, causally informed, and uncertainty-aware simulators, paving the way for better decision-making in complex real-world systems.
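As a toy illustration of the gradient-free calibration step described above, the sketch below fits a simulator's fixed parameters to data by plain random search; the simulator, the bounds, and the MSE objective are illustrative assumptions, not the paper's actual components.

```python
import numpy as np

def random_search_fit(simulate, data, bounds, n_trials=2000, rng=0):
    """Gradient-free calibration by random search: sample parameter vectors
    uniformly from `bounds` and keep the one minimizing the mean squared
    error between the simulator output and the observed data."""
    rng = np.random.default_rng(rng)
    lo, hi = np.array(bounds, dtype=float).T
    best_params, best_err = None, np.inf
    for _ in range(n_trials):
        params = rng.uniform(lo, hi)
        err = float(np.mean((simulate(params) - data) ** 2))
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err
```

Because only forward simulations are needed, the same loop works for non-differentiable, discrete, or stochastic simulator components, which is the point of avoiding gradients here.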
Paperid:2017
Authors:Rui Luo · Zhixin Zhou
Abstract:
Conformal prediction provides a robust framework for generating prediction sets with finite-sample coverage guarantees, independent of the underlying data distribution. However, existing methods typically rely on a single conformity score function, which can limit the efficiency and informativeness of the prediction sets. In this paper, we present a novel approach that enhances conformal prediction for multi-class classification by optimally averaging multiple conformity score functions. Our method involves assigning weights to different score functions and employing various data splitting strategies. Additionally, our approach bridges concepts from conformal prediction and model averaging, offering a more flexible and efficient tool for uncertainty quantification in classification tasks. We provide a comprehensive theoretical analysis grounded in Vapnik–Chervonenkis (VC) theory, establishing finite-sample coverage guarantees and demonstrating the efficiency of our method. Empirical evaluations on benchmark datasets show that our weighted averaging approach consistently outperforms single-score methods by producing smaller prediction sets without sacrificing coverage.
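To make the idea of averaging conformity scores concrete, here is a minimal split-conformal sketch that combines two common scores with a fixed weight; the particular scores, the weight `w`, and the data-splitting scheme are illustrative assumptions, not the optimal weighting the paper derives.

```python
import numpy as np

def weighted_conformal_sets(probs_cal, y_cal, probs_test, w=0.5, alpha=0.1):
    """Split-conformal prediction sets from a weighted average of two
    conformity scores (illustrative choices):
      s1 = 1 - probability of the true class
      s2 = margin between the top class and the true class
    Higher scores mean "less conforming"."""
    n = len(y_cal)
    s1 = 1.0 - probs_cal[np.arange(n), y_cal]
    s2 = probs_cal.max(axis=1) - probs_cal[np.arange(n), y_cal]
    scores = w * s1 + (1.0 - w) * s2

    # Finite-sample-corrected quantile of the calibration scores.
    level = min(1.0, np.ceil((n + 1) * (1.0 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")

    sets = []
    for p in probs_test:
        s_all = w * (1.0 - p) + (1.0 - w) * (p.max() - p)
        sets.append(np.where(s_all <= q)[0])  # labels passing the cutoff
    return sets
```

Any convex combination of valid scores is again a valid score, so coverage is preserved for any fixed `w`; choosing `w` well is what shrinks the sets.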
Paperid:2018
Authors:Alon Arad · Saharon Rosset
Abstract:
Accurate and reliable probability predictions are essential for multi-class supervised learning tasks, where well-calibrated models enable rational decision-making. While isotonic regression has proven effective for binary calibration, its extension to multi-class problems via one-vs-rest calibration often produces suboptimal results, limiting its practical adoption. In this work, we propose novel isotonic normalization-aware techniques for multi-class calibration, grounded in natural and intuitive assumptions expected by practitioners. Unlike prior approaches, our methods inherently account for probability normalization by either incorporating normalization directly into the optimization process (NA-FIR) or modeling the problem as a cumulative bivariate isotonic regression (SCIR). Empirical evaluations on a variety of text and image classification datasets across different model architectures reveal that our approach consistently improves log loss and expected calibration error (ECE) metrics. These findings underscore the potential of our approach to enhance a-parametric multi-class calibration practices, offering an adaptable solution for real-world applications.
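For readers unfamiliar with the binary building block the abstract starts from, here is a minimal pool-adjacent-violators (PAV) sketch of isotonic regression; it illustrates the standard non-decreasing least-squares fit, not the proposed NA-FIR or SCIR methods.

```python
def pav(y):
    """Pool-adjacent-violators: least-squares non-decreasing fit to y."""
    out = []  # stack of [block_mean, block_count]
    for v in y:
        out.append([v, 1])
        # Merge adjacent blocks that violate monotonicity.
        while len(out) > 1 and out[-2][0] > out[-1][0]:
            m2, c2 = out.pop()
            m1, c1 = out.pop()
            out.append([(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2])
    return [m for m, c in out for _ in range(c)]
```

In binary calibration, `y` would be 0/1 labels sorted by classifier score; the fitted block means are the calibrated probabilities. The one-vs-rest extension runs this per class, which is exactly where the normalization problem the abstract targets arises.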
Paperid:2019
Authors:Junbiao Cui · Qin Yue · Jianqing Liang · Jiye Liang
Abstract:
Classification is a cornerstone of machine learning research. Most existing classifiers implicitly assume that the concepts corresponding to classes can be precisely defined. This notion diverges from the widely accepted understanding in cognitive science, which posits that real-world concepts are often inherently ambiguous. To bridge this gap, we propose a Human Cognition-Inspired Hierarchical Fuzzy Learning Machine (HC-HFLM), which leverages a novel hierarchical alignment loss to integrate rich class knowledge from the human knowledge system into the learning process. We further theoretically prove that minimizing this loss can align the hierarchical structure derived from data with that contained in the class knowledge, resulting in clear semantic meaning and high interpretability. Systematic experiments verify that the proposed method achieves significant gains in interpretability and generalization performance.
Paperid:2020
Authors:Seunghan Lee · Taeyoung Park · Kibok Lee
Abstract:
Channel identification (CID) refers to the ability to distinguish among individual channels in time series (TS) modeling. Motivated by the observation that TS models without CID often apply identical weights across channels, we highlight the importance of CID and propose Channel Normalization (CN), a simple yet effective normalization strategy that enhances CID by assigning distinct affine transformation parameters to each channel. We further extend CN in two ways: 1) Adaptive CN (ACN) dynamically adjusts parameters based on the input TS, improving adaptability in TS models, and 2) Prototypical CN (PCN) introduces a set of learnable prototypes instead of per-channel parameters, enabling applicability to datasets with an unknown or varying number of channels and facilitating use in TS foundation models. We demonstrate the effectiveness of CN and its variants by applying them to various TS models, achieving significant performance gains for both non-CID and CID models. In addition, we analyze the success of our approach from an information theory perspective. Code is available at anonymous.4open.science/r/channel_normalization-9F1C/.
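A minimal sketch of the core CN idea as stated in the abstract (normalize each channel, then apply a distinct per-channel affine transform); the exact placement within a TS model and the ACN/PCN variants are beyond this illustration.

```python
import numpy as np

def channel_norm(x, gamma, beta, eps=1e-5):
    """Channel Normalization sketch for a (channels, time) array:
    z-normalize each channel over time, then apply a distinct affine
    transform (gamma_c, beta_c) per channel, so channels are no longer
    treated interchangeably."""
    mean = x.mean(axis=1, keepdims=True)   # per-channel mean over time
    std = x.std(axis=1, keepdims=True)
    x_hat = (x - mean) / (std + eps)
    return gamma[:, None] * x_hat + beta[:, None]
```

In a trained model, `gamma` and `beta` would be learnable vectors with one entry per channel, which is what lets the model tell channels apart.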
Paperid:2021
Authors:Mingzhao Yang · Shangchao Su · Bin Li · Xiangyang Xue
Abstract:
In recent years, One-shot Federated Learning (OSFL) methods based on Diffusion Models (DMs) have garnered increasing attention due to their remarkable performance. However, most of these methods require the deployment of foundation models on client devices, which significantly raises the computational requirements and reduces their adaptability to heterogeneous client models compared to traditional FL methods. In this paper, we propose FedLMG, a heterogeneous one-shot Federated learning method with Local Model-Guided diffusion models. Briefly speaking, in FedLMG, clients do not need access to any foundation models but only train and upload their local models, which is consistent with traditional FL methods. On the clients, we employ classification loss and BN loss to capture the broad category features and detailed contextual features of the client distributions. On the server, based on the uploaded client models, we utilize backpropagation to guide the server’s DM in generating synthetic datasets that comply with the client distributions, which are then used to train the aggregated model. By using the locally trained client models as a medium to transfer client knowledge, our method significantly reduces the computational requirements on client devices and effectively adapts to scenarios with heterogeneous clients. Extensive quantitative and visualization experiments on three large-scale real-world datasets, along with theoretical analysis, demonstrate that the synthetic datasets generated by FedLMG exhibit comparable quality and diversity to the client datasets, which leads to an aggregated model that outperforms all compared methods and even the performance ceiling, further elucidating the significant potential of utilizing DMs in FL.
Paperid:2022
Authors:Michalis Titsias
Abstract:
Sparse variational Gaussian processes (GPs) construct tractable posterior approximations to GP models. At the core of these methods is the assumption that the true posterior distribution over training function values ${\bf f}$ and inducing variables ${\bf u}$ is approximated by a variational distribution that incorporates the conditional GP prior $p({\bf f} | {\bf u})$ in its factorization. While this assumption is considered fundamental, we show that for model training we can relax it through the use of a more general variational distribution $q({\bf f} | {\bf u})$ that depends on $N$ extra parameters, where $N$ is the number of training examples. In GP regression, we can analytically optimize the evidence lower bound over the extra parameters and express a tractable collapsed bound that is tighter than the previous bound. The new bound is also amenable to stochastic optimization and its implementation requires minor modifications to existing sparse GP code. Further, we also describe extensions to non-Gaussian likelihoods. On several datasets we demonstrate that our method can reduce bias when learning the hyperparameters and can lead to better predictive performance.
Paperid:2023
Authors:Jintao Tong · Ran Ma · Yixiong Zou · Guangyao Chen · Yuhua Li · Ruixuan Li
Abstract:
Cross-domain few-shot segmentation (CD-FSS) first pre-trains the model on a source-domain dataset with sufficient samples and then transfers it to target-domain datasets where only a few training samples are available for efficient finetuning. There are two major challenges in this task: (1) the domain gap and (2) finetuning with scarce data. To solve these challenges, we revisit adapter-based methods and discover an intriguing insight not explored in previous works: the adapter not only helps the fine-tuning of downstream tasks but also naturally serves as a domain information decoupler. Then, we delve into this finding for an interpretation, and we find that the model's inherent structure can lead to a natural decoupling of domain information. Building upon this insight, we propose the Domain Feature Navigator (DFN), which is a structure-based decoupler, instead of the loss-based ones used in current works, to capture domain-specific information, thereby directing the model's attention towards domain-agnostic knowledge. Moreover, to prevent potential excessive overfitting of DFN during the source-domain training, we further design the SAM-SVN method to constrain DFN from learning sample-specific knowledge. On target domains, we freeze the model and fine-tune the DFN to learn knowledge specific to target domains. Extensive experiments demonstrate that our method significantly surpasses the state-of-the-art method in CD-FSS by 2.69% and 4.68% average MIoU in 1-shot and 5-shot scenarios, respectively.
Paperid:2024
Authors:Xuqian Xue · Yiming Lei · Qi Cai · Hongming Shan · Junping Zhang
Abstract:
While contrastive multi-view clustering has achieved remarkable success, it implicitly assumes a balanced class distribution. However, real-world multi-view data primarily exhibits imbalanced class distributions. Consequently, existing methods suffer substantial performance degradation due to their inability to perceive and model such imbalance. To address this challenge, we present the first systematic study of imbalanced multi-view clustering, focusing on two fundamental problems: (i) perceiving the class imbalance distribution, and (ii) mitigating the representation degradation of minority samples. We propose PROTOCOL, a novel PaRtial Optimal TranspOrt-enhanced COntrastive Learning framework for imbalanced multi-view clustering. First, for class imbalance perception, we map multi-view features into a consensus space and reformulate imbalanced clustering as a partial optimal transport problem, augmented with progressive mass constraints and weighted KL divergence for class distributions. Second, we develop a partial optimal transport (POT)-enhanced class-rebalanced contrastive learning at both feature and class levels, incorporating logit adjustment and class-sensitive learning to enhance minority sample representations. Extensive experiments on five datasets demonstrate that PROTOCOL significantly improves clustering performance on imbalanced multi-view data, filling a critical research gap in this field.
Paperid:2025
Authors:Yiheng Xu · Zekun Wang · Junli Wang · Dunjie Lu · Tianbao Xie · Amrita Saha · Doyen Sahoo · Tao Yu · Caiming Xiong
Abstract:
Automating GUI tasks remains challenging due to reliance on textual representations, platform-specific action spaces, and limited reasoning capabilities. We introduce Aguvis, a unified vision-based framework for autonomous GUI agents that directly operates on screen images, standardizes cross-platform interactions, and incorporates structured reasoning via inner monologue. To enable this, we construct Aguvis Data Collection, a large-scale dataset with multimodal grounding and reasoning annotations, and develop a two-stage training pipeline that separates GUI grounding from planning and reasoning. Experiments show that Aguvis achieves state-of-the-art performance across offline and real-world online benchmarks, marking the first fully autonomous vision-based GUI agent that operates without closed-source models. We open-source all datasets, models, and training recipes to advance future research.
Paperid:2026
Authors:Youssef Allouah · Rachid Guerraoui · John Stephan
Abstract:
Resilience against malicious parties and data privacy are essential for trustworthy distributed learning, yet achieving both with good utility typically requires the strong assumption of a trusted central server. This paper shows that a significantly weaker assumption suffices: each pair of workers shares a randomness seed unknown to others. In a setting where malicious workers may collude with an untrusted server, we propose CafCor, an algorithm that integrates robust gradient aggregation with correlated noise injection, leveraging shared randomness between workers. We prove that CafCor achieves strong privacy-utility trade-offs, significantly outperforming local differential privacy (DP) methods, which do not make any trust assumption, while approaching central DP utility, where the server is fully trusted. Empirical results on standard benchmarks validate CafCor's practicality, showing that privacy and robustness can coexist in distributed systems without sacrificing utility or trusting the server.
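The shared-seed mechanism described above can be illustrated with pairwise canceling noise: each worker pair derives the same noise from its shared seed, one worker adds it and the other subtracts it, so honest contributions cancel in aggregate while each individual message stays perturbed. The noise scale and distribution here are illustrative, not CafCor's calibration.

```python
import numpy as np

def pairwise_correlated_noise(n_workers, dim, seeds):
    """Each pair (i, j) shares a seed: worker i adds the derived noise and
    worker j subtracts it, so the noise cancels in the honest sum while each
    worker's individual gradient message stays perturbed."""
    noise = np.zeros((n_workers, dim))
    for (i, j), seed in seeds.items():
        z = np.random.default_rng(seed).normal(size=dim)
        noise[i] += z
        noise[j] -= z
    return noise
```

In a real protocol, each worker would generate its share locally from the seeds it holds; the server never needs to see the seeds, which is what removes the trusted-server assumption.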
Paperid:2027
Authors:Kyurae Kim · Zuheng Xu · Jacob Gardner · Trevor Campbell
Abstract:
The performance of sequential Monte Carlo (SMC) samplers heavily depends on the tuning of the Markov kernels used in the path proposal. For SMC samplers with unadjusted Markov kernels, standard tuning objectives, such as the Metropolis-Hastings acceptance rate or the expected squared jump distance, are no longer applicable. While stochastic gradient-based end-to-end optimization algorithms have been explored for tuning SMC samplers, they often incur excessive training costs, even for tuning just the kernel step sizes. In this work, we propose a general adaptation framework for tuning the Markov kernels in SMC samplers by minimizing the incremental Kullback-Leibler (KL) divergence between the proposal and target paths. For step size tuning, we provide a gradient- and tuning-free algorithm that is generally applicable for kernels such as Langevin Monte Carlo (LMC). We further demonstrate the utility of our approach by providing a tailored scheme for tuning kinetic LMC used in SMC samplers. Our implementations are able to obtain a full schedule of tuned parameters at the cost of a few vanilla SMC runs, which is a fraction of gradient-based approaches.
Paperid:2028
Authors:Manon Revel · Smitha Milli · Tyler Lu · Jamelle Watson-Daniels · Maximilian Nickel
Abstract:
Online comment sections, such as those on news sites or social media, have the potential to foster informal public deliberation. However, this potential is often undermined by the frequency of toxic or low-quality exchanges in these settings. Algorithmic ranking is increasingly deployed to facilitate higher-quality discussions, e.g., by using civility classifiers or prosocial ranking. Yet, these interventions can also inadvertently undermine another key aspect of deliberation: representation of diverse views. We seek to remedy this problem by introducing guarantees of representation into these methods. We adopt a notion of justified representation (JR) from the social choice literature and incorporate a JR constraint into the comment ranking setting. Our theoretical and empirical results show that, in natural settings where user opinions cluster into a few groups, enforcing JR substantially improves representation and inclusivity of diverse views with minimal impact on other measures of conversational quality or on user engagement.
Paperid:2029
Authors:Yang Shen · Xiu-Shen Wei · Yifan Sun · YuXin Song · Tao Yuan · Jian Jin · He-Yang Xu · Yazhou Yao · Errui Ding
Abstract:
Computer Vision (CV) has yet to fully achieve the zero-shot task generalization observed in Natural Language Processing (NLP), despite following many of the milestones established in NLP, such as large transformer models, extensive pre-training, and the auto-regression paradigm, among others. In this paper, we rethink the reality that CV adopts discrete and terminological task definitions (e.g., "image segmentation"), and conjecture that this is a key barrier hampering zero-shot task generalization. Our hypothesis is that without truly understanding previously seen tasks, due to these terminological definitions, deep models struggle to generalize to novel tasks. To verify this, we introduce Explanatory Instructions, which provide an intuitive way to define CV task objectives through detailed linguistic transformations from input images to outputs. We create a large-scale dataset comprising 12 million "image input $\to$ explanatory instruction $\to$ output" triplets, and train an auto-regressive-based vision-language model (AR-based VLM) that takes both images and explanatory instructions as input. By learning to follow these instructions, the AR-based VLM achieves instruction-level zero-shot capabilities for previously seen tasks and demonstrates strong zero-shot generalization for unseen CV tasks. Code and dataset will be open-sourced.
Paperid:2030
Authors:Samantha Chen · Pankaj Agarwal · Yusu Wang
Abstract:
The ability to acquire high-resolution, large-scale geospatial data at an unprecedented rate using LiDAR and other related technologies has intensified the need for scalable algorithms for terrain analysis, including \emph{shortest-path-distance} (SPD) queries on large-scale terrain digital elevation models (DEMs). In this paper, we present a \emph{neural data structure} for efficiently answering SPD queries approximately on a large terrain DEM, based on the recently proposed neural geodesic field (NeuroGF) framework \cite{ZhangHAWH23}, the state-of-the-art neural data structure for estimating geodesic distance. In particular, we propose a decoupled-NeuroGF data structure combined with an efficient two-stage mixed-training strategy, which significantly reduces computational bottlenecks and enables efficient training on terrain DEMs at a scale not feasible before. We demonstrate the efficacy of our approach by performing detailed experiments on both synthetic and real data sets. For instance, we can train a small model with around 70,000 parameters on a terrain DEM with 16 million nodes in a matter of hours that can answer SPD queries with 1\% relative error in at most 10ms per query.
Paperid:2031
Authors:Sebastian Bordt · Suraj Srinivas · Valentyn Boreiko · Ulrike Luxburg
Abstract:
The leakage of benchmark data into the training data has emerged as a significant challenge for evaluating the capabilities of large language models (LLMs). In this work, we challenge the common assumption that small-scale contamination renders benchmark evaluations invalid. First, we experimentally quantify the magnitude of benchmark overfitting by scaling along three dimensions: the number of model parameters (up to 1.6B), the number of times an example is seen (up to 144), and the number of training tokens (up to 40B). If model and data follow the Chinchilla scaling laws, minor contamination indeed leads to overfitting. At the same time, even contamination repeated 144 times can be forgotten if the training data is scaled beyond five times Chinchilla, a regime characteristic of many modern LLMs. Continual pre-training of OLMo-7B corroborates these results. Next, we study the impact of the weight decay parameter on example forgetting, showing that empirical forgetting occurs faster than the cumulative weight decay. This allows us to gauge the degree of example forgetting in large-scale training runs, indicating that many LLMs, including Llama 3 405B, have forgotten the data seen at the beginning of training.
Paperid:2032
Authors:Yifan Zhang · Zijian Wei · Haojie Ren · Changliang Zou
Abstract:
Online multiple hypothesis testing has attracted a lot of attention in many applications, e.g., weekly A/B testing and stock market price monitoring. The state-of-the-art generalized $\alpha$-investing (GAI) algorithms can control the online false discovery rate (FDR) on p-values only under specific dependence structures, a situation that rarely occurs in practice. The e-LOND algorithm (Xu & Ramdas, 2024) utilizes e-values to achieve online FDR control under arbitrary dependence but suffers from a significant loss in power, as its testing levels are derived from pre-specified descent sequences. To address these limitations, we propose e-GAI, a novel framework based on valid e-values. e-GAI ensures provable online FDR control under arbitrary dependence while improving power by dynamically allocating the testing levels. These testing levels are updated not only based on the number of previous rejections and the prior costs, but also by assigning less $\alpha$-wealth to each rejection from a risk-aversion perspective. Within the e-GAI framework, we introduce two new online FDR procedures, e-LORD and e-SAFFRON, analogues of the LORD and SAFFRON procedures, two popular algorithms in the GAI framework. Both simulated and real data experiments demonstrate the advantages of e-LORD and e-SAFFRON in FDR control and power.
Paperid:2033
Authors:XIAOXUAN HAN · Songlin Yang · Wei Wang · Yang Li · JING DONG
Abstract:
Text-to-image (T2I) diffusion models have raised concerns about generating inappropriate content, such as "nudity". Despite efforts to erase undesirable concepts through unlearning techniques, these unlearned models remain vulnerable to adversarial inputs that can regenerate such content. Therefore, establishing robust adversarial defenses is crucial for safeguarding unlearned diffusion models. However, existing defense methods struggle to balance adversarial robustness with the generative capabilities of T2I models (i.e., model utility). To address these challenges, we propose a novel inference-time defense strategy to mitigate the impact of adversarial inputs. Specifically, we first reformulate the challenge of ensuring robustness in unlearned diffusion models as a robust regression problem. We then extend naive median smoothing for regression robustness, which employs isotropic Gaussian noise, to a generalized median smoothing framework incorporating anisotropic noise. Based on this, we propose a token-wise Adaptive Median Smoothing method that dynamically adjusts noise intensity according to the relevance of tokens to target concepts. Furthermore, to improve inference efficiency, we explore implementations of this adaptive method at the text-encoding stage. This adaptive method enhances adversarial robustness while preserving model utility and inference efficiency, as demonstrated by extensive experiments showing that it efficiently boosts the robustness of unlearned models and maintains their original generative capabilities, outperforming baseline defense approaches.
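The underlying primitive, randomized median smoothing of a regression function, can be sketched generically: perturb the input with Gaussian noise and take the median of the outputs, with a per-dimension noise scale standing in for the anisotropic variant. This is a minimal illustration under assumed names, not the paper's token-wise method.

```python
import numpy as np

def median_smooth(f, x, sigma, n_samples=501, seed=0):
    """Randomized median smoothing: median of f over Gaussian
    perturbations of x. `sigma` may be a scalar (isotropic) or a
    per-dimension array (anisotropic)."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, 1.0, size=(n_samples, x.size)) * sigma
    outputs = np.array([f(x + eps) for eps in noise])
    return np.median(outputs, axis=0)

# toy regression with an adversarial-style spike at exactly x = 0
f = lambda z: float(z.sum() + (100.0 if abs(z[0]) < 1e-3 else 0.0))
x = np.zeros(2)
y = median_smooth(f, x, sigma=np.array([0.5, 0.5]))
# the median is robust: the measure-zero spike barely moves it
```

Because the median, unlike the mean, ignores a small fraction of corrupted outputs, the smoothed prediction stays close to the benign value even when some perturbed inputs trigger the spike.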
Paperid:2034
Authors:Tianyu Zhang · Andrew Williams · Phillip Wozny · Kai-Hendrik Cohrs · Koen Ponse · Marco Jiralerspong · Soham Phade · Sunil Srinivasa · Lu Li · Yang Zhang · Prateek Gupta · Erman Acar · Irina Rish · Yoshua Bengio · Stephan Zheng
Abstract:
Global cooperation on climate change mitigation is essential to limit temperature increases while supporting long-term, equitable economic growth and sustainable development. Achieving such cooperation among diverse regions, each with different incentives, in a dynamic environment shaped by complex geopolitical and economic factors, without a central authority, is a profoundly challenging game-theoretic problem. This article introduces RICE-N, a multi-region integrated assessment model that simulates the global climate, economy, and climate negotiations and agreements. RICE-N uses multi-agent reinforcement learning (MARL) to encourage agents to develop strategic behaviors based on environmental dynamics and the actions of others. We present two negotiation protocols: (1) Bilateral Negotiation, an exemplary protocol, and (2) Basic Club, inspired by Climate Clubs and the carbon border adjustment mechanism (Nordhaus, 2015; Comissions, 2022). We compare their impact against a no-negotiation baseline with various mitigation strategies, showing that both protocols significantly reduce temperature growth at the cost of a minor drop in production while ensuring a more equitable distribution of emission reduction costs.
Paperid:2035
Authors:Lisha Chen · Quan Xiao · Ellen Fukuda · Xinyi Chen · Kun Yuan · Tianyi Chen
Abstract:
Multi-objective learning under user-specified preference is common in real-world problems such as portfolio management and molecule design. In this work, we frame such a problem as a semivectorial bilevel optimization problem, whose goal is to optimize a pre-defined preference function, subject to the constraint that the model parameters are weakly Pareto optimal or Pareto stationary. To solve this problem, we propose a two-step reformulation. First, we convert the multi-objective constraints to a single-objective constraint through a merit function with an easy-to-evaluate gradient. Second, we use a penalty-based reformulation of the bilevel optimization problem. We theoretically establish the relations between the solutions of the different formulations. We then propose algorithms to solve the reformulated optimization problem and establish their convergence guarantees. To evaluate the performance of the proposed algorithm, we test the method on various synthetic and real-world problems. The results demonstrate the effectiveness of the proposed method in finding preference-guided optimal solutions to the multi-objective problem.
Paperid:2036
Authors:Zheng Fang · Lichuan Xiang · Xu Cai · Kaicheng Zhou · Hongkai Wen
Abstract:
ControlNet offers a powerful way to guide diffusion-based generative models, yet most implementations rely on ad-hoc heuristics to choose which network blocks to control, an approach that varies unpredictably across tasks. To address this gap, we propose FlexControl, a novel framework that copies all diffusion blocks during training and employs a trainable gating mechanism to dynamically select which blocks to activate at each denoising step. By introducing a computation-aware loss, we encourage control blocks to activate only when they benefit generation quality. By eliminating manual block selection, FlexControl enhances adaptability across diverse tasks and streamlines the design pipeline, trained end-to-end with the computation-aware loss. Through comprehensive experiments on both UNet (e.g., SD1.5) and DiT (e.g., SD3.0), we show that our method outperforms existing ControlNet variants in key aspects of interest. As evidenced by both quantitative and qualitative evaluations, FlexControl preserves or enhances image fidelity while also reducing computational overhead by selectively activating the most relevant blocks. These results underscore the potential of a flexible, data-driven approach for controlled diffusion and open new avenues for efficient generative model design.
Paperid:2037
Authors:Suyuan Zhao · YIZHEN LUO · Ganbo Yang · Yan Zhong · Hao Zhou · Zaiqing Nie
Abstract:
Spatial Transcriptomics (ST) technologies provide biologists with rich insights into single-cell biology by preserving the spatial context of cells. Building foundation models for ST can significantly enhance the analysis of vast and complex data sources, unlocking new perspectives on the intricacies of biological tissues. However, modeling ST data is inherently challenging due to the need to extract multi-scale information from tissue slices containing vast numbers of cells. This process requires integrating macro-scale tissue morphology, micro-scale cellular microenvironment, and gene-scale gene expression profiles. To address this challenge, we propose SToFM, a multi-scale Spatial Transcriptomics Foundation Model. SToFM first performs multi-scale information extraction on each ST slice to construct a set of ST sub-slices that aggregate macro-, micro-, and gene-scale information. Then an SE(2) Transformer is used to obtain high-quality cell representations from the sub-slices. Additionally, we construct SToCorpus-88M, the largest high-resolution spatial transcriptomics corpus for pretraining. SToFM achieves outstanding performance on a variety of downstream tasks, such as tissue region semantic segmentation and cell type annotation, demonstrating its comprehensive understanding of ST data by capturing and integrating multi-scale information.
Paperid:2038
Authors:Joanna Waczyńska · Tomasz Szczepanik · Piotr Borycki · Slawomir Tadeja · Thomas Bohné · Przemysław Spurek
Abstract:
Implicit Neural Representations (INRs) approximate discrete data through continuous functions and are commonly used for encoding 2D images. Traditional image-based INRs employ neural networks to map pixel coordinates to RGB values, capturing shapes, colors, and textures within the network's weights. Recently, 2D Gaussian Splatting (GS) has been proposed as an alternative, using Gaussian functions instead of neural networks; it achieves a quality and compression ratio similar to classical INR models but does not allow image modification. In contrast, our work introduces a novel method, MiraGe, which uses mirror reflections to perceive 2D images in 3D space and employs flat-controlled Gaussians for precise 2D image editing. Our approach improves rendering quality and allows realistic image modifications, including human-inspired perception of photos in the 3D world. Thanks to modeling images in 3D space, we obtain the illusion of 3D-based modification in 2D images. We also show that our Gaussian representation can easily be combined with a physics engine to produce physics-based modification of 2D images. Consequently, MiraGe allows for better quality than the standard approach and for natural modification of 2D images.
Paperid:2039
Authors:Shibo Jie · Yehui Tang · Kai Han · Zhi-Hong Deng · Jing Han
Abstract:
Transformer-based large language models (LLMs) have already achieved remarkable results on long-text tasks, but limited GPU memory (VRAM) struggles to accommodate the linearly growing demand for the key-value (KV) cache as the sequence length increases, which has become a bottleneck for applying LLMs to long sequences. Existing KV cache compression methods evict, merge, or quantize the KV cache to reduce its size. However, compression results in irreversible information forgetting, potentially affecting the accuracy of subsequent decoding. In this paper, we propose SpeCache, which takes full advantage of the large and easily expandable CPU memory to offload the complete KV cache, and dynamically fetches KV pairs back in each decoding step based on their importance, measured by a low-precision copy of the KV cache kept in VRAM. To avoid inference latency caused by CPU-GPU communication, SpeCache speculatively predicts the KV pairs that the next token might attend to, allowing us to prefetch them before the next decoding step and thus parallelize prefetching and computation. Experiments on LongBench and Needle-in-a-Haystack benchmarks verify that SpeCache effectively reduces VRAM usage while avoiding information forgetting for long sequences without re-training, even with a 10x KV cache compression ratio.
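The core scoring idea, ranking offloaded KV pairs with a cheap low-precision key copy and fetching only the top candidates, can be sketched generically. All names, shapes, and the int8 scheme below are illustrative assumptions, not SpeCache's actual implementation.

```python
import numpy as np

def topk_prefetch(q, keys_lowprec, k):
    """Score offloaded KV pairs with a low-precision key copy and
    return indices of the top-k pairs to prefetch from CPU memory."""
    scores = keys_lowprec @ q            # approximate attention logits
    return np.argsort(scores)[-k:][::-1]

rng = np.random.default_rng(0)
keys = rng.normal(size=(128, 16)).astype(np.float32)   # full-precision (offloaded)
# small int8 copy kept in VRAM purely for importance scoring
keys_int8 = np.round(keys * 127 / np.abs(keys).max()).astype(np.int8)
q = rng.normal(size=16).astype(np.float32)

approx = set(topk_prefetch(q, keys_int8.astype(np.float32), 8))
exact = set(np.argsort(keys @ q)[-8:])
overlap = len(approx & exact)
# low-precision scoring recovers most of the truly important pairs
```

The point of the sketch is that a heavily quantized copy preserves the ranking of attention scores well enough to decide what to fetch, so the full-precision cache can live in cheap CPU memory.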
Paperid:2040
Authors:Pinzheng Wang · Juntao Li · Zecheng Tang · Haijia Gui · Min zhang
Abstract:
Large language models (LLMs) have demonstrated considerable reasoning abilities in various tasks such as mathematics and coding. However, recent studies indicate that even the best models lack true comprehension of their reasoning processes. In this paper, we explore how self-play can enhance the rationality of models in the reasoning process without supervision from humans or superior models. We design a $\textit{\textbf{C}ritic-\textbf{D}iscernment \textbf{G}ame}~(\textbf{CDG})$ in which a prover first provides a solution to a given problem and is subsequently challenged by critiques of its solution. These critiques either aim to assist or mislead the prover. The objective of the prover is to maintain the correct answer when faced with misleading comments, while correcting errors in response to constructive feedback. Our experiments on tasks involving mathematical reasoning, stepwise error detection, self-correction, and long-chain reasoning demonstrate that CDG training can significantly improve the ability of well-aligned LLMs to comprehend their reasoning process.
Paperid:2041
Authors:Wei Huang · Haotong Qin · Yangdong Liu · Yawei Li · Qinshuo Liu · Xianglong Liu · Luca Benini · Michele Magno · Shiming Zhang · XIAOJUAN QI
Abstract:
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). However, while uniform-precision quantization is computationally efficient, it often compromises model performance. To address this, we propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths group-wise with high accuracy. Our approach leverages the observation that important weights follow a structured distribution and introduces two key components: 1) Salience-Determined Bit Allocation adaptively assigns bit-widths to groups within each layer based on their salience; and 2) Salience-Weighted Quantizer Calibration optimizes quantizer parameters by incorporating element-level salience, retaining essential information. With its structured group-wise partitioning, SliM-LLM provides a hardware-friendly solution that matches the efficiency of uniform quantization methods while significantly improving accuracy. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths. For example, a 2-bit quantized LLaMA-7B model reduces memory usage by nearly 6x compared to the floating-point baseline, decreases perplexity by 48% compared to state-of-the-art gradient-free PTQ methods, and maintains GPU inference speed. Additionally, the extended version, SliM-LLM+, which incorporates gradient-based quantization, further reduces perplexity by 35.1%.
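The basic mechanics of group-wise mixed precision can be illustrated with vanilla uniform symmetric quantization, assigning a different bit-width to each group. This is a generic sketch, not SliM-LLM's salience metric or calibration; the helper name and group layout are assumptions for illustration.

```python
import numpy as np

def quantize_groups(w, bits_per_group, group_size=4):
    """Uniform symmetric quantization with one bit-width per group:
    a mixed-precision scheme gives more-salient groups more bits,
    which shrinks their quantization error."""
    out = np.empty_like(w)
    for g, bits in enumerate(bits_per_group):
        chunk = w[g * group_size:(g + 1) * group_size]
        scale = np.abs(chunk).max() / (2 ** (bits - 1) - 1)
        out[g * group_size:(g + 1) * group_size] = (
            np.round(chunk / scale) * scale)
    return out

rng = np.random.default_rng(0)
w = rng.normal(size=8)
# treat the first group as "salient": give it 8 bits instead of 2
salient_first = quantize_groups(w, bits_per_group=[8, 2])
uniform = quantize_groups(w, bits_per_group=[2, 2])
err_salient = np.abs(w[:4] - salient_first[:4]).max()
err_uniform = np.abs(w[:4] - uniform[:4]).max()
# the 8-bit group reconstructs its weights far more accurately
```

Keeping the bit-width constant within each structured group, as the sketch does, is what makes such schemes hardware-friendly compared with per-element precision.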
Paperid:2042
Authors:Anna Soligo · Pietro Ferraro · David Boyle
Abstract:
Interpretability is crucial for ensuring RL systems align with human values. However, it remains challenging to achieve in complex decision-making domains. Existing methods frequently attempt interpretability at the level of fundamental model units, such as neurons or decision nodes: an approach which scales poorly to large models. Here, we instead propose an approach to interpretability at the level of functional modularity. We show how encouraging sparsity and locality in network weights leads to the emergence of functional modules in RL policy networks. To detect these modules, we develop an extended Louvain algorithm which uses a novel `correlation alignment' metric to overcome the limitations of standard network analysis techniques when applied to neural network architectures. Applying these methods to 2D and 3D Minigrid environments reveals the consistent emergence of distinct navigational modules for different axes, and we further demonstrate how these functions can be validated through direct interventions on network weights prior to inference.
Paperid:2043
Authors:Jake Snell · Thomas Griffiths
Abstract:
As machine learning-based prediction systems are increasingly used in high-stakes situations, it is important to understand how such predictive models will perform upon deployment. Distribution-free uncertainty quantification techniques such as conformal prediction provide guarantees about the loss black-box models will incur even when the details of the models are hidden. However, such methods are based on frequentist probability, which unduly limits their applicability. We revisit the central aspects of conformal prediction from a Bayesian perspective and thereby illuminate the shortcomings of frequentist guarantees. We propose a practical alternative based on Bayesian quadrature that provides interpretable guarantees and offers a richer representation of the likely range of losses to be observed at test time.
Paperid:2044
Authors:Swarnadeep Saha · Xian Li · Marjan Ghazvininejad · JASON WESTON · Tianlu Wang
Abstract:
LLM-as-a-Judge models generate chain-of-thought (CoT) sequences intended to capture the step-by-step reasoning process that underlies the final evaluation of a response. However, due to the lack of human-annotated CoTs for evaluation, the required components and structure of effective reasoning traces remain understudied. Consequently, previous approaches often (1) constrain reasoning traces to hand-designed components, such as a list of criteria, reference answers, or verification questions, and (2) structure them such that planning is intertwined with the reasoning for evaluation. In this work, we propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge that first generates an unconstrained evaluation plan, followed by its execution, and then the final judgment. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions, leading to better final verdicts. Our method achieves a new state-of-the-art performance for generative reward models on RewardBench (with a score of $93.9$), despite being trained on a smaller number of synthetically generated preference pairs. Additional experiments on other benchmarks like RM-Bench, JudgeBench, and FollowBenchEval further highlight the utility of both planning and reasoning for building robust LLM-as-a-Judge reasoning models.
Paperid:2045
Authors:Jiannian Wang · Yao Lu · Guangming Lu
Abstract:
Image steganography ensures secure information transmission and storage by concealing secret messages within images. Recently, diffusion models have been incorporated into the generative image steganography task, with text prompts employed to guide the entire process. However, existing methods are plagued by three problems: (1) the restricted control exerted by text prompts causes generated stego images to resemble the secret images and appear unnatural, raising severe detection risk; (2) inconsistent intermediate states between Denoising Diffusion Implicit Models and their inversion, coupled with the limited control of text prompts, degrade the revealed secret images; (3) the descriptive text of images (i.e., text prompts) is also deployed as the key, which incurs significant security risks for both the keys and the secret images. To tackle these drawbacks, we systematically propose SSHR, which combines reference images with adaptive keys to govern the entire process, enhancing the naturalness and imperceptibility of stego images. Additionally, we methodically construct an Exact Reveal Process to improve the quality of the revealed secret images. Furthermore, adaptive Reference-Secret Image Related Symmetric Keys are generated to enhance the security of both the keys and the concealed secret images. Various experiments indicate that our model outperforms existing methods in terms of recovery quality and secret image security.
Paperid:2046
Authors:Jennifer Brennan · Sébastien Lahaie · Adel Javanmard · Nick Doudchenko · Jean Pouget-Abadie
Abstract:
In experimental causal inference, we distinguish between two sources of uncertainty: design uncertainty, due to the treatment assignment mechanism, and sampling uncertainty, when the sample is drawn from a superpopulation. This distinction matters in settings with small fixed samples and heterogeneous treatment effects, as in geographical experiments. The standard bootstrap procedure most often used by practitioners primarily estimates sampling uncertainty, and the causal bootstrap procedure, which accounts for design uncertainty, was developed for the completely randomized design and the difference-in-means estimator, whereas non-standard designs and estimators are often used in these low-power regimes. We address this gap by proposing an integer program which numerically computes the worst-case copula used as an input to the causal bootstrap method in a wide range of settings. Specifically, we prove the asymptotic validity of our approach for unconfounded, conditionally unconfounded, and individualistic with bounded confoundedness assignments, and generalize to any linear-in-treatment and quadratic-in-treatment estimators. We demonstrate the refined confidence intervals achieved through simulations of small geographical experiments.
Paperid:2047
Authors:Rui Gao · Weiwei Liu
Abstract:
Continual learning has attracted increasing research attention in recent years due to its promising experimental results in real-world applications. In this paper, we study calibration in continual learning, i.e., reliably quantifying the uncertainty of model predictions. Conformal prediction (CP) provides a general framework for model calibration, outputting prediction intervals or sets with a theoretical high-coverage guarantee as long as the samples are exchangeable. However, the tasks in continual learning are learned in sequence, which violates the principle that data should be exchangeable. Meanwhile, the model learns the current task with limited or no access to data from previous tasks, which is not conducive to constructing the calibration set. To address these issues, we propose a CP-based method for model uncertainty quantification in continual learning (CPCL), which also reveals the connection between prediction interval length and forgetting. We analyze the oracle prediction interval in continual learning and theoretically prove the asymptotic coverage guarantee of CPCL. Finally, extensive experiments on simulated and real data empirically verify the validity of our proposed method.
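The standard exchangeability-based guarantee that CPCL extends can be sketched with textbook split conformal prediction: the empirical quantile of calibration residuals yields an interval half-width with finite-sample coverage. This is the generic baseline, not CPCL itself.

```python
import numpy as np

def split_conformal_halfwidth(residuals_calib, alpha=0.1):
    """Split conformal prediction: the adjusted (1-alpha) quantile of
    calibration residuals |y - f(x)| gives a half-width q such that
    [f(x)-q, f(x)+q] covers exchangeable test points w.p. >= 1-alpha."""
    n = len(residuals_calib)
    level = np.ceil((n + 1) * (1 - alpha)) / n   # finite-sample correction
    return np.quantile(residuals_calib, min(level, 1.0), method="higher")

rng = np.random.default_rng(0)
calib = np.abs(rng.normal(size=1000))  # calibration residuals
test = np.abs(rng.normal(size=1000))   # exchangeable test residuals
q = split_conformal_halfwidth(calib, alpha=0.1)
coverage = float(np.mean(test <= q))
# empirical coverage lands near the nominal 90% level
```

When tasks arrive sequentially, the calibration and test residuals are no longer exchangeable, which is exactly the failure mode the abstract describes.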
Paperid:2048
Authors:Adya Agrawal · Leo de Castro · Daniel Escudero · Antigoni Polychroniadou · Manuela Veloso
Abstract:
As large language models (LLMs) become more powerful, the computation required to run these models is increasingly outsourced to a third-party cloud. While this saves clients' computation, it risks leaking the clients' LLM queries to the cloud provider. Fully homomorphic encryption (FHE) presents a natural solution to this problem: simply encrypt the query and evaluate the LLM homomorphically on the cloud machine. The result remains encrypted and can only be learned by the client who holds the secret key. In this work, we present a GPU-accelerated implementation of FHE and use this implementation to evaluate an encrypted GPT-2 forward pass, with runtimes over $200\times$ faster than the CPU baseline. We also present novel and extensive experimental analysis of approximations of LLM activation functions to maintain accuracy while achieving this performance.
Paperid:2049
Authors:Qirui Wu · Shizhou Zhang · De Cheng · Yinghui Xing · di xu · Peng Wang · Yanning Zhang
Abstract:
Catastrophic forgetting is a critical challenge for incremental object detection (IOD). Most existing methods treat the detector monolithically, relying on instance replay or knowledge distillation without analyzing component-specific forgetting. Through a dissection of Faster R-CNN, we reveal a key insight: catastrophic forgetting is predominantly localized to the RoI Head classifier, while regressors retain robustness across incremental stages. This finding challenges conventional assumptions, motivating us to develop a framework termed NSGP-RePRE. Regional Prototype Replay (RePRE) mitigates classifier forgetting via replay of two types of prototypes: coarse prototypes represent class-wise semantic centers of RoI features, while fine-grained prototypes model intra-class variations. Null Space Gradient Projection (NSGP) is further introduced to eliminate prototype-feature misalignment by updating the feature extractor in directions orthogonal to the subspace of old inputs via gradient projection, aligning RePRE with incremental learning dynamics. Our simple yet effective design allows NSGP-RePRE to achieve state-of-the-art performance on the Pascal VOC and MS COCO datasets under various settings. Our work not only advances IOD methodology but also provides pivotal insights for catastrophic forgetting mitigation in IOD. Code will be available soon.
Paperid:2050
Authors:Siyuan Xu · Minghui Zhu
Abstract:
This paper studies meta-reinforcement learning with human-in-the-loop adaptation. It aims to pre-train a meta-model that can achieve few-shot adaptation for new tasks from human preference queries without relying on reward signals. To solve the problem, we propose the framework \textit{adaptation via Preference-Order-preserving EMbedding} (POEM). In the meta-training, the framework learns a task encoder, which maps tasks to a preference-order-preserving task embedding space, and a decoder, which maps the embeddings to the task-specific policies. In the human-in-the-loop adaptation, the task encoder facilitates efficient task embedding inference for new tasks from the preference queries and then obtains the task-specific policy. We provide a theoretical guarantee for the convergence of the adaptation process to the task-specific optimal policy and experimentally demonstrate its state-of-the-art performance with substantial improvement over baselines.
Paperid:2051
Authors:Jinyu Wang · Jingjing Fu · Rui Wang · Lei Song · Jiang Bian
Abstract:
Recent advancements in Retrieval-Augmented Generation (RAG) systems have significantly enhanced the capabilities of large language models (LLMs) by incorporating external knowledge retrieval. However, sole reliance on retrieval is often inadequate for mining deep, domain-specific knowledge and for performing logical reasoning over specialized datasets. To tackle these challenges, we present an approach designed to extract, comprehend, and utilize domain knowledge while constructing a coherent rationale. At the heart of our approach lie four pivotal components: a knowledge atomizer that extracts atomic questions from raw data, a query proposer that generates subsequent questions to facilitate the original inquiry, an atomic retriever that locates knowledge based on atomic knowledge alignments, and an atomic selector that determines which follow-up questions to pose guided by the retrieved information. Through this approach, we implement a knowledge-aware task decomposition strategy that adeptly extracts multifaceted knowledge from segmented data and iteratively builds the rationale in alignment with the initial query and the acquired knowledge. We conduct comprehensive experiments to demonstrate the efficacy of our approach across various benchmarks, particularly those requiring multi-hop reasoning steps. The results indicate a significant enhancement in performance, up to 12.6\% over the second-best method, underscoring the potential of the approach in complex, knowledge-intensive applications.
Paperid:2052
Authors:Angel Villar-Corrales · Sven Behnke
Abstract:
Predicting future scene representations is a crucial task for enabling robots to understand and interact with the environment. However, most existing methods rely on video sequences and simulations with precise action annotations, limiting their ability to leverage the large amount of available unlabeled video data. To address this challenge, we propose PlaySlot, an object-centric video prediction model that infers object representations and latent actions from unlabeled video sequences. It then uses these representations to forecast future object states and video frames. PlaySlot allows generating multiple possible futures conditioned on latent actions, which can be inferred from video dynamics, provided by a user, or generated by a learned action policy, thus enabling versatile and interpretable world modeling. Our results show that PlaySlot outperforms both stochastic and object-centric baselines for video prediction across different environments. Furthermore, we show that our inferred latent actions can be used to learn robot behaviors sample-efficiently from unlabeled video demonstrations. Code to reproduce our results will be open-sourced upon publication.
Paperid:2053
Authors:Lei Zhang · Jiaxi Yang · Min Yang · Jian Yang · Mouxiang Chen · Jiajun Zhang · Zeyu Cui · Binyuan Hui · Junyang Lin
Abstract:
We introduce UnitFlow, a novel data synthesis framework grounded in Test-Driven Development (TDD). Unlike existing software engineering data that rely on human-submitted issues, UnitFlow automatically infers incremental development steps directly from unit tests, which inherently encapsulate high-level requirements. The core of UnitFlow is the construction of a Runtime Dependency Graph (RDG), which precisely captures function interactions, enabling the generation of a structured, step-by-step development schedule. At each step, UnitFlow produces a partial codebase, the corresponding unit tests, and the necessary code modifications, resulting in fully verifiable TDD tasks. With this approach, we generated 16,061 training instances and 2,020 test instances from real-world GitHub projects, creating the UnitFlow-Eval benchmark. Our experiments show that fine-tuning open models on this dataset significantly improves performance in TDD-based coding. To facilitate further research and advancements in TDD-based software engineering, we open-source all code, datasets, models, and Docker images.
Paperid:2054
Authors:Xinsong Ma · Jie Wu · Weiwei Liu
Abstract:
Out-of-distribution (OOD) detection is a crucial task in reliable and safety-critical applications. Previous studies primarily focus on developing score functions while neglecting the design of decision rules based on these scores. A recent work (Ma et al., 2024) is the first to highlight this issue and proposes the generalized BH (g-BH) algorithm to address it. The g-BH algorithm relies on empirical p-values, with the calibrated set playing a central role in their computation. However, the impact of the calibrated set on the performance of the g-BH algorithm has not been thoroughly investigated. This paper aims to uncover the underlying mechanisms relating them. Theoretically, we demonstrate that the conditional expectation of the true positive rate (TPR) given the calibrated set for the g-BH algorithm follows a beta distribution, which depends on the prescribed level and the size of the calibrated set. This indicates that a small calibrated set tends to degrade the performance of the g-BH algorithm. To address this limitation, we propose a novel ensemble g-BH (eg-BH) algorithm which integrates various empirical p-values for decision making while controlling the false discovery rate (FDR) at the prescribed level. Finally, extensive experimental results validate our theoretical findings and demonstrate the superiority of our method over the g-BH algorithm on small calibrated sets.
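For readers unfamiliar with the base procedure that g-BH generalizes, the vanilla Benjamini-Hochberg step-up rule on p-values can be sketched in a few lines. This is the textbook BH procedure, not the paper's g-BH or eg-BH algorithm.

```python
import numpy as np

def bh_rejections(pvals, alpha=0.1):
    """Benjamini-Hochberg step-up: with sorted p-values p_(1)<=...<=p_(m),
    reject the k smallest where k = max{ i : p_(i) <= alpha * i / m }.
    Returns the set of rejected hypothesis indices."""
    m = len(pvals)
    order = np.argsort(pvals)
    sorted_p = np.asarray(pvals)[order]
    below = np.nonzero(sorted_p <= alpha * np.arange(1, m + 1) / m)[0]
    k = below[-1] + 1 if below.size else 0
    return set(order[:k].tolist())

pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.50, 0.74, 0.90]
rej = bh_rejections(pvals, alpha=0.1)
# note the step-up character: p=0.039 exceeds its own threshold
# (0.1 * 3/8 = 0.0375) but is still rejected because p=0.041 <= 0.1 * 4/8
```

In the g-BH setting the p-values above are replaced by empirical p-values computed from a calibrated set, which is why the size of that set drives the procedure's power.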
Paperid:2055
Authors:Lin Zhao · Yushu Wu · Xinru Jiang · Jianyang Gu · Yanzhi Wang · Xiaolin Xu · Pu Zhao · Xue Lin
Abstract:
Recent deep learning models demand larger datasets, driving the need for dataset distillation to create compact, cost-efficient datasets while maintaining performance. Due to the powerful image generation capability of diffusion models, they have been introduced to this field for generating distilled images. In this paper, we systematically investigate issues present in current diffusion-based dataset distillation methods, including inaccurate distribution matching, distribution deviation with random noise, and separate sampling. Building on this, we propose D$^3$HR, a novel diffusion-based framework to generate distilled datasets with high representativeness. Specifically, we adopt DDIM inversion to map the latents of the full dataset from a low-normality latent domain to a high-normality Gaussian domain, preserving information and ensuring structural consistency to generate representative latents for the distilled dataset. Furthermore, we propose an efficient sampling scheme to better align the representative latents with the high-normality Gaussian distribution. Our comprehensive experiments demonstrate that D$^3$HR can achieve higher accuracy across different model architectures compared with state-of-the-art baselines in dataset distillation.
Paperid:2056
Authors:Sagar Shrestha · Xiao Fu
Abstract:
Diversified distribution matching (DDM) finds a unified translation function mapping a diverse collection of conditional source distributions to their target counterparts. DDM was proposed to resolve content misalignment issues in unpaired domain translation, achieving translation identifiability. However, DDM has only been implemented using GANs due to its constraints on the translation function. GANs are often unstable to train and do not provide transport trajectory information, yet such trajectories are useful in applications such as single-cell evolution analysis and robot route planning. This work introduces diversified flow matching (DFM), an ODE-based framework for DDM. Adapting flow matching (FM) to enforce a unified translation function as in DDM is challenging, as FM learns the translation function's velocity rather than the translation function itself. A custom bilevel optimization-based training loss, a nonlinear interpolant, and a structural reformulation are proposed to address these challenges, offering a tangible implementation. To our knowledge, DFM is the first ODE-based approach guaranteeing translation identifiability. Experiments on synthetic and real-world datasets validate the proposed method.
Paperid:2057
Authors:Sophie Hanna Langbein · Niklas Koenen · Marvin N. Wright
Abstract:
Deep learning survival models often outperform classical methods in time-to-event predictions, particularly in personalized medicine, but their "black box" nature hinders broader adoption. We propose a framework for gradient-based explanation methods tailored to survival neural networks, extending their use beyond regression and classification. We analyze the implications of their theoretical assumptions for time-dependent explanations in the survival setting and propose effective visualizations incorporating the temporal dimension. Experiments on synthetic data show that gradient-based methods capture the magnitude and direction of local and global feature effects, including time dependencies. We introduce GradSHAP(t), a gradient-based counterpart to SurvSHAP(t), which outperforms SurvSHAP(t) and SurvLIME in a computational speed vs. accuracy trade-off. Finally, we apply these methods to medical data with multi-modal inputs, revealing relevant tabular features and visual patterns, as well as their temporal dynamics.
Paperid:2058
Authors:Anissa Alloula · Charles Jones · Ben Glocker · Bartlomiej W. Papiez
Abstract:
Despite the constant development of new bias mitigation methods for machine learning, no method consistently succeeds, and a fundamental question remains unanswered: when and why do bias mitigation techniques fail? In this paper, we hypothesise that a key reason may be linked to an often-overlooked but crucial step shared by many bias mitigation methods: the definition of subgroups. To investigate this, we conduct a comprehensive evaluation of state-of-the-art bias mitigation methods across multiple semi-synthetic image classification tasks, systematically varying subgroup definitions, including coarse, fine-grained, intersectional, and noisy subgroups. Our findings reveal that subgroup choice significantly impacts performance, with certain groupings paradoxically leading to worse outcomes than no mitigation at all. Through theoretical analysis, we explain these discrepancies and uncover a counter-intuitive insight that, in some cases, improving fairness with respect to a particular set of subgroups is best achieved by using a different set of subgroups for mitigation. Our work highlights the necessity of carefully considering subgroup definition before applying bias mitigation and offers new perspectives for improving the robustness and fairness of machine learning models.
Paperid:2059
Authors:Soheil Behnezhad · Moses Charikar · Vincent Cohen-Addad · Alma Ghafari · Weiyun ma
Abstract:
We study the classic correlation clustering problem. Given $n$ objects and a complete labeling of the object pairs as either “similar” or “dissimilar”, the goal is to partition the objects into arbitrarily many clusters while minimizing disagreements with the labels. The classic Pivot algorithm, due to [Ailon et al., STOC'05], obtains a 3-approximation for this problem. Over the years, this algorithm has been successfully implemented in various settings. The downside of the Pivot algorithm is that its approximation factor of 3 is tight. While better approximations have been achieved in some settings, the corresponding algorithms are often hard to implement in these settings. For example, [Behnezhad et al., FOCS'19] showed that the output of Pivot can be maintained in polylog time per update in a dynamic setting, a bound that was improved to constant by [Dalirrooyfard et al., ICML'24]. But obtaining a better approximation remains open. In this paper, we present Modified Pivot, an algorithm that locally improves the output of Pivot. Our Modified Pivot algorithm can be implemented just as efficiently as Pivot in various settings. Our experiments show that the output of Modified Pivot on average makes less than 77\% of the mistakes made by Pivot. More surprisingly, we prove theoretically that Modified Pivot has approximation ratio $3-\epsilon_0$ for some absolute constant $\epsilon_0 > 0$. This, e.g., leads to a better than 3 approximation in the dynamic setting in polylog time, improving the 3-approximation obtained by [Behnezhad et al., FOCS'19] and [Dalirrooyfard et al., ICML'24].
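The baseline this abstract improves on, the Pivot algorithm of Ailon et al., is short enough to sketch in full. This is a plain rendition of the classic routine, not the paper's Modified Pivot; the `similar` callback and function names are illustrative assumptions.

```python
import random

def pivot(nodes, similar, rng=random.Random(0)):
    # Pivot (Ailon et al., STOC'05): repeatedly pick a uniformly random pivot and
    # cluster it together with all remaining nodes labeled "similar" to it.
    # In expectation this gives a 3-approximation for correlation clustering.
    nodes = list(nodes)
    clusters = []
    while nodes:
        p = rng.choice(nodes)
        cluster = [v for v in nodes if v == p or similar(p, v)]
        clusters.append(sorted(cluster))
        taken = set(cluster)
        nodes = [v for v in nodes if v not in taken]
    return clusters
```

On a labeling consistent with disjoint cliques, any pivot choice recovers the cliques exactly; the 3-approximation guarantee only matters when the labels are inconsistent.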
Paperid:2060
Authors:YAN SHEN · Ruihai Wu · Yubin Ke · Xinyuan Song · Zeyi Li · Xiaoqi Li · Hongwei Fan · Haoran Lu · Hao Dong
Abstract:
Shape assembly, the process of combining parts into a complete whole, is a crucial skill for robots with broad real-world applications. Among the various assembly tasks, geometric assembly—where broken parts are reassembled into their original form (e.g., reconstructing a shattered bowl)—is particularly challenging. This requires the robot to recognize geometric cues for grasping, assembly, and subsequent bimanual collaborative manipulation on varied fragments. In this paper, we exploit the geometric generalization of point-level affordance, learning affordance aware of bimanual collaboration in geometric assembly with long-horizon action sequences. To address the evaluation ambiguity caused by geometry diversity of broken parts, we introduce a real-world benchmark featuring geometric variety and global reproducibility. Extensive experiments demonstrate the superiority of our approach over both previous affordance-based and imitation-based methods.
Paperid:2061
Authors:Zihan Huang · Wei Fang · Tong Bu · Peng Xue · Zecheng Hao · Wenxuan Liu · Yuanhong Tang · Zhaofei Yu · Tiejun Huang
Abstract:
Spiking Neural Networks (SNNs) exhibit significant potential due to their low energy consumption. Converting Artificial Neural Networks (ANNs) to SNNs is an efficient way to achieve high-performance SNNs. However, many conversion methods are based on rate coding, which requires numerous spikes and longer time-steps compared to directly trained SNNs, leading to increased energy consumption and latency. This article introduces differential coding for ANN-to-SNN conversion, a novel coding scheme that reduces spike counts and energy consumption by transmitting changes in rate information rather than rates directly, and explores its application across various layers. Additionally, the threshold iteration method is proposed to optimize thresholds based on activation distribution when converting Rectified Linear Units (ReLUs) to spiking neurons. Experimental results on various Convolutional Neural Networks (CNNs) and Transformers demonstrate that the proposed differential coding significantly improves accuracy while reducing energy consumption, particularly when combined with the threshold iteration method, achieving state-of-the-art performance.
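The core idea of transmitting changes in rate rather than rates themselves can be illustrated with a toy encoder/decoder pair. This is a hedged sketch of the general differential-coding principle, not the paper's spiking scheme; the function names are hypothetical.

```python
def differential_encode(rates):
    # Send the change in rate between consecutive time steps instead of the
    # rate itself (illustrative sketch of the differential-coding idea).
    diffs, prev = [], 0
    for r in rates:
        diffs.append(r - prev)
        prev = r
    return diffs

def differential_decode(diffs):
    # The receiver integrates the differences to recover the original rates.
    rates, acc = [], 0
    for d in diffs:
        acc += d
        rates.append(acc)
    return rates
```

When rates vary slowly, the total magnitude transmitted (a proxy for spike count here) is much smaller than the rates themselves, which is the intuition behind the claimed energy savings.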
Paperid:2062
Authors:Dayang Wang · Srivathsa Pasumarthi Venkata · Ajit Shankaranarayanan · Greg Zaharchuk
Abstract:
Contrast-enhanced MRI enhances pathological visualization but often necessitates Pre-Contrast images for accurate quantitative analysis and comparative assessment. However, Pre-Contrast images are frequently unavailable due to time, cost, or safety constraints, or they may suffer from degradation, making alignment challenging. This limitation hinders clinical diagnostics and the performance of tools requiring combined image types. To address this challenge, we propose a novel staged, physics-grounded learning framework with a hyperintensity prior to synthesize Pre-Contrast images directly from Post-Contrast MRIs. The proposed method can generate high-quality Pre-Contrast images, thus enabling comprehensive diagnostics while reducing the need for additional imaging sessions, costs, and patient risks. To the best of our knowledge, this is the first Pre-Contrast synthesis model capable of generating images that may be interchangeably used with standard-of-care Pre-Contrast images. Extensive evaluations across multiple datasets, sites, anatomies, and downstream tasks demonstrate the model’s robustness and clinical applicability, positioning it as a valuable tool for contrast-enhanced MRI workflows.
Paperid:2063
Authors:Jie Bao · Hanwei Zhang · Zhixin Zhou · Chuangyin Dang · Rui Luo
Abstract:
As deep learning models are increasingly deployed in high-risk applications, robust defenses against adversarial attacks and reliable performance guarantees become paramount. Moreover, accuracy alone does not provide sufficient assurance or reliable uncertainty estimates for these models. In this study, we combine adversarial training with Conformal Prediction (CP) and formulate adversarial training within the CP framework as a bi-level optimization problem, where an attacker and a defender pursue opposing objectives to maximize and minimize model uncertainty, respectively. We develop an attack method that does not require coverage guarantees and integrate it with a conformal training-based defense strategy. This algorithm minimizes the size of the prediction sets under adversarial perturbations while maintaining high coverage probabilities. Experimental evaluations on the CIFAR-10 and CIFAR-100 datasets demonstrate that our attack method induces greater uncertainty compared to baseline approaches, while the defensive model significantly enhances robustness against various adversarial attacks and maintains reliable prediction intervals. Our findings highlight the effectiveness of integrating adversarial training with conformal prediction to develop trustworthy and resilient deep learning models for safety-critical domains.
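The conformal prediction machinery the abstract builds on is, in its simplest split form, a one-function procedure. This is a minimal sketch of standard split conformal prediction for classification, not the paper's bi-level adversarial scheme; names and the `1 - p` nonconformity score are illustrative assumptions.

```python
import numpy as np

def conformal_prediction_set(cal_scores, test_probs, alpha=0.1):
    # Split conformal prediction for classification. cal_scores are
    # nonconformity scores 1 - p_model(true label) on a held-out calibration
    # set; the prediction set keeps every label whose score is at most the
    # finite-sample-corrected (1 - alpha) quantile of the calibration scores.
    n = len(cal_scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    qhat = np.quantile(cal_scores, level, method="higher")
    return [k for k, p in enumerate(test_probs) if 1 - p <= qhat]
```

The attacker in the abstract tries to perturb inputs so these sets grow; the defender trains so they stay small while keeping the coverage guarantee.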
Paperid:2064
Authors:Samar Khanna · Medhanie Irgau · David Lobell · Stefano Ermon
Abstract:
Parameter-efficient fine-tuning (PEFT) techniques such as low-rank adaptation (LoRA) can effectively adapt large pre-trained foundation models to downstream tasks using only a small fraction (0.1%-10%) of the original trainable weights. An under-explored question of PEFT is in extending the pre-training phase without supervised labels; that is, can we adapt a pre-trained foundation model to a new domain via efficient self-supervised pre-training on this new domain? In this work, we introduce ExPLoRA, a highly effective technique to improve transfer learning of pre-trained vision transformers (ViTs) under domain shifts. Initializing a ViT with pre-trained weights on large, natural-image datasets such as from DinoV2 or MAE, ExPLoRA continues the unsupervised pre-training objective on a new domain, unfreezing 1-2 pre-trained ViT blocks and tuning all other layers with LoRA. We then fine-tune the resulting model only with LoRA on this new domain for supervised learning. Our experiments demonstrate state-of-the-art results on satellite imagery, even outperforming fully pre-trained and fine-tuned ViTs. Using the DinoV2 training objective, we demonstrate up to 7.5% improvement in linear probing top-1 accuracy on downstream tasks while using <10% of the number of parameters that are used in prior fully-tuned state-of-the-art approaches. Our ablation studies confirm the efficacy of our approach over other baselines, including PEFT and unfreezing more ViT blocks.
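The LoRA building block the abstract assumes can be sketched in a few lines of numpy: a frozen pretrained weight plus a trainable low-rank update. This is the standard LoRA parameterization, not the ExPLoRA training loop; the class name, hyperparameters, and numpy framing are illustrative assumptions.

```python
import numpy as np

class LoRALinear:
    """Frozen pretrained weight W plus a trainable low-rank update
    (alpha / r) * B @ A, as in standard LoRA (sketch, not ExPLoRA itself)."""

    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                        # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(r, d_in))   # trainable down-projection
        self.B = np.zeros((d_out, r))                     # zero-init up-projection
        self.scale = alpha / r

    def __call__(self, x):
        # Effective weight is W + scale * B @ A; only A and B would be trained.
        return x @ (self.W + self.scale * self.B @ self.A).T
```

Because `B` is zero-initialized, the adapted layer starts out identical to the pretrained one, which is what lets ExPLoRA resume the pre-training objective from the original weights.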
Paperid:2065
Authors:Tinglin Huang · Tianyu Liu · Mehrtash Babadi · Wengong Jin · ZHITAO YING
Abstract:
Spatial transcriptomics (ST) has emerged as a powerful technology for bridging histology imaging with gene expression profiling. However, its application has been limited by low throughput and the need for specialized experimental facilities. Prior work sought to predict ST from whole-slide histology images to accelerate this process, but they suffer from two major limitations. First, they do not explicitly model cell-cell interaction as they factorize the joint distribution of whole-slide ST data and predict the gene expression of each spot independently. Second, their encoders struggle with memory constraints due to the large number of spots (often exceeding 10,000) in typical ST datasets. Herein, we propose STFlow, a flow matching generative model that considers cell-cell interaction by modeling the joint distribution of gene expression of an entire slide. It also employs an efficient slide-level encoder with local spatial attention, enabling whole-slide processing without excessive memory overhead. On the recently curated HEST-1k and STImage-1K4M benchmarks, STFlow substantially outperforms state-of-the-art baselines and achieves over 18% relative improvements over the pathology foundation models.
Paperid:2066
Authors:Tomer Meir · Uri Shalit · Malka Gorfine
Abstract:
Tailoring treatments to individual needs is a central goal in fields such as medicine. A key step toward this goal is estimating Heterogeneous Treatment Effects (HTE)—the way treatments impact different subgroups. While crucial, HTE estimation is challenging with survival data, where time until an event (e.g., death) is key. Existing methods often assume complete observation, an assumption violated in survival data due to right-censoring, leading to bias and inefficiency. Cui et al. (2023) proposed a doubly-robust method for HTE estimation in survival data under no hidden confounders, combining a causal survival forest with an augmented inverse-censoring weighting estimator. However, we find it struggles under heavy censoring, which is common in rare-outcome problems such as Amyotrophic lateral sclerosis (ALS). Moreover, most current methods cannot handle instrumental variables, which are a crucial tool in the causal inference arsenal. We introduce Multiple Imputation for Survival Treatment Response (MISTR), a novel, general, and non-parametric method for estimating HTE in survival data. MISTR uses recursively imputed survival trees to handle censoring without directly modeling the censoring mechanism. Through extensive simulations and analysis of two real-world datasets—the AIDS Clinical Trials Group Protocol 175 and the Illinois unemployment dataset—we show that MISTR outperforms prior methods under heavy censoring in the no-hidden-confounders setting, and extends to the instrumental variable setting. To our knowledge, MISTR is the first non-parametric approach for HTE estimation with unobserved confounders via instrumental variables.
Paperid:2067
Authors:Chuqi CHEN · Yahong Yang · Yang Xiang · Wenrui Hao
Abstract:
Solving partial differential equations (PDEs) using neural networks has become a central focus in scientific machine learning. Training neural networks for sharp interface problems is particularly challenging due to certain parameters in the PDEs that introduce near-singularities in the loss function. In this study, we overcome this challenge by introducing a novel method based on homotopy dynamics to effectively manipulate these parameters. From a theoretical perspective, we analyze the effects of these parameters on training difficulty in sharp interface problems and establish the convergence of the proposed homotopy dynamics method. Experimentally, we demonstrate that our approach significantly accelerates convergence and improves the accuracy of capturing sharp interfaces. These findings present an efficient optimization strategy leveraging homotopy dynamics, offering a robust framework to extend the applicability of neural networks for solving PDEs with sharp interfaces.
Paperid:2068
Authors:Rongzhen Wang · Yan Zhang · Chenyu Zheng · Chongxuan Li · Guoqiang Wu
Abstract:
The success of large generative models has driven a paradigm shift, leveraging massive multi-source data to enhance model capabilities. However, the interaction among these sources remains theoretically underexplored. This paper provides a first attempt to fill this gap by rigorously analyzing multi-source training in conditional generative modeling, where each condition represents a distinct data source. Specifically, we establish a general distribution estimation error bound in average total variation distance for conditional maximum likelihood estimation based on the bracketing number. Our result shows that when source distributions share certain similarities and the model is expressive enough, multi-source training guarantees a sharper bound than single-source training. We further instantiate the general theory on conditional Gaussian estimation and deep generative models including autoregressive and flexible energy-based models, by characterizing their bracketing numbers. The results highlight that the number of sources and similarity among source distributions improve the advantage of multi-source training. Simulations and real-world experiments validate our theory.
Paperid:2069
Authors:Shuai Yi · Yixiong Zou · Yuhua Li · Ruixuan Li
Abstract:
Vision Transformer (ViT) has achieved remarkable success due to its large-scale pretraining on general domains, but it still faces challenges when applied to distant downstream domains that have only scarce training data, which gives rise to the Cross-Domain Few-Shot Learning (CDFSL) task. Inspired by Self-Attention's insensitivity to token orders, we find an interesting phenomenon neglected in current works: disrupting the continuity of image tokens (i.e., making pixels not smoothly transited across patches) in ViT leads to a noticeable performance decline in the general (source) domain but only a marginal decrease in downstream target domains. This questions the role of image tokens' continuity in ViT's generalization under large domain gaps. In this paper, we delve into this phenomenon for an interpretation. We find continuity aids ViT in learning larger spatial patterns, which are harder to transfer than smaller ones, enlarging domain distances. Meanwhile, it implies only smaller patterns within each patch could be transferred under extreme domain gaps. Based on this interpretation, we further propose a simple yet effective method for CDFSL that better disrupts the continuity of image tokens, encouraging the model to rely less on large patterns and more on smaller ones. Extensive experiments show the effectiveness of our method in reducing domain gaps and outperforming state-of-the-art works. Our codes will be released.
Paperid:2070
Authors:Christophe Vauthier · Anna Korba · Quentin Mérigot
Abstract:
In this paper, we investigate the properties of the Sliced Wasserstein Distance (SW) when employed as an objective functional. The SW metric has gained significant interest in the optimal transport and machine learning literature, due to its ability to capture intricate geometric properties of probability distributions while remaining computationally tractable, making it a valuable tool for various applications, including generative modeling and domain adaptation. Our study aims to provide a rigorous analysis of the critical points arising from the optimization of the SW objective. By computing explicit perturbations, we establish that stable critical points of SW cannot concentrate on segments. This stability analysis is crucial for understanding the behaviour of optimization algorithms for models trained using the SW objective. Furthermore, we investigate the properties of the SW objective, shedding light on the existence and convergence behavior of critical points. We illustrate our theoretical results through numerical experiments.
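The computational tractability the abstract mentions comes from SW's Monte Carlo form: project both distributions onto random directions and compare sorted one-dimensional projections. This is the standard estimator for equal-size point clouds, not the paper's analysis machinery; the function name and defaults are illustrative.

```python
import numpy as np

def sliced_wasserstein2(X, Y, n_proj=500, seed=0):
    # Monte Carlo estimate of SW_2 between two equal-size point clouds in R^d:
    # for each random direction on the sphere, the 1-D Wasserstein-2 distance
    # is computed by matching sorted projections.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)
        px, py = np.sort(X @ theta), np.sort(Y @ theta)
        total += np.mean((px - py) ** 2)
    return np.sqrt(total / n_proj)
```

The estimator is zero on identical clouds and grows with any rigid translation, which is the kind of perturbation behaviour the paper's critical-point analysis studies in far greater generality.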
Paperid:2071
Authors:Zhengyi Li · Yue Guan · Kang Yang · Yu Feng · Ning Liu · Yu Yu · Jingwen Leng · Minyi Guo
Abstract:
The wide deployment of the generative pretrained transformer (GPT) has raised privacy concerns for both clients and servers. While cryptographic primitives can be employed for secure GPT inference to protect the privacy of both parties, they introduce considerable performance overhead. To accelerate secure inference, this study proposes a public decoding and secure verification approach that utilizes public GPT models, motivated by the observation that securely decoding one token and securely decoding multiple tokens take similar latency. The client uses the public model to generate a set of tokens, which are then securely verified by the private model for acceptance. The efficiency of our approach depends on the acceptance ratio of tokens proposed by the public model, which we improve from two aspects: (1) a private sampling protocol optimized for cryptographic primitives and (2) model alignment using knowledge distillation. Our approach improves the efficiency of secure decoding while maintaining the same level of privacy and generation quality as standard secure decoding. Experiments demonstrate a $2.1\times \sim 6.0\times$ speedup compared to standard decoding across three pairs of public-private models and different network conditions.
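The draft-then-verify loop described here mirrors speculative sampling: a cheap (public) model proposes tokens and the expensive (private) model accepts or rejects them. The sketch below shows the standard plaintext acceptance rule only; the paper runs the verification under cryptographic protocols, and the function name and dict-based distributions are illustrative assumptions.

```python
import random

def verify_draft(tokens, draft_probs, target_probs, rng=random.Random(0)):
    # Standard speculative-sampling acceptance rule: accept draft token t with
    # probability min(1, p_target(t) / p_draft(t)); stop at the first rejection.
    # (A full implementation would then resample the rejected position from the
    # residual target distribution.)
    accepted = []
    for t, q, p in zip(tokens, draft_probs, target_probs):
        if rng.random() < min(1.0, p[t] / q[t]):
            accepted.append(t)
        else:
            break
    return accepted
```

When the public and private models agree exactly, every draft token is accepted, which is why the paper's distillation-based alignment directly improves the acceptance ratio and hence the speedup.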
Paperid:2072
Authors:Côme Fiegel · Pierre Menard · Tadashi Kozuno · Michal Valko · Vianney Perchet
Abstract:
We study the problem of learning in zero-sum matrix games with repeated play and bandit feedback. Specifically, we focus on developing uncoupled algorithms that guarantee, without communication between players, convergence of the last-iterate to a Nash equilibrium. Although the non-bandit case has been studied extensively, this setting has only been explored recently, with a bound of $\mathcal{O}(T^{-1/8})$ on the exploitability gap. We show that, for uncoupled algorithms, guaranteeing convergence of the policy profiles to a Nash equilibrium is detrimental to performance, with the best attainable rate being $\mathcal{O}(T^{-1/4})$ in contrast to the usual $\mathcal{O}(T^{-1/2})$ rate for convergence of the average iterates. We then propose two algorithms that achieve this optimal rate. The first algorithm leverages a straightforward tradeoff between exploration and exploitation, while the second employs a regularization technique based on a two-step mirror descent approach.
Paperid:2073
Authors:Wonyoung Kim · Sungwoo PARK · Garud Iyengar · Assaf Zeevi · Min-hwan Oh
Abstract:
We introduce a novel linear bandit problem with partially observable features, resulting in partial reward information and spurious estimates. Without properly addressing the latent part, the regret can grow linearly in the decision horizon $T$, as the latent features' influence on rewards is unknown. To tackle this, we propose a novel analysis to handle the latent features and an algorithm that achieves sublinear regret. The core of our algorithm involves (i) augmenting basis vectors orthogonal to the observed feature space, and (ii) introducing an efficient doubly robust estimator. Our approach achieves a regret bound of $\tilde{O}(\sqrt{(d + d_h)T})$, where $d$ is the dimension of observed features, and $d_h$ is the \emph{unknown} dimension of the subspace of the unobserved features. Notably, our algorithm requires no prior knowledge of the unobserved feature space, which may expand as more features become hidden. Numerical experiments confirm that our algorithm outperforms both non-contextual multi-armed bandits and linear bandit algorithms that depend solely on observed features.
Paperid:2074
Authors:Lincan Cai · Jingxuan Kang · Shuang Li · Wenxuan Ma · Binhui Xie · Zhida Qin · Jian Liang
Abstract:
Pretrained vision-language models (VLMs), e.g., CLIP, demonstrate impressive zero-shot capabilities on downstream tasks. Prior research highlights the crucial role of visual augmentation techniques, like random cropping, in alignment with fine-grained class descriptions generated by large language models (LLMs), significantly enhancing zero-shot performance by incorporating multi-view information. However, the inherent randomness of these augmentations can inevitably introduce background artifacts and cause models to overly focus on local details, compromising global semantic understanding. To address these issues, we propose an Attention-Based Selection (ABS) method from local details to global context, which applies attention-guided cropping in both raw images and feature space, supplementing global semantic information through strategic feature selection. Additionally, we introduce a soft matching technique to effectively filter LLM descriptions for better alignment. ABS achieves state-of-the-art performance on out-of-distribution generalization and zero-shot classification tasks. Notably, ABS is training-free and even rivals few-shot and test-time adaptation methods.
Paperid:2075
Authors:Anuran Makur · Japneet Singh
Abstract:
In this work, we develop a hypothesis testing framework to determine whether pairwise comparison data is generated by an underlying *generalized Thurstone model* $\mathcal{T}_F$ for a given choice function $F$. While prior work has predominantly focused on parameter estimation and uncertainty quantification for such models, we address the fundamental problem of hypothesis testing of $\mathcal{T}_F$ models. We introduce a minimax separation distance to quantify the gap between general pairwise comparison models and the class of $\mathcal{T}_F$ models and derive upper and lower bounds on the critical threshold that depend on the topology of the observation graph. For the special case of a complete observation graph, this threshold scales as $\Theta((nk)^{1/2})$, where $n$ is the number of agents and $k$ is the number of comparisons per pair. Furthermore, we propose a hypothesis test based on our separation distance, construct confidence intervals, and establish time-uniform bounds on type I and II errors using reverse martingale techniques. Additionally, we derive minimax lower bounds via information-theoretic methods and validate our results through experiments on synthetic and real-world datasets.
Paperid:2076
Authors:Taoyang Qin · Ke-Jia CHEN · Zheng Liu
Abstract:
Spectral Graph Neural Networks process graph signals using the spectral properties of the normalized graph Laplacian matrix. However, the frequent occurrence of repeated eigenvalues limits the expressiveness of spectral GNNs. To address this, we propose a higher-dimensional sheaf Laplacian matrix, which not only encodes the graph's topological information but also increases the upper bound on the number of distinct eigenvalues. The sheaf Laplacian matrix is derived from carefully designed perturbations of the block form of the normalized graph Laplacian, yielding a perturbed sheaf Laplacian (PSL) matrix with more distinct eigenvalues. We provide a theoretical analysis of the expressiveness of spectral GNNs equipped with the PSL and establish perturbation bounds for the eigenvalues. Extensive experiments on benchmark datasets for node classification demonstrate that incorporating the perturbed sheaf Laplacian enhances the performance of spectral GNNs.
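The repeated-eigenvalue limitation is easy to exhibit numerically: on the complete graph $K_4$, the normalized Laplacian has only two distinct eigenvalues, so any polynomial spectral filter can realize at most two distinct frequency responses. This demo motivates the problem only; it is not the paper's sheaf-Laplacian construction.

```python
import numpy as np

def normalized_laplacian(A):
    # L = I - D^{-1/2} A D^{-1/2} for an adjacency matrix A with no isolated nodes.
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    return np.eye(len(A)) - d_inv_sqrt @ A @ d_inv_sqrt

# Complete graph K4: eigenvalue 4/3 repeats three times alongside the trivial 0,
# so the spectrum collapses to just two distinct values.
A = np.ones((4, 4)) - np.eye(4)
eigs = np.linalg.eigvalsh(normalized_laplacian(A))
n_distinct = len(np.unique(np.round(eigs, 6)))
```

The PSL construction in the paper perturbs a block-lifted version of this matrix precisely to break such multiplicities and raise the number of distinct eigenvalues.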
Paperid:2077
Authors:Zichen Liu · Xu Zou · Gang Hua · Jiahuan Zhou
Abstract:
Visual prompting techniques are widely used to efficiently fine-tune pretrained Vision Transformers (ViTs) by learning a small set of shared prompts for all tokens. However, existing methods overlook the unique roles of different tokens in conveying discriminative information and interact with all tokens using the same prompts, thereby limiting the representational capacity of ViT. This often leads to indistinguishable and biased prompt-extracted features, hindering performance. To address this issue, we propose a plug-and-play Token Coordinated Prompt Attention (TCPA) module, which assigns specific coordinated prompts to different tokens for attention-based interactions. Firstly, recognizing the distinct functions of CLS and image tokens—global information aggregation and local feature extraction—we disentangle the prompts into CLS Prompts and Image Prompts, which interact exclusively with CLS tokens and image tokens through attention mechanisms. This enhances their respective discriminative abilities. Furthermore, as different image tokens correspond to distinct image patches and contain diverse information, we employ a matching function to automatically assign coordinated prompts to individual tokens. This enables more precise and targeted attention interactions, significantly improving the diversity and representational capacity of the extracted features. Extensive experiments across various benchmarks and state-of-the-art methods demonstrate that TCPA significantly enhances the diversity and discriminative power of the extracted features. The code will be released publicly.
Paperid:2078
Authors:Alireza Amiribavandpour · Xinting Huang · Mark Rofin · Michael Hahn
Abstract:
Chain-of-thought reasoning and scratchpads have emerged as critical tools for enhancing the computational capabilities of transformers. While theoretical results show that polynomial-length scratchpads can extend transformers' expressivity from $TC^0$ to $PTIME$, their required length remains poorly understood. Empirical evidence suggests that transformers need scratchpads even for many problems in $TC^0$, such as Parity or Multiplication, challenging optimistic bounds derived from circuit complexity. In this work, we initiate the study of systematic lower bounds for the number of CoT steps across different algorithmic problems, in the hard-attention regime. We study a variety of algorithmic problems, and provide bounds that are tight up to logarithmic factors. Overall, these results contribute to emerging understanding of the power and limitations of chain-of-thought reasoning.
Paperid:2079
Authors:Kunda Yan · Min Zhang · Sen Cui · Qu Zikun · Bo Jiang · Feng Liu · Changshui Zhang
Abstract:
Model merging aims to integrate the strengths of multiple fine-tuned models into a unified model while preserving task-specific capabilities. Existing methods, represented by task arithmetic, are typically classified into global- and local-aware methods. However, global-aware methods inevitably cause parameter interference, while local-aware methods struggle to maintain the effectiveness of task-specific details in the merged model. To address these limitations, we propose a Consensus-Aware Localized Merging (CALM) method which incorporates localized information aligned with global task consensus, ensuring its effectiveness post-merging. CALM consists of three key components: (1) class-balanced entropy minimization sampling, providing a more flexible and reliable way to leverage unsupervised data; (2) an efficiency-aware framework, selecting a small set of tasks for sequential merging with high scalability; (3) a consensus-aware mask optimization, aligning localized binary masks with global task consensus and merging them conflict-free. Experiments demonstrate the superiority and robustness of our CALM, significantly outperforming existing baseline methods. Code is available at https://anonymous.4open.science/r/CALM.
Paperid:2080
Authors:Jing Huang · Junyi Tao · Thomas Icard · Diyi Yang · Christopher Potts
Abstract:
Does understanding model internals help us to predict the correctness of model outputs? In the present work, we provide a positive answer to this question. Through a diverse set of language modeling tasks—including symbol manipulation, knowledge retrieval, and instruction following—we show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior. Specifically, we propose two methods that leverage causal mechanisms for predicting correctness. Both outperform methods that rely on non-causal features, especially under distribution shifts, where such estimates are most crucial. Our work thus highlights a novel and significant application for internal causal analysis of language models.
Paperid:2081
Authors:Talor Abramovich · Meet Udeshi · Minghao Shao · Kilian Lieret · Haoran Xi · Kimberly Milner · Sofija Jancheska · John Yang · Carlos Jimenez · Farshad Khorrami · Prashanth Krishnamurthy · Brendan Dolan-Gavitt · Muhammad Shafique · Karthik Narasimhan · Ramesh Karri · Ofir Press
Abstract:
Although language model (LM) agents have demonstrated increased performance in multiple domains, including coding and web browsing, their success in cybersecurity has been limited. We present EnIGMA, an LM agent for autonomously solving Capture The Flag (CTF) challenges. We introduce new tools and interfaces to improve the agent's ability to find and exploit security vulnerabilities, focusing on interactive terminal programs. These novel Interactive Agent Tools enable LM agents, for the first time, to run interactive utilities, such as a debugger and a server connection tool, which are essential for solving these challenges. Empirical analysis on 390 CTF challenges across four benchmarks demonstrates that these new tools and interfaces substantially improve our agent's performance, achieving state-of-the-art results on NYU CTF, Intercode-CTF, and CyBench. Finally, we analyze data leakage, developing new methods to quantify it and identifying a new phenomenon we term soliloquizing, where the model self-generates hallucinated observations without interacting with the environment.
Paperid:2082
Authors:Weilun Feng · Chuanguang Yang · Haotong Qin · Xiangqi Li · Yu Wang · Zhulin An · Libo Huang · Boyu Diao · Zixiang Zhao · Yongjun Xu · Michele Magno
Abstract:
Diffusion transformers (DiT) have demonstrated exceptional performance in video generation. However, their large number of parameters and high computational complexity limit their deployment on edge devices. Quantization can reduce storage requirements and accelerate inference by lowering the bit-width of model parameters. Yet, existing quantization methods for image generation models do not generalize well to video generation tasks. We identify two primary challenges: the loss of information during quantization and the misalignment between optimization objectives and the unique requirements of video generation. To address these challenges, we present **Q-VDiT**, a quantization framework specifically designed for video DiT models. From the quantization perspective, we propose the *Token-aware Quantization Estimator* (TQE), which compensates for quantization errors in both the token and feature dimensions. From the optimization perspective, we introduce *Temporal Maintenance Distillation* (TMD), which preserves the spatiotemporal correlations between frames and enables the optimization of each frame with respect to the overall video context. Our W3A6 Q-VDiT achieves a scene consistency score of 23.40, setting a new benchmark and outperforming the current state-of-the-art quantization methods by **1.9$\times$**.
Paperid:2083
Authors:Manwen Liao · Yan Zhu · Weitian Zhang · Yuxiang Yang
Abstract:
Characterizing quantum states is essential for advancing many quantum technologies. Recently, deep neural networks have been applied to learn quantum states by generating implicit representations that map them into classical vectors. Despite their success in predicting properties of the states, these representations remain a black box, lacking insights into strategies for experimental reconstruction. In this work, we aim to open this black box by developing explicit representations of quantum states through the generation of surrogate preparation circuits for property estimation, using a reinforcement learning agent with a local fidelity reward function. Relying solely on measurement data from a few neighboring qubits, our agent accurately recovers properties of target states. Specifically, we design a quantum measurement feature aggregation block which is used to extract global features of quantum states from local measurement data. We also theoretically analyze the global fidelity the agent can achieve when it learns a good local approximation. Extensive experiments demonstrate the effectiveness of our framework in learning various quantum states of up to 100 qubits, including those generated by shallow Instantaneous Quantum Polynomial circuits, evolved by Ising Hamiltonians, and many-body ground states. The learned circuit representations can be further applied to Hamiltonian learning as a downstream task, using a simple linear model.
Paperid:2084
Authors:Yu Wang · Dmitry Krotov · Yuanzhe Hu · Yifan Gao · Wangchunshu Zhou · Julian McAuley · Dan Gutfreund · Rogerio Feris · Zexue He
Abstract:
Equipping large language models (LLMs) with latent-space memory has attracted increasing attention as they can extend the context window of existing language models. However, retaining information from the distant past remains a challenge. For example, MemoryLLM (Wang et al., 2024a), as a representative work with latent-space memory, compresses past information into hidden states across all layers, forming a memory pool of 1B parameters. While effective for sequence lengths up to 16k tokens, it struggles to retain knowledge beyond 20k tokens. In this work, we address this limitation by introducing M+, a memory-augmented model based on MemoryLLM that significantly enhances long-term information retention. M+ integrates a long-term memory mechanism with a co-trained retriever, dynamically retrieving relevant information during text generation. We evaluate M+ on diverse benchmarks, including long-context understanding and knowledge retention tasks. Experimental results show that M+ significantly outperforms MemoryLLM and recent strong baselines, extending knowledge retention from under 20k to over 160k tokens with similar GPU memory overhead.
Paperid:2085
Authors:Yiming Sun · Xin Li · Pengfei Zhu · Qinghua Hu · Dongwei Ren · Huiying Xu · Xinzhong Zhu
Abstract:
Multimodal image fusion integrates information from different imaging modalities and is crucial for applications such as rescue and surveillance. However, multimodal images often suffer from degradations, including noise, blur, and haze in visible images, as well as stripe noise in infrared images, which degrade fusion quality and performance. To address these challenges, we propose the Task-Gated Multi-Expert Collaboration Network (TG-ECNet), a novel framework for degraded multimodal image fusion. The network incorporates a task-gated router, which consists of degradation-aware gating in the encoder and fusion-aware gating in the decoder. Degradation-aware gating dynamically selects processing pathways based on degradation types, while fusion-aware gating aggregates effective features from multiple modalities to achieve higher-quality fusion. Building on this task-gated router, we design a multi-expert collaborative framework that bridges image restoration and fusion tasks through a two-stage training strategy. This strategy balances the two tasks and minimizes interference during optimization by decoupling the learning processes, ultimately enabling all-in-one image restoration and fusion. Experimental results on multiple benchmark datasets for degraded image restoration and fusion demonstrate that TG-ECNet significantly improves fusion performance under various degradations, enhancing robustness for real-world applications.
Paperid:2086
Authors:Judy Hanwen Shen · Ellen Vitercik · Anders Wikum
Abstract:
The field of algorithms with predictions incorporates machine learning advice in the design of online algorithms to improve real-world performance. While this theoretical framework often assumes uniform reliability across all predictions, modern machine learning models can now provide instance-level uncertainty estimates. In this paper, we propose calibration as a principled and practical tool to bridge this gap, demonstrating the benefits of calibrated advice through two case studies: the ski rental and online job scheduling problems. For ski rental, we design an algorithm that achieves optimal prediction-dependent performance and prove that, in high-variance settings, calibrated advice offers more effective guidance than alternative methods for uncertainty quantification. For job scheduling, we demonstrate that using a calibrated predictor leads to significant performance improvements over existing methods. Evaluations on real-world data validate our theoretical findings, highlighting the practical impact of calibration for algorithms with predictions.
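For intuition about the ski-rental setting this abstract builds on, a prediction-guided strategy can be sketched in a few lines. This is not the paper's calibrated algorithm but the standard trust-parameterized scheme of Purohit et al. (2018); the function name and the `lam` parameter are our own labels.

```python
import math

def ski_rental_cost(buy_price: int, predicted_days: int, true_days: int,
                    lam: float = 0.5) -> int:
    """Cost of a prediction-guided ski-rental strategy.

    Renting costs 1 per day; buying costs `buy_price` once. The trust
    parameter `lam` in (0, 1] interpolates between fully following the
    prediction (lam -> 0) and the classic break-even rule (lam = 1).
    """
    if predicted_days >= buy_price:
        buy_day = math.ceil(lam * buy_price)      # prediction says long season: buy early
    else:
        buy_day = math.ceil(buy_price / lam)      # prediction says short season: delay buying
    if true_days >= buy_day:
        return (buy_day - 1) + buy_price          # rented until the purchase day, then bought
    return true_days                              # season ended while still renting
```

With a perfect long-season prediction and full trust the cost approaches the offline optimum `min(buy_price, true_days)`; a calibrated predictor, as studied in the paper, would additionally tell us how much to trust each instance-level prediction when choosing `lam`.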
Paperid:2087
Authors:Feiran Li · Qianqian Xu · Shilong Bao · Zhiyong Yang · Xiaochun Cao · Qingming Huang
Abstract:
Concept erasing has recently emerged as an effective paradigm to prevent text-to-image diffusion models from generating visually undesirable or even harmful content. However, current removal methods rely heavily on manually crafted text prompts, making it challenging to achieve high erasure performance (efficacy) while minimizing the impact on other benign concepts (usability), as illustrated in Fig. 1. In this paper, we attribute these limitations to the inherent gap between the text and image modalities, which makes it hard to transfer the intricately entangled concept knowledge from text prompts to the image generation process. To address this, we propose a novel solution that directly integrates visual supervision into the erasure process, introducing the first text-image Collaborative Concept Erasing (Co-Erasing) framework. Specifically, Co-Erasing describes the concept jointly by text prompts and the corresponding undesirable images induced by the prompts, and then reduces the generation probability of the target concept through negative guidance. This approach effectively bypasses the knowledge gap between text and image, significantly enhancing erasure efficacy. Additionally, we design a text-guided image concept refinement strategy that directs the model to focus on visual features most relevant to the specified text concept, minimizing disruption to other benign concepts. Finally, comprehensive experiments suggest that Co-Erasing significantly outperforms state-of-the-art erasure approaches with a better trade-off between efficacy and usability.
Paperid:2088
Authors:Kaixuan Huang · Jiacheng Guo · Zihao Li · Xiang Ji · Jiawei Ge · Wenzhe Li · Yingqing Guo · Tianle Cai · Hui Yuan · Runzhe Wang · Yue Wu · Ming Yin · Shange Tang · Yangsibo Huang · Chi Jin · Xinyun Chen · Chiyuan Zhang · Mengdi Wang
Abstract:
A genuinely robust reasoning model should be able to function correctly when the problem statement is modified out-of-training-distribution. Prior work has shown that language models struggle on mathematical benchmarks when questions undergo simple perturbations -- modifications that still preserve the underlying reasoning patterns of the solutions. However, no work has explored hard perturbations, which fundamentally change the nature of the problem so that the original solution steps do not apply. To bridge the gap, we construct MATH-P-Simple and MATH-P-Hard via simple perturbation and hard perturbation, respectively. Each consists of 279 perturbed math problems derived from level-5 (hardest) problems in the MATH dataset (Hendrycks et al., 2021). We observe significant performance drops on MATH-P-Hard across various models, including o1-mini (-16.49%) and gemini-2.0-flash-thinking (-12.9%). We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills without assessing their applicability to modified contexts. This issue is amplified when using original problems for in-context learning. We call for research efforts to address this challenge, which is critical for developing more robust and reliable reasoning models.
Paperid:2089
Authors:Saeed Mahloujifar · Luca Melis · Kamalika Chaudhuri
Abstract:
Empirical auditing has emerged as a means of catching some of the flaws in the implementation of privacy-preserving algorithms. Existing auditing mechanisms, however, are either computationally inefficient -- requiring multiple runs of the machine learning algorithms -- or suboptimal in calculating the empirical privacy. In this work, we present a tight and efficient auditing procedure and analysis that can effectively assess the privacy of mechanisms. Our approach is efficient: similar to the recent work of Steinke, Nasr and Jagielski (2023), our auditing procedure leverages the randomness of examples in the input dataset and requires only a single run of the target mechanism. It is also more accurate: we provide a novel analysis that enables us to achieve tight empirical privacy estimates by using the hypothesized $f$-DP curve of the mechanism, which provides a more accurate measure of privacy than the traditional $(\epsilon, \delta)$ differential privacy parameters. We use our auditing procedure and analysis to obtain empirical privacy estimates, demonstrating that our auditing procedure delivers tighter privacy estimates.
Paperid:2090
Authors:Yuqi Fan · Zhiyong Cui · Zhenning Li · Yilong Ren · Haiyang Yu
Abstract:
Reliable planning is crucial for achieving autonomous driving. Rule-based planners are efficient but lack generalization, while learning-based planners excel in generalization yet have limitations in real-time performance and interpretability. In long-tail scenarios, these challenges make planning particularly difficult. To leverage the strengths of both rule-based and learning-based planners, we propose the Scenario-Aware Hybrid Planner (SAH-Drive) for closed-loop vehicle trajectory planning. Inspired by human driving behavior, SAH-Drive combines a lightweight rule-based planner and a comprehensive learning-based planner, utilizing a dual-timescale decision neuron to determine the final trajectory. To enhance the computational efficiency and robustness of the hybrid planner, we also employ a diffusion proposal number regulator and trajectory fusion techniques. The experimental results show that the proposed method significantly improves the generalization capability of the planning system, achieving state-of-the-art performance in interPlan, while maintaining computational efficiency without incurring substantial additional runtime.
Paperid:2091
Authors:Yuxuan Song · Juntong Shi · Jingjing Gong · Minkai Xu · Stefano Ermon · Hao Zhou · Wei-Ying Ma
Abstract:
Though typically represented by discrete node and edge attributes, graph topological information can be sufficiently captured by the graph spectrum in a continuous space. It is believed that incorporating the continuity of graph topological information into the generative process design could establish a superior paradigm for graph generative modeling. Motivated by this prior and by recent advancements in the generative paradigm, we propose Graph Bayesian Flow Networks (GraphBFN) in this paper, a principled generative framework that designs an alternative generative process emphasizing the dynamics of topological information. Unlike recent discrete diffusion-based methods, GraphBFN employs the continuous counts derived from sampling infinitely many times from a categorical distribution as latents to facilitate a smooth decomposition of topological information, demonstrating enhanced effectiveness. To realize this concept effectively, we further develop an advanced sampling strategy and new time-scheduling techniques to overcome practical barriers and boost performance. Through extensive experimental validation on both generic graph and molecular graph generation tasks, GraphBFN consistently achieves superior or competitive performance with significantly higher training and sampling efficiency.
Paperid:2092
Authors:Xing Xi · Xing Fu · Weiqiang Wang · Ronghua Luo
Abstract:
Open World Object Detection (OWOD) requires the detector to continuously identify and learn new categories. Existing methods rely on a large language model (LLM) to describe the visual attributes of known categories and use these attributes to mark potential objects. The performance of such methods is influenced by the accuracy of the LLM descriptions, and selecting appropriate attributes during incremental learning remains a challenge. In this paper, we propose a novel OWOD framework, termed OW-VAP, which operates independently of LLMs and requires only minimal object descriptions to detect unknown objects. Specifically, we propose a Visual Attribute Parser (VAP) that parses the attributes of visual regions and assesses object potential based on the similarity between these attributes and the object descriptions. To enable the VAP to recognize objects in unlabeled areas, we exploit potential objects within background regions. Finally, we propose Probabilistic Soft Label Assignment (PSLA) to prevent optimization conflicts from misidentifying background as foreground. Comparative results on the OWOD benchmark demonstrate that our approach surpasses existing state-of-the-art methods with a +13 improvement in U-Recall and a +8 increase in U-AP for unknown detection. Furthermore, OW-VAP approaches the unknown recall upper limit of the detector.
Paperid:2093
Authors:Nuno Gonçalves · Marcos V. Treviso · Andre Martins
Abstract:
The computational cost of softmax-based attention in transformers limits their applicability to long-context tasks. Adaptive sparsity, of which $\alpha$-entmax attention is an example, offers a flexible data-dependent alternative, but existing implementations are inefficient and do not leverage the sparsity to obtain runtime and memory gains. In this work, we propose AdaSplash, which combines the efficiency of GPU-optimized algorithms with the sparsity benefits of $\alpha$-entmax. We first introduce a hybrid Halley-bisection algorithm, resulting in a 7-fold reduction in the number of iterations needed to compute the $\alpha$-entmax transformation. Then, we implement custom Triton kernels to efficiently handle adaptive sparsity. Experiments with RoBERTa and ModernBERT for text classification and single-vector retrieval, along with GPT-2 for language modeling, show that our method achieves substantial improvements in runtime and memory efficiency compared to existing $\alpha$-entmax implementations. It approaches -- and in some cases surpasses -- the efficiency of highly optimized softmax implementations like FlashAttention-2, enabling long-context training while maintaining strong task performance.
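The $\alpha$-entmax transform the abstract refers to can be computed by root-finding on its normalizing threshold. Below is a minimal NumPy sketch using plain bisection (the paper's Halley-bisection hybrid converges in roughly 7x fewer iterations); the function name and iteration budget are our own choices.

```python
import numpy as np

def entmax_bisect(z, alpha=1.5, n_iter=50):
    """alpha-entmax via bisection on the threshold tau.

    Solves sum_i [ (alpha-1) * z_i - tau ]_+^{1/(alpha-1)} = 1. The
    left-hand side is monotone decreasing in tau, so bisection on the
    bracket [max(s) - 1, max(s)] (where s = (alpha-1) z) converges.
    """
    s = (alpha - 1.0) * np.asarray(z, dtype=float)
    lo, hi = s.max() - 1.0, s.max()      # f(lo) >= 1 and f(hi) = 0
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        p = np.clip(s - tau, 0.0, None) ** (1.0 / (alpha - 1.0))
        if p.sum() >= 1.0:
            lo = tau                     # threshold still too small
        else:
            hi = tau
    p = np.clip(s - lo, 0.0, None) ** (1.0 / (alpha - 1.0))
    return p / p.sum()                   # normalize away residual bisection error
```

Unlike softmax, entries whose scores fall below the threshold come out exactly zero, which is the sparsity AdaSplash's Triton kernels exploit; $\alpha = 2$ recovers sparsemax and $\alpha \to 1$ recovers softmax.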
Paperid:2094
Authors:Marko Medvedev · Kaifeng Lyu · Dingli Yu · Sanjeev Arora · Zhiyuan Li · Nati Srebro
Abstract:
Weak-to-Strong Generalization (Burns et al., 2023) is the phenomenon whereby a strong student, say GPT-4, learns a task from a weak teacher, say GPT-2, and ends up significantly outperforming the teacher. We show that this phenomenon does not require a strong learner like GPT-4. We consider student and teacher that are random feature models, described by two-layer networks with a random and fixed bottom layer and trained top layer. A ‘weak’ teacher, with a small number of units (i.e. random features), is trained on the population, and a ‘strong’ student, with a much larger number of units (i.e. random features), is trained only on labels generated by the weak teacher. We demonstrate, prove and understand how the student can outperform the teacher, even though trained only on data labeled by the teacher. We also explain how such weak-to-strong generalization is enabled by early stopping. Importantly, we also show the quantitative limits of weak-to-strong generalization in this model.
Paperid:2095
Authors:Alex Clinton · Yiding Chen · Jerry Zhu · Kirthevasan Kandasamy
Abstract:
We study a collaborative learning problem where $m$ agents aim to estimate a vector $\mu =(\mu_1,\ldots,\mu_d)\in \mathbb{R}^d$ by sampling from associated univariate normal distributions $\{\mathcal{N}(\mu_k, \sigma^2)\}_{k\in[d]}$. Agent $i$ incurs a cost $c_{i,k}$ to sample from $\mathcal{N}(\mu_k, \sigma^2)$. Instead of working independently, agents can exchange data, collecting cheaper samples and sharing them in return for costly data, thereby reducing both costs and estimation error. We design a mechanism to facilitate such collaboration, while addressing two key challenges: ensuring *individually rational (IR) and fair outcomes* so all agents benefit, and *preventing strategic behavior* (e.g. non-collection, data fabrication) to avoid socially undesirable outcomes. We design a mechanism and an associated Nash equilibrium (NE) which minimizes the social penalty -- the sum of agents' estimation errors and collection costs -- while being IR for all agents. We achieve a $\mathcal{O}(\sqrt{m})$-approximation to the minimum social penalty in the worst case and an $\mathcal{O}(1)$-approximation under favorable conditions. Additionally, we establish three hardness results: no nontrivial mechanism *(i)* guarantees a dominant strategy equilibrium where agents report truthfully, *(ii)* is IR for every strategy profile of other agents, or *(iii)* avoids a worst-case $\Omega(\sqrt{m})$ price of stability in any NE. Finally, by integrating concepts from axiomatic bargaining, we demonstrate that our mechanism supports fairer outcomes than one which minimizes the social penalty.
Paperid:2096
Authors:Taneesh Gupta · Rahul Madhavan · Xuchao Zhang · Chetan Bansal · Saravanakumar Rajmohan
Abstract:
Multi-preference optimization enriches language-model alignment beyond pairwise preferences by contrasting entire sets of helpful and undesired responses, enabling richer training signals for large language models. During self-play alignment, these models often produce numerous candidate answers per query, making it computationally infeasible to include all of them in the training objective. We propose Active Multi-Preference Optimization (AMPO), which combines on-policy generation, a multi-preference group-contrastive loss, and active subset selection. Specifically, we score and embed large candidate pools of responses, then pick a small but informative subset -- covering reward extremes and distinct semantic clusters -- for preference optimization. The resulting contrastive-training scheme identifies not only the best and worst answers but also subtle, underexplored modes crucial for robust alignment. Theoretically, we provide guarantees of expected reward maximization using our active selection method. Empirically, AMPO achieves state-of-the-art results on AlpacaEval with Llama 8B.
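The subset-selection idea described above (keep the reward extremes plus one representative per semantic cluster) can be sketched concretely. This is an illustrative stand-in, not AMPO's actual selection rule: the clustering here is a hand-rolled k-means over response embeddings, and all names (`select_subset`, `k_clusters`) are our own.

```python
import numpy as np

def _kmeans(X, k, iters=20, seed=0):
    """Tiny Lloyd's k-means; returns a cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def select_subset(rewards, embeddings, k_clusters=3):
    """Pick a small informative subset of candidate responses:
    the best- and worst-rewarded answers, plus the top-rewarded
    member of each semantic cluster."""
    rewards = np.asarray(rewards, dtype=float)
    X = np.asarray(embeddings, dtype=float)
    chosen = {int(np.argmax(rewards)), int(np.argmin(rewards))}  # reward extremes
    labels = _kmeans(X, k_clusters)
    for j in range(k_clusters):
        members = np.where(labels == j)[0]
        if len(members):
            chosen.add(int(members[np.argmax(rewards[members])]))  # best of cluster
    return sorted(chosen)
```

The selected indices would then feed the group-contrastive preference loss, keeping the training objective small while still covering both reward extremes and distinct answer modes.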
Paperid:2097
Authors:Saqib Javed · Hieu Le · Mathieu Salzmann
Abstract:
A key challenge in Domain Generalization (DG) is preventing overfitting to source domains, which can be mitigated by finding flatter minima in the loss landscape. In this work, we propose Quantization-aware Training for Domain Generalization (QT-DoG) and demonstrate that weight quantization effectively leads to flatter minima in the loss landscape, thereby enhancing domain generalization. Unlike traditional quantization methods focused on model compression, QT-DoG exploits quantization as an implicit regularizer by inducing noise in model weights, guiding the optimization process toward flatter minima that are less sensitive to perturbations and overfitting. We provide both an analytical perspective and empirical evidence demonstrating that quantization inherently encourages flatter minima, leading to better generalization across domains. Moreover, with the benefit of reducing the model size through quantization, we demonstrate that an ensemble of multiple quantized models yields higher accuracy than the state-of-the-art DG approaches with no computational or memory overheads.
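The noise-injection mechanism the abstract appeals to is easy to see in a plain uniform quantizer: rounding perturbs each weight by at most half a quantization step. The sketch below is a generic symmetric weight quantizer for illustration, not the paper's training pipeline, and the function name is our own.

```python
import numpy as np

def quantize_weights(w, bits=8):
    """Uniform symmetric weight quantization.

    Rounding to the nearest grid point perturbs each weight by at most
    scale/2 -- bounded parameter noise, which is the implicit
    regularization effect QT-DoG attributes to quantization.
    """
    w = np.asarray(w, dtype=float)
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8 bits
    scale = np.abs(w).max() / qmax
    if scale == 0.0:
        return w.copy()
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale
```

In quantization-aware training this operation is typically applied in the forward pass with a straight-through gradient estimator, so the optimizer keeps seeing the rounded (perturbed) weights and is pushed toward regions where such perturbations barely change the loss, i.e. flatter minima.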
Paperid:2098
Authors:Yonathan Efroni · Ben Kretzu · Daniel Jiang · Jalaj Bhandari · Zheqing Zhu · Karen Ullrich
Abstract:
To date, the multi-objective optimization literature has mainly focused on conflicting objectives, studying the Pareto front, or requiring users to balance trade-offs. Yet, in machine learning practice, there are many scenarios where such conflict does not take place. Recent findings from multi-task learning, reinforcement learning, and LLM training show that diverse related tasks can enhance performance across objectives simultaneously. Despite this evidence, this phenomenon has not been examined from an optimization perspective. This leads to a lack of generic gradient-based methods that can scale to scenarios with a large number of related objectives. To address this gap, we introduce the Aligned Multi-Objective Optimization framework, propose new algorithms for this setting, and provide theoretical guarantees of their superior performance compared to naive approaches.
Paperid:2099
Authors:Renaud Gaucher · Aymeric Dieuleveut · Hadrien Hendrikx
Abstract:
In decentralized machine learning, different devices communicate in a peer-to-peer manner to collaboratively learn from each other's data. Such approaches are vulnerable to misbehaving (or Byzantine) devices. We introduce $\mathrm{F}\text{-}\mathrm{RG}$, a general framework for building robust decentralized algorithms with guarantees arising from robust-sum-like aggregation rules $\mathrm{F}$. We then investigate the notion of *breakdown point*, and show an upper bound on the number of adversaries that decentralized algorithms can tolerate. We introduce a practical robust aggregation rule, coined $\mathrm{CS_{ours}}$, such that $\mathrm{CS_{ours}\text{-}RG}$ has a near-optimal breakdown. Other choices of aggregation rules lead to existing algorithms such as $\rm ClippedGossip$ or $\rm NNA$. We give experimental evidence to validate the effectiveness of $\mathrm{CS_{ours}\text{-}RG}$ and highlight the gap with $\mathrm{NNA}$, in particular against a novel attack tailored to decentralized communications.
Paperid:2100
Authors:Junwei Su · Chuan Wu
Abstract:
This paper studies the interplay between learning algorithms and graph structure for graph neural networks (GNNs). Existing theoretical studies on the learning dynamics of GNNs primarily focus on the convergence rates of learning algorithms under the interpolation regime (noise-free) and offer only a crude connection between these dynamics and the actual graph structure (e.g., maximum degree). This paper aims to bridge this gap by investigating the excess risk (generalization performance) of learning algorithms in GNNs within the generalization regime (with noise). Specifically, we extend the conventional settings from the learning theory literature to the context of GNNs and examine how graph structure influences the performance of learning algorithms such as stochastic gradient descent (SGD) and Ridge regression. Our study makes several key contributions toward understanding the interplay between graph structure and learning in GNNs. First, we derive the excess risk profiles of SGD and Ridge regression in GNNs and connect these profiles to the graph structure through spectral graph theory. With this established framework, we further explore how different graph structures (regular vs. power-law) impact the performance of these algorithms through comparative analysis. Additionally, we extend our analysis to multi-layer linear GNNs, revealing an increasing non-isotropic effect on the excess risk profile, thereby offering new insights into the over-smoothing issue in GNNs from the perspective of learning algorithms. Our empirical results align with our theoretical predictions, \emph{collectively showcasing a coupling relation among graph structure, GNNs and learning algorithms, and providing insights on GNN algorithm design and selection in practice.}
Paperid:2101
Authors:Gizem Yüce · Aditya Vardhan Varre · Nicolas Flammarion
Abstract:
In this article, we explore the loss landscape of next-token prediction with transformers. Specifically, we focus on learning in-context n-gram language models with cross-entropy loss using a simplified two-layer transformer. We design a series of transformers that represent $k$-grams (for $k \leq n$) for which the gradient of the population loss approaches zero in the limit of infinite sequence length. This construction reveals a key property of the loss landscape: \emph{$k$-grams are stationary points of the population cross-entropy loss}, offering theoretical insights for widely observed empirical phenomena such as stage-wise learning dynamics and emergent phase transitions. These insights are further supported by comprehensive numerical experiments that illustrate the dynamics of learning $n$-grams, characterized by jumps between stationary points.
Paperid:2102
Authors:Edward Yeo · Yuxuan Tong · Xinyao Niu · Graham Neubig · Xiang Yue
Abstract:
Scaling inference compute has become a key driver of advanced reasoning in large language models (LLMs). A proven approach for scaling inference compute is to generate long chains-of-thought (CoTs), enabling models to engage in structured reasoning strategies such as backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the underlying mechanics of long CoT reasoning, examining the factors that enable models to generate extended reasoning trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we identify three key findings: 1) while SFT is not strictly necessary, it significantly simplifies training and improves efficiency; 2) reasoning capabilities tend to emerge with increased training compute but are not guaranteed, making reward shaping essential for stabilizing CoT length growth; and 3) scaling verifiable reward signals is critical for RL, and we find that leveraging noisy, web-extracted solutions with filtering mechanisms shows promising potential, particularly in out-of-distribution (OOD) reasoning tasks such as STEM problem-solving. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs.
Paperid:2103
Authors:Zonghao Chen · Masha Naslidnyk · Francois-Xavier Briol
Abstract:
This paper considers the challenging computational task of estimating nested expectations. Existing algorithms, such as nested Monte Carlo or multilevel Monte Carlo, are known to be consistent but require a large number of samples at both the inner and outer levels to converge. Instead, we propose a novel estimator consisting of nested kernel quadrature estimators, and we prove that it has a faster convergence rate than all baseline methods when the integrands have sufficient smoothness. We then demonstrate empirically that our proposed method does indeed require the fewest samples to estimate nested expectations across a range of real-world application areas, from Bayesian optimisation to option pricing and health economics.
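The baseline the abstract compares against, plain nested Monte Carlo, is a two-level average and can be written in a few lines. The sketch below is that generic baseline (the paper's method would replace both averages with weighted kernel quadrature rules); all names are our own.

```python
import random

def nested_mc(f, g, sample_theta, sample_x, n_outer, n_inner, seed=0):
    """Plain nested Monte Carlo estimate of E_theta[ f( E_X[ g(X, theta) ] ) ].

    The inner expectation is itself replaced by an average, so both
    n_outer and n_inner must grow for consistency -- the sample-hungry
    behaviour that nested kernel quadrature is designed to avoid.
    """
    rng = random.Random(seed)
    outer_sum = 0.0
    for _ in range(n_outer):
        theta = sample_theta(rng)
        inner = sum(g(sample_x(theta, rng), theta) for _ in range(n_inner)) / n_inner
        outer_sum += f(inner)
    return outer_sum / n_outer

# Toy check: theta ~ N(0, 1), X | theta ~ N(theta, 1), g(x, t) = x, f = square,
# so the nested expectation is E[theta^2] = 1.
est = nested_mc(
    f=lambda m: m * m,
    g=lambda x, t: x,
    sample_theta=lambda rng: rng.gauss(0.0, 1.0),
    sample_x=lambda t, rng: rng.gauss(t, 1.0),
    n_outer=2000, n_inner=200,
)
```

Note the bias: since `f` is applied to a noisy inner average, the toy estimate is inflated by roughly `1/n_inner`, which is why both sample sizes must grow jointly for the plain estimator to converge.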
Paperid:2104
Authors:Bingchen Miao · Yang Wu · Minghe Gao · Qifan Yu · Wendong Bu · Wenqiao Zhang · liyunfei · Siliang Tang · Tat-Seng Chua · Juncheng Li
Abstract:
The development of Generalist Virtual Agents (GVAs) powered by Multimodal Large Language Models (MLLMs) has shown significant promise in autonomous task execution. However, current training paradigms face critical limitations, including reliance on outcome supervision and labor-intensive human annotations. To address these challenges, we propose Similar, a Step-wise Multi-dimensional Generalist Reward Model, which offers fine-grained signals for agent training and can select better actions for inference-time scaling. Specifically, we begin by systematically defining five dimensions for evaluating agent actions. Building on this framework, we design an MCTS-P algorithm to automatically collect and annotate step-wise, five-dimensional agent execution data. Using this data, we train Similar with the Triple-M strategy. Furthermore, we introduce the first benchmark in the virtual agent domain for step-wise, multi-dimensional reward model training and evaluation, named SRM. This benchmark consists of two components: SRM-Train, which serves as the training set for Similar, and SRM-Eval, a manually selected test set for evaluating the reward model. Experimental results demonstrate that Similar, through its step-wise, multi-dimensional assessment and synergistic gain, provides GVAs with effective intermediate signals during both training and inference-time scaling. The code is available at https://anonymous.4open.science/r/Similar-F0F3.
Paperid:2105
Authors:Congcong Zhu · Xiaoyan Xu · Jiayue Han · Jingrun Chen
Abstract:
Autoregressive foundation models for partial differential equations (PDEs) have shown great potential in handling time-dependent data. However, these models suffer from a shortcut problem deeply rooted in autoregressive prediction, causing error accumulation. The challenge becomes particularly evident for out-of-distribution data, as pretraining performance may approach that of random model initialization for downstream tasks with long-term dynamics. To deal with this problem, we propose physics-informed temporal alignment (PITA), a self-supervised learning framework inspired by inverse problem solving. Specifically, PITA aligns the physical dynamics discovered at different time steps on each given PDE trajectory by integrating physics-informed constraints into the self-supervision signal. The alignment is derived from observation data without relying on known physics priors, indicating strong generalization to out-of-distribution data. Extensive experiments show that PITA significantly enhances the accuracy and robustness of existing foundation models on diverse time-dependent PDE data.
Paperid:2106
Authors:Emanuele Troiani · Hugo Cui · Yatin Dandi · FLORENT KRZAKALA · Lenka Zdeborova
Abstract:
In this manuscript, we study the learning of deep attention neural networks, defined as the composition of multiple self-attention layers with tied and low-rank weights. We first establish a mapping of such models to sequence multi-index models, a generalization of the widely studied multi-index model to sequential covariates, for which we establish a number of general results. In the context of Bayesian-optimal learning, in the limit of large dimension $D$ and commensurately large number of samples $N$, we derive a sharp asymptotic characterization of the optimal performance, as well as of the performance of the best-known polynomial-time algorithm for this setting, namely approximate message passing, and characterize sharp thresholds on the minimal sample complexity required for better-than-random prediction performance. Our analysis uncovers, in particular, how the different layers are learned sequentially. Finally, we discuss how this sequential learning can also be observed in a realistic setup.
Paperid:2107
Authors:Kaiwen Zheng · Yongxin Chen · Huayu Chen · Guande He · Ming-Yu Liu · Jun Zhu · Qinsheng Zhang
Abstract:
While likelihood-based generative models, particularly diffusion and autoregressive models, have achieved remarkable fidelity in visual generation, the maximum likelihood estimation (MLE) objective inherently suffers from a mode-covering tendency that limits generation quality under limited model capacity. In this work, we propose Direct Discriminative Optimization (DDO) as a unified framework that bridges likelihood-based generative training and the GAN objective to bypass this fundamental constraint. Our key insight is to parameterize a discriminator implicitly using the likelihood ratio between a learnable target model and a fixed reference model, drawing parallels with the philosophy of Direct Preference Optimization (DPO). Unlike GANs, this parameterization eliminates the need for joint training of generator and discriminator networks, allowing for direct, efficient, and effective fine-tuning of a well-trained model to its full potential beyond the limits of MLE. DDO can be performed iteratively in a self-play manner for progressive model refinement, with each round requiring less than 1\% of pretraining epochs. Our experiments demonstrate the effectiveness of DDO by significantly advancing the previous SOTA diffusion model EDM, reducing FID scores from 1.79/1.58 to new records of 1.30/0.99 on the CIFAR-10/ImageNet-64 datasets, and by consistently improving both guidance-free and CFG-enhanced FIDs of visual autoregressive models on ImageNet 256$\times$256.
Paperid:2108
Authors:Jacqueline Maasch · Alihan Hüyük · Xinnuo Xu · Aditya Nori · Javier Gonzalez
Abstract:
Causal reasoning and compositional reasoning are two core aspirations in generative AI. Measuring the extent of these behaviors requires principled evaluation methods. We explore a unified perspective that considers both behaviors simultaneously, termed compositional causal reasoning (CCR): the ability to infer how causal measures compose and, equivalently, how causal quantities propagate through graphs. We instantiate a framework for the systematic evaluation of CCR for the average treatment effect and the probability of necessity and sufficiency. As proof of concept, we demonstrate the design of CCR tasks for language models in the Llama, Phi, and GPT families. On a math word problem, CCR errors increased with the complexity of causal paths.
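As a toy illustration of the compositional idea (not the paper's evaluation framework): in a linear structural equation model along a chain X → M → Y, the average treatment effect of X on Y is the product of the edge-level effects, which can be checked numerically. The coefficients below are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
a, b = 2.0, -0.5                  # hypothetical edge coefficients
X = rng.normal(size=n)
M = a * X + rng.normal(size=n)    # X -> M
Y = b * M + rng.normal(size=n)    # M -> Y

def slope(u, v):
    # OLS slope of v on u; an unbiased estimate of the causal effect
    # here, since each structural equation has independent noise.
    return np.cov(u, v)[0, 1] / np.var(u)

effect_xm = slope(X, M)   # ≈ a
effect_my = slope(M, Y)   # ≈ b
effect_xy = slope(X, Y)   # ≈ a * b: the composed (total) effect
```

A CCR task in this spirit asks a model to infer `effect_xy` from the two edge effects, rather than estimating it directly.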
Paperid:2109
Authors:Junwei Deng · Weijing Tang · Jiaqi Ma
Abstract:
The influence function, a technique rooted in robust statistics, has been adapted in modern machine learning for a novel application: data attribution, which quantifies how individual training data points affect a model's predictions. However, the common derivation of influence functions in the data attribution literature is limited to loss functions that decompose into a sum of individual data point losses, the most prominent examples being M-estimators. This restricts the application of influence functions to more complex learning objectives, which we refer to as non-decomposable losses, such as contrastive or ranking losses, where a unit loss term depends on multiple data points and cannot be decomposed further. In this work, we bridge this gap by revisiting the general formulation of the influence function from robust statistics, which extends beyond M-estimators. Based on this formulation, we propose a novel method, the Versatile Influence Function (VIF), that can be straightforwardly applied to machine learning models trained with any non-decomposable loss. In comparison to the classical approach in statistics, the proposed VIF is designed to fully leverage the power of auto-differentiation, thereby eliminating the need for case-specific derivations of each loss function. We demonstrate the effectiveness of VIF across three examples: Cox regression for survival analysis, node embedding for network analysis, and listwise learning-to-rank for information retrieval. In all cases, the influence estimated by VIF closely resembles the results obtained by brute-force leave-one-out retraining, while being up to 1000 times faster to compute. We believe VIF represents a significant advancement in data attribution, enabling efficient influence-function-based attribution across a wide range of machine learning paradigms, with broad potential for practical use cases.
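The classical first-order influence approximation that this line of work builds on can be illustrated on ridge regression, where leave-one-out retraining is cheap enough to check directly. This is a minimal sketch of the textbook M-estimator case, not the paper's VIF estimator:

```python
import numpy as np

def fit_ridge(X, y, lam):
    # Minimizer of 0.5 * ||X w - y||^2 + 0.5 * lam * ||w||^2.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def influence_loo(X, y, lam, i):
    # First-order influence-function approximation to leaving point i out:
    # w_{-i} ≈ w + H^{-1} grad_i, with H the objective's Hessian and
    # grad_i the gradient of point i's loss term at the full-data solution.
    w = fit_ridge(X, y, lam)
    H = X.T @ X + lam * np.eye(X.shape[1])
    grad_i = X[i] * (X[i] @ w - y[i])
    return w + np.linalg.solve(H, grad_i)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
w_full = fit_ridge(X, y, 1.0)
w_if = influence_loo(X, y, 1.0, i=0)
w_exact = fit_ridge(np.delete(X, 0, axis=0), np.delete(y, 0), 1.0)
```

For this decomposable squared loss the approximation differs from exact retraining only by a leverage factor; the point of VIF is to extend the same Hessian-inverse-times-gradient recipe, via auto-differentiation, to losses that do not decompose per data point.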
Paperid:2110
Authors:Filip Ekström Kelvinius · Oskar Andersson · Abhijith Parackal · Dong Qian · Rickard Armiento · Fredrik Lindsten
Abstract:
Crystalline materials often exhibit a high level of symmetry. However, most generative models do not account for symmetry, but rather model each atom without any constraints on its position or element. We propose a generative model, Wyckoff Diffusion (WyckoffDiff), which generates symmetry-based descriptions of crystals. This is enabled by considering a crystal structure representation that encodes all symmetry, and we design a novel neural network architecture which enables using this representation inside a discrete generative model framework. In addition to respecting symmetry by construction, the discrete nature of our model enables fast generation. We additionally present a new metric, Fréchet Wrenformer Distance, which captures the symmetry aspects of the materials generated, and we benchmark WyckoffDiff against recently proposed generative models for crystal generation.
Paperid:2111
Authors:Filippo Bigi · Marcel Langer · Michele Ceriotti
Abstract:
The use of machine learning to estimate the energy of a group of atoms, and the forces that drive them to more stable configurations, has revolutionized the fields of computational chemistry and materials discovery. In this domain, rigorous enforcement of symmetry and conservation laws has traditionally been considered essential. For this reason, interatomic forces are usually computed as the derivatives of the potential energy, ensuring energy conservation. Several recent works have questioned this physically constrained approach, suggesting that directly predicting the forces yields a better trade-off between accuracy and computational efficiency, and that energy conservation can be learned during training. This work investigates the applicability of such non-conservative models in microscopic simulations. We identify and demonstrate several fundamental issues, from ill-defined convergence of geometry optimization to instability in various types of molecular dynamics. Contrary to the case of rotational symmetry, energy conservation is hard to learn, monitor, and correct for. The best approach to exploit the acceleration afforded by direct force prediction might be to use it in tandem with a conservative model, reducing, rather than eliminating, the additional cost of backpropagation while avoiding the pathological behavior associated with non-conservative forces.
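The conservative construction discussed above, forces as the negative gradient of a potential, can be illustrated on a toy pair potential. The potential below is an arbitrary illustration (not from the paper), and finite differences stand in for the backpropagation an ML potential would use:

```python
import numpy as np

def potential(x):
    # Toy 1-D chain: neighboring atoms prefer unit spacing.
    d = np.diff(x)
    return np.sum((d - 1.0) ** 2)

def conservative_forces(x, eps=1e-6):
    # F_i = -dE/dx_i via central finite differences; in an ML potential,
    # this derivative would come from backpropagation instead.
    f = np.zeros_like(x)
    for i in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[i] += eps
        xm[i] -= eps
        f[i] = -(potential(xp) - potential(xm)) / (2 * eps)
    return f

x = np.array([0.0, 0.9, 2.2, 3.0])
forces = conservative_forces(x)
```

Because these forces are an exact gradient of a translation-invariant energy, they sum to zero; a directly predicted force field offers no such guarantee, which is one source of the pathologies the paper documents.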
Paperid:2112
Authors:Zheng Li · Zeyu Liu · Feng Xie · Hao Zhang · Chunchen LIU · zhi geng
Abstract:
We tackle the problem of identifying whether a variable is the cause of a specified target using observational data. State-of-the-art causal learning algorithms that handle latent variables typically rely on identifying the global causal structure, often represented as a partial ancestral graph (PAG), to infer causal relationships. Although effective, these approaches are often redundant and computationally expensive when the focus is limited to a specific causal relationship. In this work, we introduce novel local characterizations that are necessary and sufficient for various types of causal relationships between two variables, enabling us to bypass the need for global structure learning. Leveraging these local insights, we develop efficient and fully localized algorithms that accurately identify causal relationships from observational data. We theoretically demonstrate the soundness and completeness of our approach. Extensive experiments on benchmark networks and real-world datasets further validate the effectiveness and efficiency of our method.
Paperid:2113
Authors:Karish Grover · Geoff Gordon · Christos Faloutsos
Abstract:
Does the intrinsic curvature of complex networks hold the key to unveiling graph anomalies that conventional approaches overlook? Reconstruction-based graph anomaly detection (GAD) methods overlook such geometric outliers, focusing only on structural and attribute-level anomalies. To this end, we propose CurvGAD - a mixed-curvature graph autoencoder that introduces the notion of curvature-based geometric anomalies. CurvGAD introduces two parallel pipelines for enhanced anomaly interpretability: (1) Curvature-equivariant geometry reconstruction, which focuses exclusively on reconstructing the edge curvatures using a mixed-curvature, Riemannian encoder and Gaussian kernel-based decoder; and (2) Curvature-invariant structure and attribute reconstruction, which decouples structural and attribute anomalies from geometric irregularities by regularizing graph curvature under discrete Ollivier-Ricci flow, thereby isolating the non-geometric anomalies. By leveraging curvature, CurvGAD refines the existing anomaly classifications and identifies new curvature-driven anomalies. Extensive experimentation over 10 real-world datasets (both homophilic and heterophilic) demonstrates an improvement of up to 6.5% over state-of-the-art GAD methods.
Paperid:2114
Authors:Xixi Wu · Yifei Shen · Fangzhou Ge · Caihua Shan · Yizhu Jiao · Xiangguo Sun · Hong Cheng
Abstract:
Node classification is a fundamental task in graph analysis, with broad applications across various fields. Recent breakthroughs in Large Language Models (LLMs) have enabled LLM-based approaches for this task. Although many studies demonstrate the impressive performance of LLM-based methods, the lack of clear design guidelines may hinder their practical application. In this work, we aim to establish such guidelines through a fair and systematic comparison of these algorithms. As a first step, we developed LLMNodeBed, a comprehensive codebase and testbed for node classification using LLMs. It includes ten datasets, eight LLM-based algorithms, and three learning paradigms, and is designed for easy extension with new methods and datasets. Subsequently, we conducted extensive experiments, training and evaluating over 2,200 models, to determine the key settings (e.g., learning paradigms and homophily) and components (e.g., model size) that affect performance. Our findings uncover eight insights, e.g., (1) LLM-based methods can significantly outperform traditional methods in a semi-supervised setting, while the advantage is marginal in a supervised setting; (2) Graph Foundation Models can beat open-source LLMs but still fall short of strong LLMs like GPT-4o in a zero-shot setting. We hope that the release of LLMNodeBed, along with our insights, will facilitate reproducible research and inspire future studies in this field.
Paperid:2115
Authors:Rachel Cummings · Alessandro Epasto · Jieming Mao · Tamalika Mukherjee · Tingting Ou · Peilin Zhong
Abstract:
The \emph{turnstile} continual release model of differential privacy captures scenarios where a privacy-preserving real-time analysis is sought for a dataset evolving through additions and deletions. In typical applications of real-time data analysis, both the length of the stream $T$ and the size of the universe $|\mathcal{U}|$ from which data come can be extremely large. This motivates the study of private algorithms in the turnstile setting using space sublinear in both $T$ and $|\mathcal{U}|$. In this paper, we give the first sublinear-space differentially private algorithm for the fundamental problem of counting distinct elements in the turnstile streaming model. Our algorithm achieves, on arbitrary streams, $\tilde{O}_{\eta}(T^{1/3})$ space and additive error, and a $(1+\eta)$-relative approximation for all $\eta \in (0,1)$. Our result significantly improves upon the space requirements of the state-of-the-art algorithms for this problem, which are linear, approaching the known $\Omega(T^{1/4})$ additive error lower bound for arbitrary streams. Moreover, when a bound $W$ on the number of times an item appears in the stream is known, our algorithm provides $\tilde{O}_{\eta}(\sqrt{W})$ additive error, using $\tilde{O}_{\eta}(\sqrt{W})$ space. This additive error asymptotically matches that of prior work, which instead required linear space. Our results address an open question posed by Jain et al. about designing low-memory mechanisms for this problem. We complement these results with a space lower bound for this problem, which shows that any algorithm using similar techniques must use space $\tilde{\Omega}(T^{1/3})$.
Paperid:2116
Authors:Haojian Huang · Haodong Chen · Shengqiong Wu · Meng Luo · Jinlan Fu · Xinya Du · Hanwang Zhang · Hao Fei
Abstract:
Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding but often suffer from misalignment with human intuition and video hallucination issues. To address these challenges, we introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens. Given the lack of datasets for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Extensive experiments on benchmarks such as Video Hallucination, Video QA, and Captioning performance tasks demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination.
Paperid:2117
Authors:Kaheon Kim · Rentian Yao · Changbo Zhu · Xiaohui Chen
Abstract:
The optimal transport barycenter (a.k.a. Wasserstein barycenter) is a fundamental notion of averaging that extends from the Euclidean space to the Wasserstein space of probability distributions. Computation of the *unregularized* barycenter for discretized probability distributions on point clouds is a challenging task when the domain dimension $d > 1$. Most practical algorithms for approximating the barycenter problem are based on entropic regularization. In this paper, we introduce a nearly linear time $O(m \log{m})$ and linear space complexity $O(m)$ primal-dual algorithm, the *Wasserstein-Descent $\dot{\mathbb{H}}^1$-Ascent* (WDHA) algorithm, for computing the *exact* barycenter when the input probability density functions are discretized on an $m$-point grid. The key success of the WDHA algorithm hinges on alternating between two different yet closely related Wasserstein and Sobolev optimization geometries for the primal barycenter and dual Kantorovich potential subproblems. Under reasonable assumptions, we establish the convergence rate and iteration complexity of WDHA to its stationary point when the step size is appropriately chosen. Superior computational efficacy, scalability, and accuracy over the existing Sinkhorn-type algorithms are demonstrated on high-resolution (e.g., $1024 \times 1024$ images) 2D synthetic and real data.
Paperid:2118
Authors:Xiaoshuai Hao · Lingyu Liu · Yunfeng Diao · Rong Yin · Pengwei Wang · Jing Zhang · Lingdong Kong · Shu Zhao
Abstract:
Robust high-definition (HD) map construction is vital for autonomous driving, yet existing methods often struggle with incomplete multi-view camera data. This paper presents SafeMap, a novel framework specifically designed to ensure accuracy even when certain camera views are missing. SafeMap integrates two key components: the Gaussian-based Perspective View Reconstruction (G-PVR) module and the Distillation-based Bird's-Eye-View (BEV) Correction (D-BEVC) module. G-PVR leverages prior knowledge of view importance to dynamically prioritize the most informative regions based on the relationships among available camera views. Furthermore, D-BEVC utilizes panoramic BEV features to correct the BEV representations derived from incomplete observations. Together, these components facilitate comprehensive data reconstruction and robust HD map generation. SafeMap is easy to implement and integrates seamlessly into existing systems, offering a plug-and-play solution for enhanced robustness. Experimental results demonstrate that SafeMap significantly outperforms previous methods in both complete and incomplete scenarios, highlighting its superior performance and resilience. The source code will be released.
Paperid:2119
Authors:Belle Liu · Jacob I Sacks · Matthew Golub
Abstract:
Advances in neural recording technology now enable simultaneous recording of neural population activity across multiple brain regions, promising new insights into brain-wide information processing. These experiments have prompted the development of data-driven "communication models" that infer the signals communicated between recorded brain regions. Accurately identifying such communication is difficult because these signals are not directly observed in multi-region recordings, and typically, many different models can explain the observed data through differing accounts of communication. This challenge is amplified in expressive models, which are often needed to capture the nonlinear and non-stationary dynamics of neural population activity. To address these issues, we propose MR-LFADS, an expressive communication model that improves system identification by inferring region-specific inputs and cross-region communication under appropriate constraints. We show that MR-LFADS outperforms existing approaches at identifying communication signals in dozens of simulations of multi-region networks trained to perform a diverse set of cognitive neuroscience tasks. By advancing the analysis of multi-region neural recordings, MR-LFADS could provide a powerful tool for uncovering the principles of brain-wide information processing.
Paperid:2120
Authors:Cristiana Diaconu · Matthew Ashman · Eric Langezaal · Adrian Weller · Richard E Turner
Abstract:
Effective modelling of large-scale spatio-temporal datasets is essential for many domains, yet existing approaches often impose rigid constraints on the input data, such as requiring them to lie on fixed-resolution grids. With the rise of foundation models, the ability to process diverse, heterogeneous data structures is becoming increasingly important. Neural processes (NPs), particularly transformer neural processes (TNPs), offer a promising framework for such tasks, but struggle to scale to large spatio-temporal datasets due to the lack of an efficient attention mechanism. To address this, we introduce gridded pseudo-token TNPs, which employ specialised encoders and decoders to handle unstructured data and utilise a processor comprising gridded pseudo-tokens with efficient attention mechanisms. Furthermore, we develop equivariant gridded TNPs for applications where exact or approximate translation equivariance is a useful inductive bias, improving accuracy and training efficiency. Our method consistently outperforms a range of strong baselines in various synthetic and real-world regression tasks involving large-scale data, while maintaining competitive computational efficiency. Experiments with weather data highlight the potential of gridded TNPs and serve as just one example of a domain where they can have a significant impact.
Paperid:2121
Authors:Xinyi Wan · Penghui Qi · Guangxing Huang · Jialin Li · Min Lin
Abstract:
Pipeline parallelism (PP) is widely used for training large language models (LLMs), yet its scalability is often constrained by high activation memory consumption, as the number of in-flight micro-batches grows with the degree of PP. In this paper, we focus on addressing this challenge by leveraging the under-explored memory offload strategy in PP. Through empirical study, we discover that in the majority of standard configurations, at least half, and potentially all, of the activations can be offloaded with negligible overhead. In cases where full offload is not possible, we introduce a novel selective offload strategy that decreases peak activation memory in a better-than-linear manner. Furthermore, we integrate memory offload with other techniques to jointly consider overall throughput and memory limitations. Our experiments prove that the per-device activation memory effectively decreases with the total number of stages, making PP a stronger alternative than TP, offering up to a 19\% acceleration with even lower memory consumption.
Paperid:2122
Authors:Samidha Verma · Arushi Goyal · Ananya Mathur · Ankit Anand · Sayan Ranu
Abstract:
Graph Edit Distance (GED) is a widely used metric for measuring similarity between two graphs. Computing the optimal GED is NP-hard, leading to the development of various neural and non-neural heuristics. While neural methods have achieved improved approximation quality compared to non-neural approaches, they face significant challenges: (1) They require large amounts of ground truth data, which is itself NP-hard to compute. (2) They operate as black boxes, offering limited interpretability. (3) They lack cross-domain generalization, necessitating expensive retraining for each new dataset. We address these limitations with GRAIL, introducing a paradigm shift in this domain. Instead of training a neural model to predict GED, GRAIL employs a novel combination of large language models (LLMs) and automated prompt tuning to generate a program that is used to compute GED. This shift from predicting GED to generating programs imparts various advantages, including end-to-end interpretability and an autonomous self-evolutionary learning mechanism without ground-truth supervision. Extensive experiments on seven datasets confirm that GRAIL not only surpasses state-of-the-art GED approximation methods in prediction quality but also achieves robust cross-domain generalization across diverse graph distributions.
Paperid:2123
Authors:Yujin Kim · Hyunsoo Kim · Hyunwoo Kim · Suhyun Kim
Abstract:
Open-source pre-trained models hold great potential for diverse applications, but their utility declines without access to the training data. Data-Free Image Synthesis (DFIS) aims to generate images that approximate the learned data distribution of a pre-trained model in a data-free manner. However, existing methods usually produce samples that deviate significantly from the training data distribution due to the lack of prior knowledge about natural images. To overcome this limitation, we propose DDIS, the first Diffusion-assisted Data-free Image Synthesis method, which leverages a text-to-image diffusion model as a powerful image prior, improving synthetic image quality. DDIS extracts knowledge about the learned distribution from the given model and uses it to guide the diffusion model, enabling the generation of images that accurately align with the training data distribution. To achieve this, we introduce Domain Alignment Guidance (DAG), which aligns the synthetic data domain with the training data distribution during the diffusion sampling process. Additionally, we optimize a single Class Alignment Token (CAT) embedding to effectively capture class-specific attributes in the training dataset. Experiments on PACS and ImageNet demonstrate that DDIS outperforms existing DFIS methods by generating samples that better reflect the training data distribution, achieving state-of-the-art performance in data-free applications.
Paperid:2124
Authors:Chenyin Gao · Peter Gilbert · Larry Han
Abstract:
Standard conformal prediction ensures marginal coverage but consistently undercovers underrepresented groups, limiting its reliability for fair uncertainty quantification. Group fairness requires prediction sets to achieve a user-specified coverage level within each protected group. While group-wise conformal inference meets this requirement, it often produces excessively wide prediction sets due to limited sample sizes in underrepresented groups, highlighting a fundamental tradeoff between fairness and efficiency. To bridge this gap, we introduce Surrogate-Assisted Group-Clustered Conformal Inference (SAGCCI), a framework that improves efficiency through two key innovations: (1) clustering protected groups with similar conformal score distributions to enhance precision while maintaining fairness, and (2) deriving an efficient influence function that optimally integrates surrogate outcomes to construct tighter prediction sets. Theoretically, SAGCCI guarantees approximate group-conditional coverage in a doubly robust manner under mild convergence conditions, enabling flexible nuisance model estimation. Empirically, through simulations and an analysis of the phase 3 Moderna COVE COVID-19 vaccine trial, we demonstrate that SAGCCI outperforms existing methods, producing narrower prediction sets while maintaining valid group-conditional coverage, effectively balancing fairness and efficiency in uncertainty quantification.
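SAGCCI's clustering and surrogate machinery go beyond an abstract-level sketch, but the group-wise conformal baseline it improves upon is simple to state: each protected group gets its own calibration threshold. A minimal sketch under standard split-conformal assumptions (the score values below are arbitrary illustrations):

```python
import numpy as np

def group_thresholds(scores, groups, alpha=0.1):
    # Per-group split-conformal threshold: the conservative empirical
    # quantile at rank ceil((n_g + 1) * (1 - alpha)) of the calibration
    # nonconformity scores, guaranteeing >= 1 - alpha coverage per group.
    thresholds = {}
    for g in np.unique(groups):
        s = np.sort(scores[groups == g])
        n = len(s)
        k = int(np.ceil((n + 1) * (1 - alpha)))
        thresholds[g] = s[k - 1] if k <= n else np.inf
    return thresholds

scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0,  # group 0
                   0.15, 0.25, 0.35])                                  # group 1
groups = np.array([0] * 10 + [1] * 3)
th = group_thresholds(scores, groups, alpha=0.1)
```

With only 3 calibration points in group 1, the required rank exceeds the sample size and the threshold becomes infinite (an uninformative prediction set), which is precisely the small-group pathology that clustering similar groups is meant to alleviate.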
Paperid:2125
Authors:Chuang Liu · Hongyan Xu · Yichao Cao · Xiu Su · Zhe Qu · Tianfa Li · Shan An · Haogang Zhu
Abstract:
Medical imaging faces significant challenges in single-domain generalization (SDG) due to the diversity of imaging devices and the variability among data collection centers. To address these challenges, we propose \textbf{TinyMIG}, a framework designed to transfer generalization capabilities from vision foundation models to medical imaging SDG. TinyMIG aims to enable lightweight specialized models to mimic the strong generalization capabilities of foundation models in terms of both global feature distribution and local fine-grained details during training. Specifically, for global feature distribution, we propose a Global Distribution Consistency Learning strategy that mimics the prior distributions of the foundation model layer by layer. For local fine-grained details, we further design a Localized Representation Alignment method, which promotes semantic alignment and generalization distillation between the specialized model and the foundation model. These mechanisms collectively enable the specialized model to achieve robust performance in diverse medical imaging scenarios. Extensive experiments on large-scale benchmarks demonstrate that TinyMIG, with extremely low computational cost, significantly outperforms state-of-the-art models, showcasing its superior SDG capabilities. All the code and model weights will be publicly available.
Paperid:2126
Authors:Mohsen Dehghankar · Mahdi Erfanian · Abolfazl Asudeh
Abstract:
Despite their tremendous success and versatility, Deep Neural Networks (DNNs) such as Large Language Models (LLMs) suffer from inference inefficiency and rely on advanced computational infrastructure. To address these challenges and make these models more accessible and cost-effective, in this paper we propose algorithms to improve the inference time and memory efficiency of DNNs with binary and ternary weight matrices. Focusing particularly on matrix multiplication as the bottleneck operation of inference, we observe that, once trained, the weight matrices of a model no longer change. This allows us to preprocess these matrices and create indices that reduce the storage requirements by a logarithmic factor while enabling our efficient inference algorithms. Specifically, for an $n\times n$ weight matrix, our efficient algorithm guarantees a time complexity of $O(\frac{n^2}{\log n})$, a logarithmic-factor improvement over standard vector-matrix multiplication. Beyond the theoretical analysis, we conduct extensive experiments to evaluate the practical efficiency of our algorithms. Our results confirm the superiority of our approach with respect to both time and memory, as we observed a reduction in multiplication time of up to 29x and in memory usage of up to 6x. When applied to LLMs, our experiments show up to a 5.24x speedup in inference time.
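The abstract does not spell out its indexing scheme, but the flavor of such log-factor savings can be conveyed with a classic "mailman"-style trick for binary matrices (an illustrative sketch, not the paper's algorithm): within each block of $t$ columns, rows take at most $2^t$ distinct 0/1 patterns, so each pattern's contribution to the product can be computed once and shared by lookup.

```python
import numpy as np

def binary_matvec(W, x, t):
    # Multiply a 0/1 matrix W by a real vector x in blocks of t columns.
    # Per block: compute the dot product of every possible t-bit row
    # pattern with the matching slice of x (cost ~ 2^t * t), then assign
    # each row its result by table lookup (cost m). With t ~ log2(n),
    # this totals O(n^2 / log n) instead of O(n^2).
    m, n = W.shape
    y = np.zeros(m)
    for start in range(0, n, t):
        block = W[:, start:start + t]
        xs = x[start:start + t]
        k = block.shape[1]
        keys = block @ (1 << np.arange(k))            # row pattern -> integer
        patterns = (np.arange(2 ** k)[:, None] >> np.arange(k)) & 1
        y += (patterns @ xs)[keys]                    # shared partial sums
    return y

rng = np.random.default_rng(0)
W = rng.integers(0, 2, size=(6, 8))
x = rng.normal(size=8)
y = binary_matvec(W, x, t=3)
```

The integer `keys` array is the kind of precomputed index the abstract alludes to: it depends only on the frozen weights, so it can be built once after training and reused for every inference call.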
Paperid:2127
Authors:Prashanth Vijayaraghavan · Luyao Shi · Ehsan Degan · Vandana Mukherjee · Xin Zhang
Abstract:
Analog circuit topology synthesis is integral to Electronic Design Automation (EDA), enabling the automated creation of circuit structures tailored to specific design requirements. However, the vast design search space and strict constraint adherence make efficient synthesis challenging. Leveraging the versatility of Large Language Models (LLMs), we propose AUTOCIRCUIT-RL, a novel reinforcement learning (RL)-based framework for automated analog circuit synthesis. The framework operates in two phases: instruction tuning, where an LLM learns to generate circuit topologies from structured prompts encoding design constraints, and RL refinement, where reward models iteratively optimize circuit topologies by evaluating and adjusting designs to ensure adherence to constraints like validity, efficiency, and expected output voltage. Empirical results show that AUTOCIRCUIT-RL generates ~12% more valid circuits and improves efficiency by ~14% compared to the best baselines, while reducing duplicate generation rates by ~38%. It achieves over 60% success in synthesizing valid circuits with limited training data, demonstrating strong generalization. These findings highlight the framework's effectiveness in scaling to complex circuits while maintaining efficiency and constraint adherence, marking a significant advancement in AI-driven circuit design.
Paperid:2128
Authors:Jihao Andreas Lin · Sebastian Ament · Maximilian Balandat · David Eriksson · Jose Miguel Hernandez-Lobato · Eytan Bakshy
Abstract:
Applying Gaussian processes (GPs) to very large datasets remains a challenge due to limited computational scalability. Matrix structures such as the Kronecker product can accelerate operations significantly, but their application commonly entails approximations or unrealistic assumptions. For example, a kernel matrix associated with a product kernel can be expressed as an exact Kronecker product if the data is gridded. However, this exact structure is destroyed when observations are missing from the grid, a frequent occurrence in real-world applications such as time series. To lift this limitation, we propose a latent Kronecker structure, which expresses the covariance matrix of observed values as the projection of a latent Kronecker product. In combination with iterative linear system solvers, our method permits inference of the exact GP model while requiring substantially fewer computational resources than standard iterative methods. We demonstrate that our method outperforms state-of-the-art sparse and variational GPs on real-world datasets with up to five million examples, including robotics, AutoML, and climate applications.
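The "projection of a latent Kronecker product" can be sketched as a matrix-vector product that never forms the Kronecker product explicitly: scatter the observed values onto the full grid, apply the Kronecker factors via reshaping, and gather the observed entries back. This is a simplified NumPy illustration (the boolean `mask` standing in for the projection P is our own device, not the paper's implementation).

```python
import numpy as np

def kron_matvec(A, B, v):
    # (A kron B) @ v without forming the Kronecker product explicitly:
    # with row-major vec, (A kron B) vec(V) = vec(A V B^T)
    V = v.reshape(A.shape[0], B.shape[0])
    return (A @ V @ B.T).reshape(-1)

def projected_kron_matvec(A, B, mask, w):
    # covariance of observed entries acts as P (A kron B) P^T,
    # where P selects the grid points with mask == True
    v = np.zeros(mask.size)
    v[mask] = w                        # P^T w: scatter observations onto the grid
    return kron_matvec(A, B, v)[mask] # then gather the observed entries
```

Combined with an iterative solver such as conjugate gradients, such a matvec is all that is needed for exact GP inference on the observed entries.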
Paperid:2129
Authors:Alexander Bienstock · Ujjwal Kumar · Antigoni Polychroniadou
Abstract:
Federated Learning (FL) solutions with central Differential Privacy (DP) have seen large improvements in their utility in recent years arising from the matrix mechanism, while FL solutions with local (more private) DP have lagged behind. In this work, we introduce the distributed matrix mechanism to achieve the best of both worlds: the better privacy of local DP and the better utility of the matrix mechanism. We accomplish this using a novel cryptographic protocol that securely transfers sensitive values across rounds with constant communication overhead. This protocol accommodates the dynamic participation of users required by FL, including those that may drop out of the computation. We provide experiments which show that our mechanism indeed significantly improves the utility of FL models compared to previous local DP mechanisms, with little added overhead.
Paperid:2130
Authors:Rongzhe Wei · Mufei Li · Mohsen Ghassemi · Eleonora Kreacic · Yifan Li · Xiang Yue · Bo Li · Vamsi Potluru · Pan Li · Eli Chien
Abstract:
Large Language Models (LLMs) embed sensitive, human-generated data, prompting the need for unlearning methods. Although certified unlearning offers strong privacy guarantees, its restrictive assumptions make it unsuitable for LLMs, giving rise to various heuristic approaches typically assessed through empirical evaluations. These standard evaluations randomly select data for removal, apply unlearning techniques, and use membership inference attacks (MIAs) to compare unlearned models against models retrained without the removed data. However, to ensure robust privacy protections for every data point, it is essential to account for scenarios in which certain data subsets face elevated risks. Prior research suggests that outliers, particularly data tied to minority groups, often exhibit higher memorization propensity, which indicates they may be more difficult to unlearn. Building on these insights, we introduce a complementary, minority-aware evaluation framework to highlight blind spots in existing frameworks. We substantiate our findings with carefully designed experiments, using canaries with personally identifiable information (PII) to represent these minority subsets, and demonstrate that they suffer at least 20\% higher privacy leakage across various unlearning methods, MIAs, datasets, and LLM scales. Our proposed minority-aware evaluation framework marks an essential step toward more equitable and comprehensive assessments of LLM unlearning efficacy.
Paperid:2131
Authors:Zhuoying Li · Zhu Xu · Yuxin Peng · Yang Liu
Abstract:
Instruction-based image editing, which aims to modify an image faithfully according to an instruction while keeping irrelevant content unchanged, has made significant progress. However, there is still no comprehensive metric for assessing editing quality. Existing metrics either incur the high costs of human evaluation, which hinders large-scale assessment, or are adapted from other tasks and lose task-specific concerns, failing to comprehensively evaluate both the modification required by the instruction and the preservation of irrelevant regions, resulting in biased evaluation. To tackle this, we introduce a novel metric, Balancing Preservation Modification (BPM), tailored for instruction-based image editing, which explicitly disentangles the image into editing-relevant and irrelevant regions for separate consideration. We first identify and locate editing-relevant regions, followed by a two-tier process to assess editing quality: the Region-Aware Judge evaluates whether the position and size of the edited region align with the instruction, and the Semantic-Aware Judge further assesses instruction compliance within editing-relevant regions as well as content preservation within irrelevant regions, yielding comprehensive and interpretable quality assessment. Moreover, the editing-relevant region localization in BPM can be integrated into image editing approaches to improve editing quality, demonstrating its wide applicability. We verify the effectiveness of BPM on comprehensive instruction-editing data, and the results show that it yields the highest alignment with human evaluation compared to existing metrics, indicating its efficacy.
Paperid:2132
Authors:Huizhuo Yuan · Yifeng Liu · Shuang Wu · zhou Xun · Quanquan Gu
Abstract:
Training deep neural networks, and more recently large models, demands efficient and scalable optimizers. Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this task. Despite the development of numerous variance reduction algorithms in the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models. Consequently, it has remained a less favored approach in modern AI. In this paper, to unleash the power of variance reduction for efficient training of large models, we propose a unified optimization framework, MARS (Make vAriance Reduction Shine), which reconciles preconditioned gradient methods with variance reduction via a scaled stochastic recursive momentum technique. Within our framework, we introduce three instances of MARS that leverage preconditioned gradient updates based on AdamW, Lion, and Shampoo, respectively. We also draw a connection between our algorithms and existing optimizers. Experimental results on training GPT-2 models indicate that MARS consistently outperforms AdamW by a large margin.
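The scaled stochastic recursive momentum idea can be sketched roughly as below: the current gradient is corrected by a scaled difference with the previous gradient before feeding the usual adaptive moments. This is our own toy rendering, not the paper's MARS-AdamW instance; in particular, MARS evaluates both gradients on the same sample, whereas the sketch simply reuses the stored gradient from the last step, and all constants are illustrative.

```python
import numpy as np

def mars_like_step(params, grad_fn, state, lr=0.1, beta1=0.9, beta2=0.99,
                   gamma=0.05, eps=1e-8):
    # one step of a variance-reduced AdamW-style update (simplified sketch)
    g = grad_fn(params)
    g_prev = state.get("g_prev", g)
    c = g + gamma * (beta1 / (1.0 - beta1)) * (g - g_prev)   # recursive correction
    m = beta1 * state.get("m", 0.0) + (1.0 - beta1) * c      # first moment
    v = beta2 * state.get("v", 0.0) + (1.0 - beta2) * c * c  # second moment
    new_params = params - lr * m / (np.sqrt(v) + eps)
    return new_params, {"g_prev": g, "m": m, "v": v}
```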
Paperid:2133
Authors:Tao Wang · Ruipeng Zhang · Sicun Gao
Abstract:
Modern policy gradient algorithms, such as TRPO and PPO, outperform vanilla policy gradient in many RL tasks. Questioning the common belief that enforcing approximate trust regions leads to steady policy improvement in practice, we show that the more critical factor is the enhanced value estimation accuracy from more value update steps in each iteration. To demonstrate, we show that by simply increasing the number of value update steps per iteration, vanilla policy gradient itself can achieve performance comparable to or better than PPO in all the standard continuous control benchmark environments. Importantly, this simple change to vanilla policy gradient is significantly more robust to hyperparameter choices, opening up the possibility that RL algorithms may still become more effective and easier to use.
Paperid:2134
Authors:Yao Shu · Qixin Zhang · Kun He · Zhongxiang Dai
Abstract:
Zeroth-order (ZO) optimization plays an essential role in scenarios where gradient information is inaccessible or unaffordable, such as black-box systems and resource-constrained environments. While existing adaptive methods such as ZO-AdaMM have shown promise, they are fundamentally limited by their underutilization of moment information during optimization, usually resulting in underperforming convergence. To overcome these limitations, this paper introduces Refined Adaptive Zeroth-Order Optimization (R-AdaZO). Specifically, we first show the untapped variance reduction effect of the first moment estimate on ZO gradient estimation, which improves the accuracy and stability of ZO updates. We then refine the second moment estimate based on these variance-reduced gradient estimates to better capture the geometry of the optimization landscape, enabling more effective scaling of ZO updates. We present rigorous theoretical analysis showing (a) the first analysis of the variance reduction of the first moment estimate in ZO optimization, (b) the improved second moment estimates with a more accurate approximation of their variance-free ideal, (c) the first variance-aware convergence framework for adaptive ZO methods, which may be of independent interest, and (d) the faster convergence of R-AdaZO compared to existing baselines like ZO-AdaMM. Our extensive experiments, including synthetic problems, black-box adversarial attacks, and memory-efficient fine-tuning of large language models (LLMs), further verify the superior convergence of R-AdaZO, indicating that R-AdaZO offers an improved solution for real-world ZO optimization challenges.
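The core refinement, building the second moment from the variance-reduced first moment rather than from the raw ZO estimate, can be sketched as follows. Function names, constants, and the two-point estimator are our own toy illustration, not the paper's algorithm.

```python
import numpy as np

def zo_grad(f, x, mu, rng):
    # two-point zeroth-order gradient estimate along a random direction u
    u = rng.standard_normal(x.shape)
    return (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u

def r_adazo_like_step(f, x, state, lr=0.05, b1=0.9, b2=0.99, mu=1e-3, eps=1e-8):
    g = zo_grad(f, x, mu, state["rng"])
    m = b1 * state.get("m", 0.0) + (1 - b1) * g
    # the refinement sketched here: the second moment tracks the
    # variance-reduced first moment m, not the raw noisy estimate g
    v = b2 * state.get("v", 0.0) + (1 - b2) * m * m
    x = x - lr * m / (np.sqrt(v) + eps)
    state.update(m=m, v=v)
    return x, state
```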
Paperid:2135
Authors:Jialin Zhao · Yingtao Zhang · Xinghang Li · Huaping Liu · Carlo Cannistraci
Abstract:
The growing demands on GPU memory posed by the increasing number of neural network parameters call for training approaches that are more memory-efficient. Previous memory-reduction training techniques, such as Low-Rank Adaptation (LoRA) and ReLoRA, face challenges, with LoRA being constrained by its low-rank structure, particularly during intensive tasks like pre-training, and ReLoRA suffering from saddle point issues. In this paper, we propose Sparse Spectral Training (SST) to optimize memory usage for pre-training. SST updates all singular values and selectively updates singular vectors through a multinomial sampling method weighted by the magnitude of the singular values. Furthermore, SST employs singular value decomposition to initialize and periodically reinitialize low-rank parameters, reducing distortion relative to full-rank training compared to other low-rank methods. Through comprehensive testing on both Euclidean and hyperbolic neural networks across various tasks, including natural language generation, machine translation, node classification, link prediction, and image classification, SST demonstrates its ability to outperform existing memory-reduction training methods and is comparable to full-rank training in various cases. On LLaMA-1.3B, with only 18.7\% of the parameters trainable compared to full-rank training (using a rank equivalent to 6\% of the embedding dimension), SST reduces the perplexity gap between other low-rank methods and full-rank training by 97.4\%. This result highlights SST as an effective parameter-efficient technique for model pre-training, offering a promising new paradigm for achieving scalable and memory-efficient neural network training.
Paperid:2136
Authors:Zihao Xu · Dawei xu · Zihan Li · Chuan Zhang
Abstract:
Generative image steganography (GIS) is an emerging technique that conceals secret messages when generating images. Compared with GAN- or Flow-based GIS schemes, diffusion-model-based solutions can provide high-quality and more diverse generation, thus receiving considerable attention recently. However, the state-of-the-art GIS schemes still have issues with extraction accuracy, robustness, efficiency, and practicality. To address these issues, this paper proposes a practical and robust message-driven GIS framework, called MDDM. MDDM uses a carefully designed encoding strategy to encode the binary message and uses it as the starting point for diffusion, allowing users to generate diverse artistic images through controllable conditional text input without requiring additional training. During the information extraction process, the receiver only needs to use the secretly shared Cardan Grille to accurately perform diffusion inversion and recover the binary data, without knowing the seed or prompt. Experimental results show the advantages of MDDM in accuracy, controllability, efficiency, practicality, and security. Through flexible steganography strategies, MDDM can always achieve almost 100\% accuracy. In addition, MDDM applies to both conditional and unconditional diffusion.
Paperid:2137
Authors:Alaa Anani · Tobias Lorenz · Mario Fritz · Bernt Schiele
Abstract:
Deep learning explainability is essential to understand model predictions. Post hoc attribution methods determine which image pixels most influence a classifier prediction, but are non-robust against imperceptible noise, undermining their trustworthiness. Certified attribution methods should prove that pixel-level importance values are robust; however, prior work only provides image-level bounds, which are too coarse. We introduce a certification approach that guarantees the pixel-level robustness of any black-box attribution method via Randomized Smoothing against $\ell_2$-bounded input noise. By sparsifying then smoothing attributions, we reformulate the setup into a segmentation certification problem. We propose novel qualitative and quantitative metrics to assess the certified robustness of three families of attribution methods. Visualizing pixel-level certificates nicely complements the visual nature of attribution methods by providing a reliable certified output for downstream tasks. For quantitative evaluation, we introduce two metrics: (i) the percentage of certified pixels, measuring robustness, and (ii) certified localization. Our extensive experiments on ImageNet show high-quality certified outputs across attribution methods, comparing their certified visuals, robustness, and localization performance.
Paperid:2138
Authors:Mingchen Zhuge · Changsheng Zhao · Dylan Ashley · Wenyi Wang · Dmitrii Khizbullin · Yunyang Xiong · Zechun Liu · Ernie Chang · Raghuraman Krishnamoorthi · Yuandong Tian · Yangyang Shi · Vikas Chandra · Jürgen Schmidhuber
Abstract:
Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes, ignoring the step-by-step nature of the thinking done by agentic systems, or require excessive manual labour. To address this, we introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is a natural extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process for more precise evaluations. We apply the Agent-as-a-Judge framework to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic AI code generation tasks. DevAI includes rich manual annotations, such as a total of 365 hierarchical solution requirements, which make it particularly suitable for an agentic evaluator. We benchmark three of the top code-generating agentic systems using Agent-as-a-Judge and find that our framework dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether, we believe that this work represents a concrete step towards enabling vastly more sophisticated agentic systems.
Paperid:2139
Authors:Ya-Chi Chu · Wenzhi Gao · Yinyu Ye · Madeleine Udell
Abstract:
This paper investigates the convergence properties of the hypergradient descent method ($\texttt{HDM}$), a 25-year-old heuristic originally proposed to select stepsizes for stochastic first-order methods adaptively. We provide the first rigorous convergence analysis of $\texttt{HDM}$ with a recently developed online learning framework. Notably, $\texttt{HDM}$ automatically identifies the optimal stepsize for the local optimization landscape and achieves local superlinear convergence. Our analysis explains the instabilities of $\texttt{HDM}$ reported in the literature and proposes efficient strategies to address them. We also develop two practical $\texttt{HDM}$ variants incorporating heavy-ball and Nesterov momentum. Experiments on deterministic convex problems show that $\texttt{HDM}$ with heavy-ball momentum ($\texttt{HDM-HB}$) exhibits robust performance that significantly outperforms other adaptive first-order methods. In particular, $\texttt{HDM-HB}$ often matches the performance of $\texttt{L-BFGS}$, an efficient and practical quasi-Newton method, using less memory and cheaper iterations.
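The underlying heuristic, adapting the stepsize with the inner product of consecutive gradients, can be sketched as below. Since $x_t = x_{t-1} - \alpha g_{t-1}$, the derivative of the loss with respect to $\alpha$ is $-g_t \cdot g_{t-1}$, so the stepsize itself is updated by gradient descent. This is a minimal illustration of plain hypergradient descent, without the paper's online-learning analysis or momentum variants; all parameter names and defaults are ours.

```python
import numpy as np

def hdm(grad_fn, x0, alpha0=0.01, beta=0.001, steps=300):
    # plain hypergradient descent on both the iterate x and its stepsize alpha
    x = np.asarray(x0, dtype=float)
    alpha, g_prev = alpha0, np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x)
        alpha += beta * float(g @ g_prev)  # stepsize update from the hypergradient
        x = x - alpha * g
        g_prev = g
    return x, alpha
```

On a quadratic such as f(x) = 0.5 * ||x||^2 (gradient x), alpha grows while consecutive gradients stay aligned and shrinks after overshoot, settling near a workable stepsize.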
Paperid:2140
Authors:Langzhang Liang · Fanchen Bu · Zixing Song · Zenglin Xu · Shirui Pan · Kijung Shin
Abstract:
The message-passing paradigm of Graph Neural Networks often struggles with exchanging information across distant nodes, typically due to structural bottlenecks in certain graph regions, a limitation known as over-squashing. To reduce such bottlenecks, graph rewiring, which modifies graph topology, has been widely used. However, existing graph rewiring techniques often overlook the need to preserve critical properties of the original graph, e.g., spectral properties. Moreover, many approaches rely on increasing edge count to improve connectivity, which introduces significant computational overhead and exacerbates the risk of over-smoothing. In this paper, we propose a novel graph-rewiring method that leverages spectral graph sparsification for mitigating over-squashing. Specifically, our method generates graphs with enhanced connectivity while maintaining sparsity and largely preserving the original graph spectrum, effectively balancing structural bottleneck reduction and graph property preservation. Experimental results validate the effectiveness of our approach, demonstrating its superiority over strong baseline methods in classification accuracy and retention of the Laplacian spectrum.
Paperid:2141
Authors:Haorui Wang · Jeff Guo · Lingkai Kong · Rampi Ramprasad · Philippe Schwaller · Yuanqi Du · Chao Zhang
Abstract:
Retrosynthesis, the process of breaking down a target molecule into simpler precursors through a series of valid reactions, stands at the core of organic chemistry and drug development. Although recent machine learning (ML) research has advanced single-step retrosynthetic modeling and subsequent route searches, these solutions remain restricted by the extensive combinatorial space of possible pathways. Concurrently, large language models (LLMs) have exhibited remarkable chemical knowledge, hinting at their potential to tackle complex decision-making tasks in chemistry. In this work, we explore whether LLMs can successfully navigate the highly constrained, multi-step retrosynthesis planning problem. We introduce an efficient scheme for encoding reaction pathways and present a new route-level search strategy, moving beyond the conventional step-by-step reactant prediction. Through comprehensive evaluations, we show that our LLM-augmented approach excels at retrosynthesis planning and extends naturally to the broader challenge of synthesizable molecular design.
Paperid:2142
Authors:Chengyi Cai · Zesheng Ye · Lei Feng · Jianzhong Qi · Feng Liu
Abstract:
Model reprogramming adapts pre-trained models to downstream tasks by modifying only the input and output spaces. Visual reprogramming (VR) is one instance for vision tasks that adds a trainable noise pattern (i.e., a visual prompt) to input images to facilitate downstream classification. Existing VR approaches for CLIP train a single visual prompt using all descriptions of different downstream classes. However, the limited learning capacity may result in (1) a failure to capture diverse aspects of the descriptions (e.g., shape, color, and texture), and (2) a possible bias toward less informative attributes that do not help distinguish between classes. In this paper, we introduce a decoupling-and-reweighting framework. Our decoupled visual prompts (DVP) are optimized using descriptions grouped by explicit causes (DVP-cse) or unsupervised clusters (DVP-cls). Then, we integrate the outputs of these visual prompts with a probabilistic reweighting matrix (PRM) that measures their contributions to each downstream class. Theoretically, DVP lowers the empirical risk bound. Experimentally, DVP outperforms baselines on average across 11 downstream datasets. Notably, the DVP-PRM integration enables insights into how individual visual prompts influence classification decisions, providing a probabilistic framework for understanding reprogramming.
Paperid:2143
Authors:Ching-Hua Lee · Chouchang Yang · Jaejin Cho · Yashas Malur Saidutta · Rakshith Sharma Srinivasa · Yilin Shen · Hongxia Jin
Abstract:
Denoising diffusion probabilistic models (DDPMs) can be utilized for recovering a clean signal from its degraded observation(s) by conditioning the model on the degraded signal. The degraded signals are themselves contaminated versions of the clean signals; due to this correlation, they may encompass certain useful information about the target clean data distribution. However, existing adoption of the standard Gaussian as the prior distribution in turn discards such information, resulting in suboptimal performance. In this paper, we propose to improve conditional DDPMs for signal restoration by leveraging a more informative prior that is jointly learned with the diffusion model. The proposed framework, called RestoreGrad, seamlessly integrates DDPMs into the variational autoencoder framework and exploits the correlation between the degraded and clean signals to encode a better diffusion prior. On speech and image restoration tasks, we show that RestoreGrad demonstrates faster convergence (5-10 times fewer training steps) to achieve better quality of restored signals over existing DDPM baselines, and improved robustness to using fewer sampling steps in inference time (2-2.5 times fewer), advocating the advantages of leveraging jointly learned prior for efficiency improvements in the diffusion process.
Paperid:2144
Authors:Chia-Han Yeh · Tse-Sheng Nan · Risto Vuorio · Wei Hung · Hung Wu · Shao-Hua Sun · Ping-Chun Hsieh
Abstract:
Policy learning under action constraints plays a central role in ensuring safe behaviors in various robot control and resource allocation applications. In this paper, we study a new problem setting termed Action-Constrained Imitation Learning (ACIL), where an action-constrained imitator aims to learn from a demonstrative expert with a larger action space. The fundamental challenge of ACIL lies in the unavoidable mismatch of occupancy measure between the expert and the imitator caused by the action constraints. We tackle this mismatch through trajectory alignment and propose DTWIL, which replaces the original expert demonstrations with a surrogate dataset that follows similar state trajectories while adhering to the action constraints. Specifically, we recast trajectory alignment as a planning problem and solve it via Model Predictive Control, which aligns the surrogate trajectories with the expert trajectories based on the Dynamic Time Warping (DTW) distance. Through extensive experiments, we demonstrate that learning from the dataset generated by DTWIL significantly enhances performance across multiple robot control tasks and outperforms various benchmark imitation learning algorithms in terms of sample efficiency.
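The Dynamic Time Warping distance used for trajectory alignment is the standard dynamic program shown below (a minimal sketch of the classic algorithm; DTWIL itself couples this distance with Model Predictive Control, which is not shown).

```python
import numpy as np

def dtw_distance(s, t):
    # classic O(len(s) * len(t)) dynamic program for the DTW distance
    n, m = len(s), len(t)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.atleast_1d(s[i - 1]) - np.atleast_1d(t[j - 1]))
            # best of: skip in s, skip in t, or advance in both
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Unlike a pointwise Euclidean distance, DTW tolerates trajectories of different lengths and speeds, which is what allows surrogate trajectories to track expert state sequences under tighter action constraints.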
Paperid:2145
Authors:Maheed Ahmed · Jayanth Bhargav · Mahsa Ghasemi
Abstract:
Representation learning plays a crucial role in reinforcement learning, especially in complex environments with high-dimensional and unstructured states. Effective representations can enhance the efficiency of learning algorithms by improving sample efficiency and generalization across tasks. This paper considers the Laplacian-based framework for representation learning, where the eigenvectors of the Laplacian matrix of the underlying transition graph are leveraged to encode meaningful features from raw sensory observations of the states. Despite the promising algorithmic advances in this framework, it remains an open question whether Laplacian-based representations can be learned online, with theoretical guarantees, alongside policy learning. To answer this question, we study online Laplacian-based representation learning, where the graph-based representation is updated simultaneously while the policy is updated by the reinforcement learning algorithm. We design an online optimization formulation by introducing the Asymmetric Graph Drawing Objective (AGDO) and provide a theoretical analysis of the convergence of running online projected gradient descent on AGDO under mild assumptions. Specifically, we show that if the policy learning algorithm induces a bounded drift on the policy, running online projected gradient descent on AGDO exhibits ergodic convergence. Our extensive simulation studies empirically validate the guarantees of convergence to the true Laplacian representation. Furthermore, we provide insights into the compatibility of different reinforcement learning algorithms with online representation learning.
Paperid:2146
Authors:Songlin Xu · Xinyu Zhang
Abstract:
Using deep neural networks as computational models to simulate cognitive processes can provide key insights into human behavioral dynamics. Challenges arise when environments are highly dynamic, obscuring stimulus-behavior relationships. However, the majority of current research focuses on simulating human cognitive behaviors under ideal conditions, neglecting the influence of environmental disturbances. We propose CogReact, which integrates drift-diffusion with deep reinforcement learning to simulate the granular effects of dynamic environmental stimuli on the human cognitive process. Quantitatively, it improves cognitive modeling by considering the temporal effect of environmental stimuli on the cognitive process and captures both subject-specific and stimuli-specific behavioral differences. Qualitatively, it captures general trends in the human cognitive process under stimuli better than baselines. Our approach is examined across diverse environmental influences on various cognitive tasks. Overall, it demonstrates a powerful, data-driven methodology to simulate, align with, and understand the vagaries of human cognitive responses in dynamic contexts.
Paperid:2147
Authors:Iurii Kemaev · Dan Andrei Calian · Luisa Zintgraf · Gregory Farquhar · Hado van Hasselt
Abstract:
Gradient-based bilevel optimisation is a powerful technique with applications in hyperparameter optimisation, task adaptation, algorithm discovery, meta-learning more broadly, and beyond. It often requires differentiating through the gradient-based optimisation process itself, leading to "gradient-of-a-gradient" calculations with computationally expensive second-order and mixed derivatives. While modern automatic differentiation libraries provide a convenient way to write programs for calculating these derivatives, they often cannot fully exploit the specific structure of these problems out of the box, leading to suboptimal performance. In this paper, we analyse such cases and propose Mixed-Flow Meta-Gradients, or MixFlow-MG, a practical algorithm that uses mixed-mode differentiation to construct more efficient and scalable computational graphs, yielding over 10x memory and up to 25\% wall-clock time improvements over standard implementations in modern meta-learning setups.
Paperid:2148
Authors:Zhiruo Wang · Jiayuan Mao · Daniel Fried · Graham Neubig
Abstract:
Despite the potential of language model-based agents to solve real-world tasks such as web navigation, current methods still struggle with long-horizon tasks with complex action trajectories. In contrast, humans can flexibly solve complex tasks by learning reusable task workflows from past experiences and using them to guide future actions. To build agents that can similarly benefit from this process, we introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines, i.e., workflows, and selectively providing workflows to the agent to guide subsequent generations. AWM flexibly applies to both offline and online scenarios, where agents induce workflows from training examples beforehand or from test queries on the fly. We experiment on two major web navigation benchmarks, Mind2Web and WebArena, that collectively cover 1000+ tasks from 200+ domains across travel, shopping, and social media, among others. AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena while reducing the number of steps taken to solve WebArena tasks successfully. Furthermore, online AWM robustly generalizes in cross-task, cross-website, and cross-domain evaluations, surpassing baselines by 8.9 to 14.0 absolute points as train-test task distribution gaps widen.
Paperid:2149
Authors:Yang Luo · Zangwei Zheng · Ziheng Qin · Zirui Zhu · Yong Liu · Yang You
Abstract:
Large-batch training has become a cornerstone in accelerating the training of deep neural networks, yet it poses challenges in optimization and generalization. Existing optimizers like AdamW present performance degradation during language models' large-batch training, due to the information bottleneck of attention layers caused by the sharp increase of the max attention logit. While the LAMB optimizer partially addresses this issue, some attention layers still face it. The reason is that $l_2$-norm-based trust ratios in LAMB are less effective in directly influencing the max value of query/key weights. Furthermore, the weight-wise trust ratio in LAMB is error-prone because it overlooks relationships among weight values within rows or columns. Building on these observations, we propose a novel optimizer, MERIT, which leverages the max-norm to calculate the trust ratio, constraining the max attention logit more effectively. Moreover, we construct element-wise trust ratios to provide more robust update scaling by focusing on local weight structures. Extensive experiments on large-batch training across various sizes of GPT-2 models demonstrate the superior performance of MERIT. Notably, during the training of GPT-2 Medium, MERIT enables a 6k batch size without any performance degradation compared to the standard batch size (480) with 48B training tokens. This work highlights the importance of considering the max attention logit and finer-granularity trust ratio calculation in large-batch training. It improves training stability and paves the way for larger batch usage, enabling faster development and iteration on large language models.
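The contrast between a max-norm trust ratio and LAMB's $l_2$ version can be sketched as below. This is an illustration under our own simplifications (function names, eps, and the row/column combination rule are ours), not the exact MERIT update.

```python
import numpy as np

def maxnorm_trust_ratio(weight, update, eps=1e-8):
    # layer-wise trust ratio using the max-norm in place of LAMB's l2 norm,
    # which bounds the largest weight change and hence the max attention logit
    return np.max(np.abs(weight)) / (np.max(np.abs(update)) + eps)

def merit_like_scale(weight, update, eps=1e-8):
    # element-wise ratios over rows and columns for more local update scaling
    row = np.max(np.abs(weight), axis=1, keepdims=True) / (
          np.max(np.abs(update), axis=1, keepdims=True) + eps)
    col = np.max(np.abs(weight), axis=0, keepdims=True) / (
          np.max(np.abs(update), axis=0, keepdims=True) + eps)
    return update * np.minimum(row, col)  # take the more conservative ratio
```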
Paperid:2150
Authors:Yingzhen Yang
Abstract:
We introduce a new tool, Transductive Local Complexity (TLC), to analyze the generalization performance of transductive learning methods and motivate new transductive learning algorithms. Our work extends the idea of the popular Local Rademacher Complexity (LRC) \citep{bartlett2005} to the transductive setting with considerable and novel changes compared to the analysis of typical LRC methods in the inductive setting. While LRC has been widely used as a powerful tool in the analysis of inductive models with sharp generalization bounds for classification and minimax rates for nonparametric regression, it remains an open problem whether a localized version of Rademacher complexity based tool can be designed and applied to transductive learning and gain sharp bound for transductive learning which is consistent with the inductive excess risk bound by (LRC) \citep{bartlett2005}. We give a confirmative answer to this open problem by TLC. Similar to the development of LRC \citep{Bartlett2003}, we build TLC by first establishing a novel and sharp concentration inequality for supremum of empirical processes for the gap between test and training loss in the setting of sampling uniformly without replacement. Then a peeling strategy and a new surrogate variance operator are use to derive the following excess risk bound in the transductive setting, which is consistent with that of the classical LRC based excess risk bound in the inductive setting. As an application of TLC, we use the new TLC tool to analyze the Transductive Kernel Learning (TKL) model, and derive sharper excess risk bound than that by the current stateof-the-art \citep{TolstikhinBK14-local-complexity-TRC} under the same assumptions. 
As a result of independent interest, the concentration inequality for the test-train process is used to derive a sharp concentration inequality for the general supremum of an empirical process involving random variables in the setting of sampling uniformly without replacement, with a comparison to existing concentration inequalities under the same setting.
Paperid:2151
Authors:Yushi Huang · Zining Wang · Ruihao Gong · Jing Liu · Xinjie Zhang · Jinyang Guo · Xianglong Liu · Jun Zhang
Abstract:
Diffusion Transformers (DiTs) excel in generative tasks but face practical deployment challenges due to high inference costs. Feature caching, which stores and retrieves redundant computations, offers the potential for acceleration. Existing learning-based caching, though adaptive, overlooks the impact of the prior timestep. It also suffers from misaligned objectives (*aligned predicted noise vs. high-quality images*) between training and inference. These two discrepancies compromise both performance and efficiency. To this end, we *harmonize* training and inference with a novel learning-based *caching* framework dubbed **HarmoniCa**. It first incorporates *Step-Wise Denoising Training* (SDT) to ensure the continuity of the denoising process, where prior steps can be leveraged. In addition, an *Image Error Proxy-Guided Objective* (IEPO) is applied to balance image quality against cache utilization through an efficient proxy that approximates the image error. Extensive experiments across $8$ models, $4$ samplers, and resolutions from $256\times256$ to $2K$ demonstrate the superior performance and speedup of our framework. For instance, it achieves over $40\%$ latency reduction (*i.e.*, $2.07\times$ theoretical speedup) and improved performance on PixArt-$\alpha$. Remarkably, our *image-free* approach reduces training time by $25\%$ compared with the previous method. *Our code will be released upon acceptance.*
Paperid:2152
Authors:Rong-Xi Tan · Ming Chen · Ke Xue · Yao Wang · Yaoyuan Wang · Fu Sheng · Chao Qian
Abstract:
The pursuit of universal black-box optimization (BBO) algorithms is a longstanding goal. However, unlike domains such as language or vision, where scaling structured data has driven generalization, progress in offline BBO remains hindered by the lack of unified representations for heterogeneous numerical spaces. Thus, existing offline BBO approaches are constrained to single-task and fixed-dimensional settings, failing to achieve cross-domain universal optimization. Recent advances in large language models (LLMs) offer a promising path forward: their embeddings capture latent relationships in a unifying way, making universal optimization across different data types possible. In this paper, we discuss multiple potential approaches, including an end-to-end learning framework in the form of next-token prediction, as well as prioritizing the learning of latent spaces with strong representational capabilities. To validate the effectiveness of these methods, we collect as many offline BBO tasks and data as possible from open-source academic works for training. Experiments demonstrate the universality and effectiveness of our proposed methods. Our findings suggest that unifying language model priors and learning a string embedding space can overcome traditional barriers in universal BBO, paving the way for general-purpose BBO algorithms. The code is provided in the supplemental materials.
Paperid:2153
Authors:Spandan Pyakurel · Qi Yu
Abstract:
Fine-Grained Open-set Detection (FGOD) poses a fundamental challenge due to the similarity between open-set classes and closed-set ones. Since real-world objects/entities tend to form a hierarchical structure, the fine-grained relationships among the closed-set classes as captured by the hierarchy could potentially improve FGOD performance. Intuitively, the hierarchical dependency among different classes allows the model to recognize their subtle differences, which in turn makes it better at differentiating similar open-set classes even when they share the same parent. However, simply performing open-set detection in a top-down fashion by building a local detector for each node may result in poor detection performance. Our theoretical analysis also reveals that maximizing the probability of the path leading to the ground-truth leaf node results in a sub-optimal training process. To systematically address this issue, we propose a novel state-based node representation, which constructs a state space based upon the entire hierarchical structure. We prove that the state-based representation is guaranteed to maximize the probability on the path leading to the ground-truth leaf node. Extensive experiments on multiple real-world hierarchical datasets clearly demonstrate the superior performance of the proposed method.
Paperid:2154
Authors:Stefano Viel · Luca Viano · Volkan Cevher
Abstract:
This paper introduces the SOAR framework for imitation learning. SOAR is an algorithmic template that learns a policy from expert demonstrations with a primal-dual style algorithm alternating cost and policy updates. Within the policy updates, the SOAR framework prescribes using an actor-critic method with multiple critics to estimate the critic uncertainty and thereby build an optimistic critic, which is fundamental to driving exploration. When instantiated in the tabular setting, we obtain a provable algorithm dubbed FRA with guarantees matching the best known results in $\epsilon$. Practically, the SOAR template is shown to consistently boost the performance of primal-dual IL algorithms that build on actor-critic routines for their policy updates. Thanks to SOAR, the number of episodes required to achieve the same performance is approximately halved.
Paperid:2155
Authors:Yang Li
Abstract:
Despite their remarkable capabilities, large language models often struggle with tasks requiring complex reasoning and planning. While existing approaches like Chain-of-Thought prompting and tree search techniques show promise, they are limited by their reliance on predefined heuristics and computationally expensive exploration strategies. We propose Policy-Guided Tree Search (PGTS), a framework that combines reinforcement learning with structured tree exploration to efficiently navigate reasoning paths. Our key innovation is a learned policy that dynamically decides between expanding, branching, backtracking, or terminating exploration, eliminating the need for manual heuristics or exhaustive search. Experiments across mathematical reasoning, logical deduction, and planning benchmarks demonstrate that PGTS achieves superior reasoning performance while significantly reducing computational costs compared to existing methods. These results establish PGTS as a scalable and effective solution for tackling complex reasoning tasks with LLMs.
Paperid:2156
Authors:Karina Zainullina · Aleksandr Golubev · Maria Trofimova · Sergei Polezhaev · Ibragim Badertdinov · Daria Litvintseva · Simon Karasik · Filipp Fisin · Sergei Skvortsov · Maksim Nekrashevich · Anton Shevtsov · Boris Yangel
Abstract:
Large language models (LLMs) have recently achieved remarkable results in complex multi-step tasks, such as mathematical reasoning and agentic software engineering. However, they often struggle to maintain consistent performance across multiple solution attempts. One effective approach to narrow the gap between average-case and best-case performance is guided test-time search, which explores multiple solution paths to identify the most promising one. Unfortunately, effective search techniques (e.g. MCTS) are often unsuitable for *non-serializable* RL environments, such as Docker containers, where intermediate environment states cannot be easily saved and restored. We investigate two complementary search strategies applicable to such environments: $1$-step lookahead and trajectory selection, both guided by a learned action-value function estimator. On the SWE-bench Verified benchmark, a key testbed for agentic software engineering, we find these methods to double the average success rate of a Qwen-72B-based model, achieving $40.8$%, the new state-of-the-art for open-weights models. Additionally, we show that these techniques are transferable to more advanced closed models, yielding similar improvements with GPT-4o.
Paperid:2157
Authors:Han Zhong · Yutong Yin · Shenao Zhang · Xiaojun Xu · Yuanxin Liu · Yifei Zuo · Zhihan Liu · Boyi Liu · Sirui Zheng · Hongyi Guo · Liwei Wang · Mingyi Hong · Zhaoran Wang
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks, yet generating reliable reasoning processes remains a significant challenge. We present a unified probabilistic framework that formalizes LLM reasoning through a novel graphical model incorporating latent thinking processes and evaluation signals. Our framework addresses two critical questions: (1) how to automatically generate high-quality reasoning processes during inference, and (2) how to integrate these processes into post-training. We propose the \emph{Bootstrapping Reinforced Thinking Process} (BRiTE) algorithm and demonstrate its theoretical convergence at a rate of $1/T$, where $T$ is the number of iterations. The algorithm operates in two steps. First, it generates high-quality rationales by approximating the desired posterior distribution using a reinforcement learning approach with a novel reward shaping mechanism. Second, it fine-tunes the base LLM by maximizing the joint probability of rationale generation with respect to LLM parameters. Empirical evaluation on GSM8K and MATH benchmarks demonstrates that our approach consistently improves performance across different model sizes without requiring human-annotated thinking processes, outperforming standard chain-of-thought prompting while enhancing existing post-training methods.
Paperid:2158
Authors:Clarissa Lauditi · Blake Bordelon · Cengiz Pehlevan
Abstract:
Previous influential work showed that infinite-width limits of neural networks in the lazy training regime are described by kernel machines. Here, we show that neural networks trained in the rich infinite-width regime in two different settings are also described by kernel machines, but with data-dependent kernels. For both cases, we provide explicit expressions for the kernel predictors and prescriptions to numerically calculate them. To derive the first predictor, we study the large-width limit of feature-learning Bayesian networks, showing how feature learning leads to task-relevant adaptation of layer kernels and preactivation densities. The saddle point equations governing this limit result in a min-max optimization problem that defines the kernel predictor. To derive the second predictor, we study gradient flow training of randomly initialized networks trained with weight decay in the infinite-width limit using dynamical mean field theory (DMFT). The fixed point equations of the arising DMFT define the task-adapted internal representations and the kernel predictor. We compare our kernel predictors to kernels derived from the lazy regime and demonstrate that our adaptive kernels achieve lower test loss on benchmark datasets.
Paperid:2159
Authors:Jeffrey A. Chan-Santiago · praveen tirupattur · Gaurav Kumar Nayak · Gaowen Liu · Mubarak Shah
Abstract:
Dataset distillation has emerged as an effective strategy, significantly reducing training costs and facilitating more efficient model deployment. Recent advances have leveraged generative models to distill datasets by capturing the underlying data distribution. Unfortunately, existing methods require model fine-tuning with distillation losses to encourage diversity and representativeness. However, these methods do not guarantee sample diversity, limiting their performance. We propose a mode-guided diffusion model that leverages a pre-trained diffusion model without the need to fine-tune with distillation losses. Our approach addresses dataset diversity in three stages: Mode Discovery to identify distinct data modes, Mode Guidance to enhance intra-class diversity, and Stop Guidance to mitigate artifacts in synthetic samples that affect performance. We evaluate our approach on ImageNette, ImageIDC, ImageNet-100, and ImageNet-1K, achieving accuracy improvements of 4.4%, 2.9%, 1.6%, and 1.6%, respectively, over state-of-the-art methods. Our method eliminates the need for fine-tuning diffusion models with distillation losses, significantly reducing computational costs.
Paperid:2160
Authors:Wenqin Liu · Haoze Hou · Erdun Gao · Biwei Huang · Qiuhong Ke · Howard Bondell · Mingming Gong
Abstract:
Score-based generative models are essential in various machine learning applications, with strong capabilities in generation quality. High-order derivatives (scores) of the data density offer deep insights into data distributions; building on the proven effectiveness of first-order scores for modeling and generating synthetic data, they unlock new possibilities for applications. However, learning them often requires complete data, which is frequently unavailable in fields such as healthcare and finance due to privacy concerns and high costs. To address this, we introduce MissScore, a novel framework for estimating high-order scores with missing data. We derive objective functions for estimating high-order scores under different missing data mechanisms and propose a new algorithm to handle missing data effectively. Our empirical results demonstrate that MissScore efficiently and accurately approximates high-order scores with missing data while improving sampling speed and data quality, as validated through several downstream tasks.
Paperid:2161
Authors:Thalles Silva · Helio Pedrini · Adín Ramírez Rivera
Abstract:
We present Self-Organizing Visual Prototypes (SOP), a new training technique for unsupervised visual feature learning. Unlike existing prototypical self-supervised learning (SSL) methods that rely on a single prototype to encode all the relevant features of a hidden cluster in the data, we propose the SOP strategy. In this strategy, a prototype is represented by many semantically similar representations, or support embeddings (SEs), each containing a complementary set of features that together fully characterize their region in space and maximize training performance. We reaffirm the feasibility of non-parametric SSL by introducing novel non-parametric adaptations of two loss functions implementing the SOP strategy. Notably, we introduce the SOP Masked Image Modeling (SOP-MIM) task, where masked representations are reconstructed from the perspective of multiple non-parametric local SEs. We comprehensively assess SOP's learned representations on benchmarks such as retrieval, linear evaluation, fine-tuning, and object detection. SOP achieves the best performance on many retrieval benchmarks and shows increased performance gains as more complex encoders are utilized.
Paperid:2162
Authors:Justin Chen · Vincent Cohen-Addad · Alessandro Epasto · Morteza Zadimoghaddam
Abstract:
In the differentially private partition selection problem (a.k.a. set union, key discovery), users hold subsets of items from an unbounded universe. The goal is to output as many items as possible from the union of the users' sets while maintaining user-level differential privacy. Solutions to this problem are a core building block for many privacy-preserving ML applications, including vocabulary extraction in a private corpus, computing statistics over categorical data, and learning embeddings over user-provided items. We propose an algorithm for this problem, MaxAdaptiveDegree (MAD), which adaptively reroutes weight from items with weight far above the threshold needed for privacy to items with smaller weight, thereby increasing the probability that less frequent items are output. Our algorithm can be efficiently implemented in massively parallel computation systems, allowing scalability to very large datasets. We prove that our algorithm stochastically dominates the standard parallel algorithm for this problem. We also develop a two-round version of our algorithm, MAD2R, where results of the computation in the first round are used to bias the weighting in the second round to maximize the number of items output. In experiments, our algorithms provide the best results across the board among parallel algorithms and scale to datasets with hundreds of billions of items, up to three orders of magnitude larger than those analyzed in prior works.
Paperid:2163
Authors:Zhen Sun · Yunhang Shen · Jie Li · Xing Sun · Pingyang Dai · Liujuan Cao · Rongrong Ji
Abstract:
Inspired by the success of Vision-Language Models (VLMs), an increasing number of researchers have been dedicated to improving VLMs, achieving encouraging results. Current VLMs are typically process-oriented, indirectly enhancing the quality of extracted image features by optimizing the structures of the Vision Encoder and Connector components. However, they lack direct methods to improve feature quality, especially when images contain rich and detailed semantic information, where performance can be suboptimal. To address this, we propose DS-VLM (Diffusion Supervision Vision Language Model), a result-oriented, plug-and-play approach. We leverage a diffusion model to directly supervise the features extracted by the two components: the diffusion model conditions on the supervised features to reconstruct the original image. Through the reconstruction loss, the extracted features are iteratively optimized to preserve the detailed semantic and structural information of the image to the greatest extent. Extensive experiments with various vision encoders and different scales of LLMs validate the effectiveness of our method.
Paperid:2164
Authors:Antonio Ocello · Daniil Tiapkin · Lorenzo Mancini · Mathieu Lauriere · Eric Moulines
Abstract:
We introduce Mean Field Trust Region Policy Optimization (MF-TRPO), a novel algorithm designed to compute approximate Nash equilibria for ergodic Mean Field Games (MFGs) in finite state-action spaces. Building on the well-established performance of TRPO in the reinforcement learning (RL) setting, we extend its methodology to the MFG framework, leveraging its stability and robustness in policy optimization. Under standard assumptions in the MFG literature, we provide a rigorous analysis of MF-TRPO, establishing theoretical guarantees on its convergence. Our results cover both the exact formulation of the algorithm and its sample-based counterpart, where we derive high-probability guarantees and finite sample complexity. This work advances MFG optimization by bridging RL techniques with mean-field decision-making, offering a theoretically grounded approach to solving complex multi-agent problems.
Paperid:2165
Authors:Minsung Hwang · Jaejun Lee · Joyce Whang
Abstract:
Inductive knowledge graph completion aims to predict missing triplets in an incomplete knowledge graph that differs from the one observed during training. While subgraph reasoning models have demonstrated empirical success in this task, their theoretical properties, such as stability and generalization capability, remain unexplored. In this work, we present the first theoretical analysis of the relationship between the stability and the generalization capability of subgraph reasoning models. Specifically, we define stability as the degree of consistency in a subgraph reasoning model's outputs in response to differences in input subgraphs and introduce the Relational Tree Mover's Distance as a metric to quantify the differences between the subgraphs. We then show that the generalization capability of subgraph reasoning models, defined as the discrepancy between the performance on training data and test data, is proportional to their stability. Furthermore, we empirically analyze the impact of stability on generalization capability using real-world datasets, validating our theoretical findings.
Paperid:2166
Authors:Zhaoyu Zhang · Yang Hua · Guanxiong Sun · Hui Wang · Seán McLoone
Abstract:
Diffusion-based generative models (diffusion models) often require a large amount of data to train a score-based model that learns the score function of the data distribution through denoising score matching. However, collecting and cleaning such data can be expensive, time-consuming, and even infeasible. In this paper, we present a novel theoretical insight for diffusion models: two factors, i.e., the denoiser function hypothesis space and the number of training samples, affect the denoising score matching error over all training samples. This insight makes clear that the total denoising score matching error is hard to minimize within the denoiser function hypothesis space of existing methods when training diffusion models with limited data. To address this, we propose a new diffusion model for limited data called Limited Data Diffusion (LD-Diffusion), which consists of two main components: a compressing model and a novel mixed augmentation with fixed probability (MAFP) strategy. Specifically, the compressing model constrains the complexity of the denoiser function hypothesis space, and MAFP effectively increases the training samples by providing more informative guidance in the compressed hypothesis space than existing data augmentation methods. Extensive experiments on several datasets demonstrate that LD-Diffusion improves the training of diffusion models with limited data and achieves better performance than other diffusion models.
Paperid:2167
Authors:Mikołaj Małkiński · Szymon Pawlonka · Jacek Mańdziuk
Abstract:
Abstract visual reasoning (AVR) involves discovering shared concepts across images through analogy, akin to solving IQ test problems. Bongard Problems (BPs) remain a key challenge in AVR, requiring both visual reasoning and verbal description. We investigate whether multimodal large language models (MLLMs) can solve BPs by formulating a set of diverse MLLM-suited solution strategies and testing 4 proprietary and 4 open-access models on 3 BP datasets featuring synthetic (classic BPs) and real-world (Bongard-HOI and Bongard-OpenWorld) images. Despite some successes on real-world datasets, MLLMs struggle with synthetic BPs. To explore this gap, we introduce Bongard-RWR, a dataset representing synthetic BP concepts using real-world images. Our findings suggest that the weak MLLM performance on classic BPs is not due to domain specificity but rather stems from their general AVR limitations.
Paperid:2168
Authors:Changyi He · Yifu Ding · Jinyang Guo · Ruihao Gong · Haotong Qin · Xianglong Liu
Abstract:
Although knowledge distillation (KD) is an effective approach to improve the performance of a smaller LLM (i.e., the student model) by transferring knowledge from a large LLM (i.e., the teacher model), it still suffers from high training cost. Existing LLM distillation methods ignore the difficulty differences among samples, making the distillation of easy samples unnecessary. This leads to high distillation cost. In this paper, we propose a difficulty-aware knowledge distillation (DA-KD) framework for efficient knowledge distillation, in which we dynamically adjust the distillation dataset based on the difficulty of samples. We further observe that existing KD losses cannot perform well when most samples in the distillation dataset are difficult, because of unstable optimization and the neglect of hard samples. Therefore, we also propose a new KD loss called bidirectional discrepancy loss (BDL) for effective KD. Extensive experiments demonstrate that our DA-KD framework is effective and efficient. Without bells and whistles, DA-KD can outperform existing state-of-the-art KD methods by 2\% with half the training cost and even surpass the teacher model with 4.7$\times$ compression.
Paperid:2169
Authors:Geonho Hwang · Yeachan Park · Wonyeol Lee · Sejun Park
Abstract:
Existing works on the expressive power of neural networks typically assume real network parameters and exact mathematical operations during the evaluation of networks. However, neural networks executed by a computer can only have parameters in a small subset of reals and perform inexact mathematical operations with roundoff errors and overflow. In this work, we study the expressive power of floating-point neural networks, i.e., networks under floating-point arithmetic. We first observe that for floating-point neural networks to represent all functions from floating-point vectors to floating-point numbers, it is necessary to distinguish different inputs: the first layer of a network should be able to generate different outputs for different inputs. We also prove that such distinguishability is sufficient along with mild conditions on activation functions and the target function class. Our result shows that for practical activation functions, neural networks can represent almost all floating-point functions.
Paperid:2170
Authors:Menghua Wu · Umesh Padia · Sean Murphy · Regina Barzilay · Tommi Jaakkola
Abstract:
Identifying variables responsible for changes to a biological system enables applications in drug target discovery and cell engineering. Given a pair of observational and interventional datasets, the goal is to isolate the subset of observed variables that were the targets of the intervention. Directly applying causal discovery algorithms is challenging: the data may contain thousands of variables with as few as tens of samples per intervention, and biological systems do not adhere to classical causality assumptions. We propose a causality-inspired approach to address this practical setting. First, we infer noisy causal graphs from the observational and interventional data. Then, we learn to map the differences between these graphs, along with additional statistical features, to sets of variables that were intervened upon. Both modules are jointly trained in a supervised framework, on simulated and real data that reflect the nature of biological interventions. This approach consistently outperforms baselines for perturbation modeling on seven single-cell transcriptomics datasets. We also demonstrate significant improvements over current causal discovery methods for predicting soft and hard intervention targets across a variety of synthetic data.
Paperid:2171
Authors:Valentin De Bortoli · Alexandre Galashov · J Swaroop Guntupalli · Guangyao Zhou · Kevin Murphy · Arthur Gretton · Arnaud Doucet
Abstract:
Diffusion models generate high-quality synthetic data. They operate by defining a continuous-time forward process which gradually adds Gaussian noise to data until fully corrupted. The corresponding reverse process progressively ``denoises'' a Gaussian sample into a sample from the data distribution. However, generating high-quality outputs requires many discretization steps to obtain a faithful approximation of the reverse process. This is expensive and has motivated the development of many acceleration methods. We propose to accomplish sample generation by learning the posterior {\em distribution} of clean data samples given their noisy versions, instead of only the mean of this distribution. This allows us to sample from the probability transitions of the reverse process on a coarse time scale, significantly accelerating inference with minimal degradation of the quality of the output. This is accomplished by replacing the standard regression loss used to estimate conditional means with a scoring rule. We validate our method on image and robot trajectory generation, where we consistently outperform standard diffusion models at few discretization steps.
Paperid:2172
Authors:Zelai Xu · Wanjun Gu · Chao Yu · Yi Wu · Yu Wang
Abstract:
Large language model (LLM)-based agents have recently shown impressive progress in a variety of domains, including open-ended conversation and multi-step decision-making. However, applying these agents to social deduction games such as Werewolf, which requires both strategic decision-making and free-form language interaction, remains non-trivial. Traditional methods based on Counterfactual Regret Minimization (CFR) or reinforcement learning (RL) typically depend on a predefined action space, making them unsuitable for language games with unconstrained text action spaces. Meanwhile, pure LLM-based agents often suffer from intrinsic biases and require prohibitively large datasets for fine-tuning. We propose Latent Space Policy Optimization (LSPO), an iterative framework that addresses these challenges by first mapping free-form text to a discrete latent space, where methods like CFR and RL can learn strategic policies more effectively. We then translate the learned policy back into natural language dialogues, which are used to fine-tune an LLM via Direct Preference Optimization (DPO). By iteratively alternating between these stages, our LSPO agent progressively enhances both strategic reasoning and language communication. Experiment results on the Werewolf game show that our method improves the agent's performance in each iteration and outperforms existing Werewolf agents, underscoring its promise for free-form language decision-making.
Paperid:2173
Authors:Shuanghao Bai · Wanqi Zhou · Pengxiang Ding · Wei Zhao · Donglin Wang · Badong Chen
Abstract:
Behavior Cloning (BC) is a widely adopted visual imitation learning method in robot manipulation. Current BC approaches often enhance generalization by leveraging large datasets and incorporating additional visual and textual modalities to capture more diverse information. However, these methods overlook whether the learned representations contain redundant information and lack a solid theoretical foundation to guide the learning process. To address these limitations, we adopt an information-theoretic perspective and introduce mutual information to quantify and mitigate redundancy in latent representations. Building on this, we incorporate the Information Bottleneck (IB) principle into BC, which extends the idea of reducing redundancy by providing a structured framework for compressing irrelevant information while preserving task-relevant features. This work presents the first comprehensive study on redundancy in latent representations across various methods, backbones, and experimental settings, while extending the generalizability of the IB to BC. Extensive experiments and analyses on the CortexBench and LIBERO benchmarks demonstrate significant performance improvements with IB, underscoring the importance of reducing input data redundancy and highlighting its value for practical applications. The code can be found in the supplementary materials.
Paperid:2174
Authors:Joery de Vries · Jinke He · Yaniv Oren · Matthijs T. J. Spaan
Abstract:
Monte-Carlo tree search (MCTS) has driven many recent breakthroughs in deep reinforcement learning (RL). However, scaling MCTS to parallel compute has proven challenging in practice, which has motivated alternative planners like sequential Monte-Carlo (SMC). Many of these SMC methods adopt particle filters for smoothing through a reformulation of RL as a policy inference problem. Yet, persisting design choices of these particle filters often conflict with the aim of online planning in RL, which is to obtain a policy improvement at the start of planning. Drawing inspiration from MCTS, we tailor SMC planners specifically for RL by improving data generation within the planner through constrained action sampling and explicit terminal state handling, as well as improving policy and value target estimation. This leads to our Trust-Region Twisted SMC (TRT-SMC), which shows improved runtime and sample-efficiency over baseline MCTS and SMC methods in both discrete and continuous domains.
Paperid:2175
Authors:Hongyin Zhang · Zifeng Zhuang · Han Zhao · Pengxiang Ding · Hongchao Lu · Donglin Wang
Abstract:
Vision-Language-Action (VLA) models have shown great potential in general robotic decision-making tasks via imitation learning. However, the variable quality of training data often constrains the performance of these models. On the other hand, offline Reinforcement Learning (RL) excels at learning robust policy models from mixed-quality data. In this paper, we introduce Reinforced robot GPT (ReinboT), a novel end-to-end VLA model that integrates the RL principle of maximizing cumulative reward. ReinboT achieves a deeper understanding of the data quality distribution by predicting dense returns that capture the nuances of manipulation tasks. The dense return prediction capability enables the robot to generate more robust decision-making actions, oriented towards maximizing future benefits. Extensive experiments show that ReinboT achieves state-of-the-art performance on the CALVIN mixed-quality dataset and exhibits superior few-shot learning and out-of-distribution generalization capabilities in real-world tasks.
Paperid:2176
Authors:Jie Peng · Jenna Ballard · Mohan Zhang · Sukwon Yun · Jiayi Xin · Qi Long · Yanyong Zhang · Tianlong Chen
Abstract:
Medical multimodal learning requires an effective fusion capability of various heterogeneous modalities. One vital challenge is how to effectively fuse modalities when their data quality varies across different modalities and patients. For example, in the TCGA benchmark, the performance of the same modality can differ between types of cancer. Moreover, data collected at different times, locations, and with varying reagents can introduce inter-modal data quality differences ($i.e.$, $\textbf{Modality Batch Effect}$). In response, we propose ${\textbf{A}}$daptive ${\textbf{M}}$odality Token Re-Balan${\textbf{C}}$ing ($\texttt{AMC}$), a novel top-down dynamic multi-modal fusion approach. The core of $\texttt{AMC}$ is to quantify the significance of each modality (Top) and then fuse them according to the modality importance (Down). Specifically, we assess the quality of each input modality and then replace uninformative tokens with inter-modal tokens accordingly. The more important a modality is, the more informative tokens are retained from that modality. The self-attention will further integrate these mixed tokens to fuse multi-modal knowledge. Comprehensive experiments on both medical and general multi-modal datasets demonstrate the effectiveness and generalizability of $\texttt{AMC}$.
Paperid:2177
Authors:Christophe Roux · David Martinez-Rubio · Sebastian Pokutta
Abstract:
We introduce a Riemannian optimistic online learning algorithm for Hadamard manifolds based on inexact implicit updates. Unlike prior work, our method can handle in-manifold constraints, and matches the best known regret bounds in the Euclidean setting with no dependence on geometric constants, like the minimum curvature. Building on this, we develop algorithms for g-convex, g-concave smooth min-max problems on Hadamard manifolds. Notably, one method nearly matches the gradient oracle complexity of the lower bound for Euclidean problems, for the first time.
Paperid:2178
Authors:Amirhossein Kazemnejad · Milad Aghajohari · Eva Portelance · Alessandro Sordoni · Siva Reddy · Aaron Courville · Nicolas Le Roux
Abstract:
Large language models (LLMs) are increasingly applied to complex reasoning tasks that require executing several complex steps before receiving any reward. Properly assigning credit to these steps is essential for enhancing model performance. Proximal Policy Optimization (PPO), a common reinforcement learning (RL) algorithm used for LLM fine-tuning, employs value networks to tackle credit assignment. However, recent approaches achieve strong results without it, raising questions about the efficacy of value networks in practice. In this work, we systematically evaluate the efficacy of value networks and reveal their significant shortcomings in reasoning-heavy LLM tasks, showing that they often produce poor estimates of expected return and barely outperform a random baseline when comparing alternative steps. This motivates our key question: Can improved credit assignment enhance RL training for LLMs? To address this, we propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates. Our method consistently outperforms PPO and other baselines across MATH and GSM8K datasets in less wall-clock time (up to 3.0x). Crucially, it achieves higher test accuracy for a given training accuracy, capturing more generalization signal per sample. These results emphasize the importance of accurate credit assignment in RL training of LLMs.
Paperid:2179
Authors:Stephan Rabanser · Ali Shahin Shamsabadi · Olive Franzese · Xiao Wang · Adrian Weller · Nicolas Papernot
Abstract:
Cautious predictions—where a machine learning model abstains when uncertain—are crucial for limiting harmful errors in safety-critical applications. In this work, we identify a novel threat: a dishonest institution can exploit these mechanisms to discriminate or unjustly deny services under the guise of uncertainty. We demonstrate the practicality of this threat by introducing an uncertainty-inducing attack called Mirage, which deliberately reduces confidence in targeted input regions, thereby covertly disadvantaging specific individuals. At the same time, Mirage maintains high predictive performance across all data points. To counter this threat, we propose Confidential Guardian, a framework that analyzes calibration metrics on a reference dataset to detect artificially suppressed confidence. Additionally, it employs zero-knowledge proofs of verified inference to ensure that reported confidence scores genuinely originate from the deployed model. This prevents the provider from fabricating arbitrary model confidence values while protecting the model’s proprietary details. Our results confirm that Confidential Guardian effectively prevents the misuse of cautious predictions, providing verifiable assurances that abstention reflects genuine model uncertainty rather than malicious intent.
Paperid:2180
Authors:Philippe Hansen-Estruch · David Yan · Ching-Yao Chuang · Orr Zohar · Jialiang Wang · Tingbo Hou · Tao Xu · Sriram Vishwanath · Peter Vajda · Xinlei Chen
Abstract:
Visual tokenization via autoencoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space. However, questions remain about how auto-encoder design impacts reconstruction and downstream generative performance. This work explores scaling in auto-encoders for reconstruction and generation by replacing the convolutional backbone with an enhanced Vision Transformer for Tokenization (ViTok). We find scaling the auto-encoder bottleneck correlates with reconstruction but exhibits a nuanced relationship with generation. Separately, encoder scaling yields no gains, while decoder scaling improves reconstruction with minimal impact on generation. As a result, we determine that scaling the current paradigm of auto-encoders is not effective for improving generation performance. Coupled with Diffusion Transformers, ViTok achieves competitive image reconstruction and generation performance on 256p and 512p ImageNet-1K. In videos, ViTok achieves SOTA reconstruction and generation performance on 16-frame 128p UCF-101.
Paperid:2181
Authors:Maksim Zhdanov · Max Welling · Jan-Willem van de Meent
Abstract:
Large-scale physical systems defined on irregular grids pose significant scalability challenges for deep learning methods, especially in the presence of long-range interactions and multi-scale coupling. Traditional approaches that compute all pairwise interactions, such as attention, become computationally prohibitive as they scale quadratically with the number of nodes. We present Erwin, a hierarchical transformer inspired by methods from computational many-body physics, which combines the efficiency of tree-based algorithms with the expressivity of attention mechanisms. Erwin employs ball tree partitioning to organize computation, which enables linear-time attention by processing nodes in parallel within local neighborhoods of fixed size. Through progressive coarsening and refinement of the ball tree structure, complemented by a novel cross-ball interaction mechanism, it captures both fine-grained local details and global features. We demonstrate Erwin's effectiveness across multiple domains, including cosmology, molecular dynamics, and particle fluid dynamics, where it consistently outperforms baseline methods both in accuracy and computational efficiency.
Paperid:2182
Authors:Muhammed Yusuf Kocyigit · Eleftheria Briakou · Daniel Deutsch · Jiaming Luo · Colin Cherry · Markus Freitag
Abstract:
Data contamination—the accidental consumption of evaluation examples within the pretraining data—can undermine the validity of evaluation benchmarks. In this paper, we present a rigorous analysis of the effects of contamination on language models at 1B and 8B scales on the machine translation task. Starting from a carefully decontaminated train-test split, we systematically introduce contamination at various stages, scales, and data formats to isolate its effect and measure its impact on performance metrics. Our experiments reveal that contamination with both source and target substantially inflates BLEU scores, and this inflation is 2.5 times larger (up to 30 BLEU points) for 8B compared to 1B models. In contrast, source-only and target-only contamination generally produce smaller, less consistent over-estimations. Finally, we study how the temporal distribution and frequency of contaminated samples influence performance over-estimation across languages with varying degrees of data resources.
Paperid:2183
Authors:Dmitrii Kharlapenko · Stepan Shabalin · Arthur Conmy · Neel Nanda
Abstract:
Sparse autoencoders (SAEs) are a popular tool for interpreting large language model activations, but their utility in addressing open questions in interpretability remains unclear. In this work, we demonstrate their effectiveness by using SAEs to deepen our understanding of the mechanism behind in-context learning (ICL). We identify abstract SAE features that (i) encode the model’s knowledge of which task to execute and (ii) whose latent vectors causally induce the task zero-shot. This aligns with prior work showing that ICL is mediated by task vectors. We further demonstrate that these task vectors are well approximated by a sparse sum of SAE latents, including these task-execution features. To explore the ICL mechanism, we adapt the sparse feature circuits methodology of Marks et al. (2024) to work for the much larger Gemma-1 2B model, with 30 times as many parameters, and to the more complex task of ICL. Through circuit finding, we discover task-detecting features with corresponding SAE latents that activate earlier in the prompt and detect when tasks have been performed. They are causally linked with task-execution features through the attention and MLP sublayers.
Paperid:2184
Authors:Hugh Dance · Pierre Glaser · Peter Orbanz · Ryan P. Adams
Abstract:
With the advent of automatic vectorization tools (e.g., \verb|vmap|), writing multi-chain MCMC algorithms is often now as simple as invoking those tools on single-chain code. Whilst convenient, for various MCMC algorithms this results in a synchronization problem---loosely speaking, at each iteration all chains running in parallel must wait until the last chain has finished drawing its sample. In this work, we show how to design single-chain MCMC algorithms in a way that avoids these synchronization overheads when vectorizing with \verb|vmap|, by using the framework of finite state machines (FSMs). Using a simplified model, we derive (i) an exact theoretical form of the obtainable speed-ups using our approach; and (ii) principled recommendations for optimal algorithm design. We implement several popular MCMC algorithms as FSMs, including Elliptical Slice Sampling, HMC-NUTS, and Delayed Rejection, empirically demonstrating speed-ups of up to an order of magnitude in experiments.
Paperid:2185
Authors:Andreas Kalavas · Evangelos Kipouridis · Nithin Varma
Abstract:
In the Correlation Clustering problem, we are given an undirected graph and are tasked with computing a clustering (partition of the nodes) that minimizes the number of edges across different clusters plus the number of non-edges within clusters. In the constrained version of this problem, the goal is to compute a clustering that satisfies additional hard constraints requiring certain pairs to be in the same cluster and certain pairs to be in different clusters. Constrained Correlation Clustering is APX-Hard, and the best known approximation factor is $3$ (van Zuylen *et al.* [SODA '07]). In this work, we present an algorithm that gives a better-than-$2$ approximation, thereby significantly improving the state-of-the-art.
Paperid:2186
Authors:Alexander Norcliffe · Changhee Lee · Fergus Imrie · M van der Schaar · Pietro Lió
Abstract:
Active Feature Acquisition is an instance-wise, sequential decision making problem. The aim is to dynamically select which feature to measure based on current observations, independently for each test instance. Common approaches either use Reinforcement Learning, which experiences training difficulties; or greedily maximize the conditional mutual information of the label and unobserved features, which makes myopic acquisitions. To address these shortcomings, we introduce a latent variable model, trained in a supervised manner. Acquisitions are made by reasoning about the features across many possible unobserved realizations in a stochastic latent space. Extensive evaluation on a large range of synthetic and real datasets demonstrates that our approach reliably outperforms a diverse set of baselines.
Paperid:2187
Authors:Yan Chen · Jerry Bai · Yiteng Zhang · Maria Dimakopoulou · Shi Dong · Qi Sun · Zhengyuan Zhou
Abstract:
Designing learning agents that explore efficiently in a complex environment has been widely recognized as a fundamental challenge in reinforcement learning. While a number of works have demonstrated the effectiveness of techniques based on randomized value functions on a single agent, it remains unclear, from a theoretical point of view, whether injecting randomization can help a society of agents {\it concurrently} explore an environment. The theoretical results established in this work tender an affirmative answer to this question. We adapt the concurrent learning framework to \textit{randomized least-squares value iteration} (RLSVI) with \textit{aggregated state representation}. We demonstrate polynomial worst-case regret bounds in both finite- and infinite-horizon environments. In both setups the per-agent regret decreases at an optimal rate of $\Theta\left(\frac{1}{\sqrt{N}}\right)$, highlighting the advantage of concurrent learning. Our algorithm exhibits significantly lower space complexity compared to \cite{russo2019worst} and \cite{agrawal2021improved}. We reduce the space complexity by a factor of $K$ while incurring only a $\sqrt{K}$ increase in the worst-case regret bound, compared to \citep{agrawal2021improved,russo2019worst}. Interestingly, our algorithm improves the worst-case regret bound of \cite{russo2019worst} by a factor of $H^{1/2}$, matching the improvement in \cite{agrawal2021improved}. However, this result is achieved through a fundamentally different algorithmic enhancement and proof technique. Additionally, we conduct numerical experiments to demonstrate our theoretical findings.
Paperid:2188
Authors:Guoxuan Chen · Han Shi · jiawei li · Yihang Gao · Xiaozhe Ren · Yimeng Chen · Xin Jiang · Zhenguo Li · Weiyang Liu · Chao Huang
Abstract:
Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. However, their substantial sizes pose considerable challenges, particularly in computational demands and inference speed, due to their quadratic complexity. In this work, we have identified a key pattern: certain seemingly meaningless special tokens (i.e., separators) contribute disproportionately to attention scores compared to semantically meaningful tokens. This observation suggests that information of the segments between these separator tokens can be effectively condensed into the separator tokens themselves without significant information loss. Guided by this insight, we introduce SepLLM, a plug-and-play framework that accelerates inference by compressing these segments and eliminating redundant tokens. Additionally, we implement efficient kernels for training acceleration. Experimental results across training-free, training-from-scratch, and post-training settings demonstrate SepLLM's effectiveness. Notably, using the Llama-3-8B backbone, SepLLM achieves over 50% reduction in KV cache on the GSM8K-CoT benchmark while maintaining comparable performance. Furthermore, in streaming settings, SepLLM effectively processes sequences of up to 4 million tokens or more while maintaining consistent language modeling capabilities.
Paperid:2189
Authors:Meilin Wang · Wei Huang · Mingming Gong · Zheng Zhang
Abstract:
Density ratio estimation (DRE) is a paramount task in machine learning, for its broad applications across multiple domains, such as covariate shift adaptation, causal inference, independence tests and beyond. Parametric methods for estimating the density ratio possibly lead to biased results if models are misspecified, while conventional nonparametric methods suffer from the curse of dimensionality when the dimension of data is large. To address these challenges, in this paper, we propose a novel approach for DRE based on the projection pursuit (PP) approximation. The proposed method leverages PP to mitigate the impact of high dimensionality while retaining the model flexibility needed for the accuracy of DRE. We establish the consistency and the convergence rate for the proposed estimator. Experimental results demonstrate that our proposed method outperforms existing alternatives in various applications.
Paperid:2190
Authors:Dan-Xuan Liu · Chao Qian
Abstract:
The subset selection problem with a monotone and submodular objective function under a linear cost constraint has wide applications, such as maximum coverage, influence maximization, and feature selection, just to name a few. Various greedy algorithms have been proposed with good performance both theoretically and empirically. Recently, evolutionary algorithms (EAs), inspired by Darwin's evolution theory, have emerged as a prominent methodology, offering both empirical advantages and theoretical guarantees. Among these, the multi-objective EA, POMC, has demonstrated the best empirical performance to date, achieving an approximation guarantee of $(1/2)(1-1/e)$. However, there remains a gap in the approximation bounds of EAs compared to greedy algorithms, and their full theoretical potential is yet to be realized. In this paper, we re-analyze the approximation performance of POMC theoretically, and derive an improved guarantee of $1/2$, which thus provides theoretical justification for its encouraging empirical performance. Furthermore, we propose a novel multi-objective EA, EPOL, which not only achieves the best-known practical approximation guarantee of $0.6174$, but also delivers superior empirical performance in applications of maximum coverage and influence maximization. We hope this work can not only help better solve the subset selection problem, but also enhance our theoretical understanding of EAs.
Paperid:2191
Authors:Qitian Wu · Chenxiao Yang · Kaipeng Zeng · Michael Bronstein
Abstract:
The capability of generalization is a cornerstone for the success of modern learning systems. For non-Euclidean data that particularly involves topological structures, one important aspect neglected by prior studies is how learning-based models generalize under topological shifts. This paper makes steps towards addressing the topological shifts through the lens of a principled machine learning model dubbed as Advective Diffusion Transformer. The model is derived from advective diffusion equations which describe a class of continuous message passing process with observed and latent topological structures. To justify the model designs, we show that the proposed model has provable capacity for preserving desired robustness against topological shifts, which in contrast cannot be guaranteed by commonly used local diffusion models. Empirically, the model demonstrates superiority in various downstream predictive tasks across information networks, molecular screening and protein interactions.
Paperid:2192
Authors:Xingyu Chen · Jiahao Xu · Tian Liang · Zhiwei He · Jianhui Pang · Dian Yu · Linfeng Song · Qiuzhi Liu · Mengfei Zhou · Zhuosheng Zhang · Rui Wang · Zhaopeng Tu · Haitao Mi · Dong Yu
Abstract:
The remarkable performance of models like the OpenAI o1 can be attributed to their ability to emulate human-like long-time thinking during inference. These models employ extended chain-of-thought (CoT) processes, exploring multiple strategies to enhance problem-solving capabilities. However, a critical question remains: how to intelligently and efficiently scale computational resources during testing? This paper presents the first comprehensive study on the prevalent issue of overthinking in these models, where excessive computational resources are allocated for simple problems with minimal benefit. We introduce novel efficiency metrics from both outcome and process perspectives to evaluate the rational use of computational resources by o1-like models. Using a self-training paradigm, we propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy. Experimental results show that our approach successfully reduces computational overhead while preserving model performance across a range of test sets with varying difficulty levels, such as GSM8K, MATH500, GPQA, and AIME.
Paperid:2193
Authors:Doron Haviv · Aram-Alexandre Pooladian · Dana Pe'er · Brandon Amos
Abstract:
Generative modeling typically concerns transporting a single source distribution to a target distribution via simple probability flows. However, in fields like computer graphics and single-cell genomics, samples themselves can be viewed as distributions, where standard flow matching ignores their inherent geometry. We propose Wasserstein flow matching (WFM), which lifts flow matching onto families of distributions using the Wasserstein geometry. Notably, WFM is the first algorithm capable of generating distributions in high dimensions, whether represented analytically (as Gaussians) or empirically (as point-clouds). Our theoretical analysis establishes that Wasserstein geodesics constitute proper conditional flows over the space of distributions, making for a valid FM objective. Our algorithm leverages optimal transport theory and the attention mechanism, demonstrating versatility across computational regimes: exploiting closed-form optimal transport paths for Gaussian families, while using entropic estimates on point-clouds for general distributions. WFM successfully generates both 2D \& 3D shapes and high-dimensional cellular microenvironments from spatial transcriptomics data. Code is available at WassersteinFlowMatching.
Paperid:2194
Authors:Mason O. Smith · Wenlong Zhang
Abstract:
Humans are often modeled as rational actors by interactive agents when they are in fact frequently observed to make biased decisions. This erroneous assumption may cause an agent's model of the human to fail, especially when interaction occurs in bias-inducing settings that prompt risky decisions. To address this, this paper formulates a risk-sensitive multi-agent coordination problem and presents the novel Risk-Sensitive Theory of Mind (RS-ToM) framework that allows an autonomous agent to reason about and adapt to a partner of unknown risk-sensitivity. In simulated studies, we show that an agent with a RS-ToM is able to coordinate with such a partner better than one that assumes they are rational. Thus, we observe significant improvements to team performance, coordination fluency, compliance with partner risk-preferences, and predictability. The presented results suggest that a RS-ToM will be able to model and plan with partners that exhibit these risk-sensitive biases in the real world.
Paperid:2195
Authors:Jun Wu · jiangtao wen · Yuxing Han
Abstract:
The rapid advancement of large language models (LLMs) has driven extensive research into parameter compression after training has been completed, yet compression during the training phase remains largely unexplored. In this work, we introduce Rate-Constrained Training (RCT), a novel training-time compression approach based on rate-distortion optimization (RDO). RCT enables a flexible trade-off between model accuracy and complexity, significantly reducing parameter redundancy while preserving performance. Experiments across various architectures and tasks demonstrate that RCT can reduce memory usage by 60\% - 90\% without accuracy loss and provides significant compression gain compared to compression after training. Moreover, RCT proves to be highly versatile: it enhances generalization with small Lagrange multipliers, improves model robustness to pruning (maintaining accuracy even at 80\% pruning rates), and enables network simplification for accelerated inference on edge devices.
Paperid:2196
Authors:Kevin Xiao · Noah Marshall · Atish Agarwala · Elliot Paquette
Abstract:
In recent years, SignSGD has garnered interest as both a practical optimizer as well as a simple model to understand adaptive optimizers like Adam. Though there is a general consensus that SignSGD acts to precondition optimization and reshapes noise, quantitatively understanding these effects in theoretically solvable settings remains difficult. We present an analysis of SignSGD in a high dimensional limit, and derive a limiting SDE and ODE to describe the risk. Using this framework we quantify four effects of SignSGD: effective learning rate, noise compression, diagonal preconditioning, and gradient noise reshaping. Our analysis is consistent with experimental observations but moves beyond that by quantifying the dependence of these effects on the data and noise distributions. We conclude with a conjecture on how these results might be extended to Adam.
Paperid:2197
Authors:Shelvia Wongso · Rohan Ghosh · Mehul Motani
Abstract:
Estimating the confidence of deep neural network predictions is crucial for ensuring safe deployment in high-stakes applications. Softmax probabilities, though commonly used, are often poorly calibrated, and existing calibration methods have been shown to be harmful for failure prediction tasks. In this paper, we propose to use information-theoretic measures to estimate the confidence of predictions from trained networks in a post-hoc manner, without needing to modify their architecture or training process. In particular, we compare three pointwise information (PI) measures: pointwise mutual information (PMI), pointwise $\mathcal{V}$-information (PVI), and the recently proposed pointwise sliced mutual information (PSI). We show in this paper that these PI measures naturally relate to confidence estimation. We first study the invariance properties of these PI measures with respect to a broad range of transformations. We then study the sensitivity of the PI measures to geometric attributes such as margin and intrinsic dimensionality, as well as their convergence rates. We finally conduct extensive experiments on benchmark computer vision models and datasets and compare the effectiveness of these measures as tools for confidence estimation. A notable finding is that PVI is better than PMI and PSI for failure prediction and confidence calibration, outperforming all existing baselines for post-hoc confidence estimation. This is consistent with our theoretical findings, which suggest that PVI is the most well-balanced measure in terms of its invariance properties and sensitivity to geometric feature properties such as sample-wise margin.
Paperid:2198
Authors:Jiahe Du · Kaixiong Zhou · Xinyu Hong · Zhaozhuo Xu · Jinbo Xu · Xiao Huang
Abstract:
Generating novel enzymes for target molecules in zero-shot scenarios is a fundamental challenge in biomaterial synthesis and chemical production. Without known enzymes for a target molecule, training generative models becomes difficult due to the lack of direct supervision. To address this, we propose a retrieval-augmented generation method that uses existing enzyme-substrate data to guide enzyme design. Our method retrieves enzymes with substrates that share structural similarities with the target molecule, leveraging functional similarities in catalytic activity. Since none of the retrieved enzymes directly catalyze the target molecule, we use a conditioned discrete diffusion model to generate new enzymes based on the retrieved examples. An enzyme-substrate relationship classifier guides the generation process to ensure optimal protein sequence distributions. We evaluate our model on enzyme design tasks with diverse real-world substrates and show that it outperforms existing protein generation methods in catalytic capability, foldability, and docking accuracy. Additionally, we define the zero-shot substrate-specified enzyme generation task and introduce a dataset with evaluation benchmarks.
Paperid:2199
Authors:Zihan Chen · Song Wang · Zhen Tan · Jundong Li · Cong Shen
Abstract:
In-Context Learning (ICL) empowers Large Language Models (LLMs) to tackle diverse tasks by incorporating multiple input-output examples, known as demonstrations, into the input of LLMs. More recently, advancements in the expanded context windows of LLMs have led to many-shot ICL, which uses hundreds of demonstrations and outperforms few-shot ICL, which relies on fewer examples. However, this approach is often hindered by the high cost of obtaining large amounts of labeled data. To address this challenge, we propose Many-Shot Adaptive Pseudo-LabEling, namely MAPLE, a novel influence-based many-shot ICL framework that utilizes pseudo-labeled samples to compensate for the lack of label information. We first identify a subset of impactful unlabeled samples and perform pseudo-labeling on them by querying LLMs. These pseudo-labeled samples are then adaptively selected and tailored to each test query as input to improve the performance of many-shot ICL without significant labeling costs. Extensive experiments on real-world datasets demonstrate the effectiveness of our framework, showcasing its ability to enhance LLM adaptability and performance with limited labeled data. Our code is provided at https://anonymous.4open.science/r/MAPLE_ICL-5E3D.
Paperid:2200
Authors:Zechen Bai · Hai Ci · Mike Zheng Shou
Abstract:
Synthetic videos are now widely used to complement the scarcity and limited diversity of real-world videos. Current synthetic datasets primarily replicate real-world scenarios, leaving impossible, counterfactual and anti-reality video concepts underexplored. This work aims to answer two questions: 1) Are today's video understanding models good enough for understanding impossible videos? 2) Can today's video generation models effectively follow prompts to create impossible video content? To this end, we introduce IPV-Bench, a novel benchmark designed to evaluate and foster progress in video understanding and generation. IPV-Bench is underpinned by a comprehensive taxonomy, encompassing 4 domains and 14 categories. It features diverse scenes that defy physical, biological, geographical, or social laws. Based on the taxonomy, a prompt suite is constructed to evaluate video generation models, challenging their prompt-following and creativity capabilities. In addition, a video benchmark is curated to assess Video-LLMs on their ability to understand impossible videos, which particularly requires reasoning on temporal dynamics and world knowledge. Comprehensive evaluations reveal limitations and insights for future directions of video models, paving the way for next-generation video models.
Paperid:2201
Authors:Xin Yu · Zelin He · Ying Sun · Lingzhou Xue · Runze Li
Abstract:
Personalized federated learning offers a flexible framework for aggregating information across distributed clients with heterogeneous data. This work considers a personalized federated learning setting that simultaneously learns global and local models. While purely local training has no communication costs, collaborative learning among the clients can leverage shared knowledge to improve statistical accuracy, presenting an accuracy-communication trade-off in personalized federated learning. However, a theoretical analysis of how personalization quantitatively influences sample and algorithmic efficiency and their inherent trade-off is largely unexplored. In this paper, we address this gap and offer theoretical insights for choosing the personalization degree. The theoretical result is validated on both synthetic and real-world datasets and its generalizability is verified in a non-convex setting.
Paperid:2202
Authors:Tom Jacobs · Chao Zhou · Rebekka Burkholz
Abstract:
Implicit bias plays an important role in explaining how overparameterized models generalize well. Explicit regularization like weight decay is often employed in addition to prevent overfitting. While both concepts have been studied separately, in practice, they often act in tandem. Understanding their interplay is key to controlling the shape and strength of implicit bias, as it can be modified by explicit regularization. To this end, we incorporate explicit regularization into the mirror flow framework and analyze its lasting effects on the geometry of the training dynamics, covering three distinct effects: positional bias, type of bias, and range shrinking. Our analytical approach encompasses a broad class of problems, including sparse coding, matrix sensing, single-layer attention, and LoRA, for which we demonstrate the utility of our insights. To exploit the lasting effect of regularization and highlight the potential benefit of dynamic weight decay schedules, we propose to switch off weight decay during training, which can improve generalization, as we demonstrate in experiments.
Paperid:2203
Authors:Nikita Durasov · Assaf Shocher · Doruk Oner · Gal Chechik · Alexei Efros · Pascal Fua
Abstract:
Deep learning models often struggle when deployed in real-world settings due to distribution shifts between training and test data. While existing approaches like domain adaptation and test-time training (TTT) offer partial solutions, they typically require additional data or domain-specific auxiliary tasks. We present Idempotent Test-Time Training (IT3), a novel approach that enables on-the-fly adaptation to distribution shifts using only the current test instance, without any auxiliary task design. Our key insight is that enforcing idempotence---where repeated applications of a function yield the same result---can effectively replace domain-specific auxiliary tasks used in previous TTT methods. We theoretically connect idempotence to prediction confidence and demonstrate that minimizing the distance between successive applications of our model during inference leads to improved out-of-distribution performance. Extensive experiments across diverse domains (including image classification, aerodynamics prediction, and aerial segmentation) and architectures (MLPs, CNNs, GNNs) show that IT3 consistently outperforms existing approaches while being simpler and more widely applicable. Our results suggest that idempotence provides a universal principle for test-time adaptation that generalizes across domains and architectures.
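The idempotence idea in this abstract can be illustrated with a toy numerical sketch: for an exactly idempotent map (a linear projector), the gap between one and two applications is zero, while a non-idempotent map shows a nonzero gap. The function `idempotence_gap` and both toy maps are hypothetical illustrations of the principle, not the paper's model or training procedure.

```python
import numpy as np

def idempotence_gap(f, x):
    """Distance between one and two applications of f.
    A small gap suggests f behaves idempotently at x (the IT3 signal)."""
    y1 = f(x)
    y2 = f(y1)
    return float(np.linalg.norm(y2 - y1))

# A projection onto a coordinate axis is exactly idempotent: P @ P == P.
P = np.array([[1.0, 0.0],
              [0.0, 0.0]])
proj = lambda v: P @ v

x = np.array([3.0, 4.0])
gap_idempotent = idempotence_gap(proj, x)     # zero for a true projector

# A translation is not idempotent, so repeated application drifts.
shift = lambda v: v + 1.0
gap_drifting = idempotence_gap(shift, x)      # nonzero gap
```

At test time, the paper minimizes this kind of gap over model parameters for the current instance; the sketch only evaluates it.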
Paperid:2204
Authors:John Stewart Fabila-Carrasco · He Sun
Abstract:
Given two weighted graphs $G = (V, E, w_G)$ and $H = (V, F, w_H)$ defined on the same vertex set, the constrained clustering problem seeks to find a subset $S \subset V$ that minimises the cut ratio between $w_G(S, V \setminus S)$ and $w_H(S, V \setminus S)$. In this work, we establish a Cheeger-type inequality that relates the solution of the constrained clustering problem to the spectral properties of $G$ and $H$. To reduce computational complexity, we utilise the signed Laplacian of $H$, streamlining calculations while maintaining accuracy. By solving a generalised eigenvalue problem, our proposed algorithm achieves notable performance improvements, particularly in challenging scenarios where traditional spectral clustering methods struggle. We demonstrate its practical effectiveness through experiments on both synthetic and real-world datasets.
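A minimal numerical sketch of the spectral relaxation described above, assuming the generalised eigenproblem $L_G v = \lambda L_H v$ over the two graph Laplacians and rounding the best non-trivial eigenvector by sign. The ridge regularisation and the sign-rounding step are my assumptions for the demo, not details from the paper (which uses the signed Laplacian of $H$).

```python
import numpy as np

def laplacian(W):
    """Unnormalized graph Laplacian L = D - W for a symmetric weight matrix."""
    return np.diag(W.sum(axis=1)) - W

def constrained_spectral_cut(W_G, W_H, reg=1e-6):
    """Sketch of spectral constrained clustering: relax the cut ratio
    cut_G(S) / cut_H(S) to the generalised eigenproblem L_G v = lambda L_H v
    (a small ridge keeps L_H invertible), then round the second-smallest
    eigenvector by sign into a +/-1 partition."""
    L_G = laplacian(W_G)
    L_H = laplacian(W_H) + reg * np.eye(len(W_H))
    vals, vecs = np.linalg.eig(np.linalg.solve(L_H, L_G))
    order = np.argsort(vals.real)
    v = vecs[:, order[1]].real  # skip the trivial all-ones eigenvector
    return np.sign(v)

# Toy instance: vertices {0,1} and {2,3} are tightly connected in G,
# while the constraint graph H is complete with unit weights.
W_G = np.array([[0.00, 1.00, 0.01, 0.00],
                [1.00, 0.00, 0.00, 0.01],
                [0.01, 0.00, 0.00, 1.00],
                [0.00, 0.01, 1.00, 0.00]])
W_H = np.ones((4, 4)) - np.eye(4)
labels = constrained_spectral_cut(W_G, W_H)
```

On this instance the cut {0,1} vs {2,3} has the smallest ratio, and the rounded eigenvector recovers it.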
Paperid:2205
Authors:Pierre Ablin · Angelos Katharopoulos · Skyler Seto · David Grangier
Abstract:
Machine learning models are routinely trained on a mixture of different data domains. Different domain weights yield very different downstream performances. We propose the Soup-of-Experts, a novel architecture that can instantiate a model at test time for any domain weights with minimal computational cost and without re-training the model. Our architecture consists of a bank of expert parameters, which are linearly combined to instantiate one model. We learn the linear combination coefficients as a function of the input domain weights. To train this architecture, we sample random domain weights, instantiate the corresponding model, and backprop through one batch of data sampled with these domain weights. We demonstrate how our approach quickly obtains small specialized models on several language modeling tasks. Soup-of-Experts are particularly appealing when one needs to ship many different specialist models quickly under a model size constraint.
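The instantiation step described above, linearly combining a bank of expert parameters using coefficients derived from domain weights, can be sketched in a few lines. Here `coeff_map` is a fixed matrix standing in for the learned coefficient function, and the expert bank is random; both are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bank of expert parameter vectors: 3 experts, 4 parameters each.
experts = rng.normal(size=(3, 4))

def instantiate(domain_weights, coeff_map):
    """Soup-of-Experts style instantiation sketch: map input domain weights
    to linear-combination coefficients, then merge the expert bank into a
    single parameter vector. `coeff_map` stands in for the learned map."""
    coeffs = coeff_map @ np.asarray(domain_weights)
    return coeffs @ experts

coeff_map = np.eye(3)  # identity for the demo; the paper learns this map
model_params = instantiate([0.5, 0.3, 0.2], coeff_map)
```

With the identity map, the merged model is simply the domain-weighted average of the experts, which makes the test-time cost of producing a specialist model one small matrix product.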
Paperid:2206
Authors:Jinghan Li · Zhicheng Sun · Yadong Mu
Abstract:
In the endeavor to make autonomous robots take actions, task planning is a major challenge that requires translating high-level task descriptions to long-horizon action sequences. Despite recent advances in language model agents, they remain prone to planning errors and limited in their ability to plan ahead. To address these limitations in robotic planning, we advocate a self-refining scheme that iteratively refines a draft plan until an equilibrium is reached. Remarkably, this process can be optimized end-to-end from an analytical perspective without the need to curate additional verifiers or reward models, allowing us to train self-refining planners in a simple supervised learning fashion. Meanwhile, a nested equilibrium sequence modeling procedure is devised for efficient closed-loop planning that incorporates useful feedback from the environment (or an internal world model). Our method is evaluated on the VirtualHome-Env benchmark, showing advanced performance with improved scaling w.r.t. inference-time computation. Code is available at https://github.com/anonymous-icml-2025/equilibrium-planner.
Paperid:2207
Authors:Yaofo Chen · Zeng You · Shuhai Zhang · Haokun Li · Yirui Li · Yaowei Wang · Mingkui Tan
Abstract:
Transformer-based Large Language Models (LLMs) have exhibited remarkable success in extensive tasks primarily attributed to the self-attention mechanism, which requires a token to consider all preceding tokens as its context to compute attention. However, when the context length L becomes very large (e.g., 128K), the amount of potentially redundant information in the context tends to increase. The redundant context not only hampers the modeling representation performance but also incurs unnecessary computational and storage overhead. In this paper, we propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-context modeling, comprising two complementary modules: 1) a globality-aware pooling module that groups input tokens and dynamically compresses each group into one core token based on their significance. In this way, our method automatically focuses on and strengthens core context while diminishing redundancy during the learning process, leading to effective long-term dependency modeling. 2) A locality-preserving module that incorporates neighboring tokens to preserve local context for detailed representation. Notably, our CCA-Attention is able to replace the self-attention module in existing LLMs with minimal fine-tuning cost. Extensive experimental results show the superiority of our method in both long-context modeling and computational efficiency over state-of-the-art methods.
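The globality-aware pooling module can be sketched numerically: split the sequence into groups and compress each group into one core token via a significance-weighted average. The significance scores and group size are given here as placeholders; in CCA-Attention they are learned and the cores feed into attention, which this sketch omits.

```python
import numpy as np

def core_tokens(tokens, group_size, scores):
    """Globality-aware pooling sketch: compress each group of tokens into
    one core token via a softmax-weighted average of its members.
    `scores` are per-token significance values (assumed given here)."""
    n, _ = tokens.shape
    cores = []
    for start in range(0, n, group_size):
        t = tokens[start:start + group_size]
        s = scores[start:start + group_size]
        w = np.exp(s - s.max())
        w /= w.sum()
        cores.append(w @ t)  # weighted average over the group
    return np.stack(cores)

tokens = np.arange(12, dtype=float).reshape(6, 2)  # 6 tokens, dim 2
scores = np.zeros(6)                               # uniform significance
cores = core_tokens(tokens, group_size=3, scores=scores)
```

With uniform scores each core token is just the group mean; learned scores would instead emphasize the most informative token in each group, shrinking a length-L context to roughly L/group_size cores.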
Paperid:2208
Authors:Mengdi Zhang · Goh Kiat · Peixin Zhang · Jun Sun · Lin Rose · Hongyu Zhang
Abstract:
Despite the success of Large Language Models (LLMs) across various fields, their potential to generate untruthful and harmful responses poses significant risks, particularly in critical applications. This highlights the urgent need for systematic methods to detect and prevent such misbehavior. While existing approaches target specific issues such as harmful responses, this work introduces LLMScan, an innovative LLM monitoring technique based on causality analysis, offering a comprehensive solution. LLMScan systematically monitors the inner workings of an LLM through the lens of causal inference, operating on the premise that the LLM's `brain' behaves differently when generating harmful or untruthful responses. By analyzing the causal contributions of the LLM's input tokens and transformer layers, LLMScan effectively detects misbehavior. Extensive experiments across various tasks and models reveal clear distinctions in the causal distributions between normal behavior and misbehavior, enabling the development of accurate, lightweight detectors for a variety of misbehavior detection tasks.
Paperid:2209
Authors:Terje Mildner · Oliver Hamelijnck · Paris Giampouras · Theodoros Damoulas
Abstract:
We introduce FedGVI, a probabilistic Federated Learning (FL) framework that is provably robust to both prior and likelihood misspecification. FedGVI addresses limitations in both frequentist and Bayesian FL by providing unbiased predictions under model misspecification, with calibrated uncertainty quantification. Our approach generalises previous FL approaches, specifically Partitioned Variational Inference (Ashman et al., 2022), by allowing robust and conjugate updates, decreasing computational complexity at the clients. We offer theoretical analysis in terms of fixed-point convergence, optimality of the cavity distribution, and provable robustness. Additionally, we empirically demonstrate the effectiveness of FedGVI in terms of improved robustness and predictive performance on multiple synthetic and real-world classification datasets.
Paperid:2210
Authors:Fangwen Wu · Lechao Cheng · Shengeng Tang · Xiaofeng Zhu · Chaowei Fang · Dingwen Zhang · Meng Wang
Abstract:
Class-incremental learning (CIL) seeks to enable a model to sequentially learn new classes while retaining knowledge of previously learned ones. Balancing flexibility and stability remains a significant challenge, particularly when the task ID is unknown. To address this, our study reveals that the gap in feature distribution between novel and existing tasks is primarily driven by differences in mean and covariance moments. Building on this insight, we propose a novel semantic drift calibration method that incorporates mean shift compensation and covariance calibration. Specifically, we calculate each class's mean by averaging its sample embeddings and estimate task shifts using weighted embedding changes based on their proximity to the previous mean, effectively capturing mean shifts for all learned classes with each new task. We also apply a Mahalanobis distance loss for covariance calibration, aligning class-specific embedding covariances between old and current networks to mitigate the covariance shift. Additionally, we integrate a feature-level self-distillation approach to enhance generalization. Comprehensive experiments on commonly used datasets demonstrate the effectiveness of our approach.
Paperid:2211
Authors:Arnas Uselis · Andrea Dittadi · Seong Joon Oh
Abstract:
Compositional generalization is crucial for human intelligence, yet it remains unclear whether scaling data and model size can solve this challenge, particularly for vision models. We study how data scaling affects compositional generalization in a simplified visual setting. Our key finding is that while models can achieve compositional generalization, this ability critically depends on data diversity. Models develop compositional structure in their latent space only when trained with diverse data, otherwise failing to learn compositional representations despite achieving discrimination. We show that high data diversity leads to linear concept representations, which we demonstrate enables efficient compositional learning. Analyzing large-scale pretrained models through this framework reveals mixed results, suggesting compositional generalization remains challenging.
Paperid:2212
Authors:Vitaly Feldman · Audra McMillan · Guy Rothblum · Kunal Talwar
Abstract:
Pan-privacy was proposed by Dwork et al. (2010) as an approach to designing a private analytics system that retains its privacy properties in the face of intrusions that expose the system's internal state. Motivated by federated telemetry applications, we study {\em local pan-privacy}, where privacy should be retained under repeated unannounced intrusions {\em on the local state}. We consider the problem of monitoring the count of an event in a federated system, where event occurrences on a local device should be hidden even from an intruder on that device. We show that under reasonable constraints, the goal of providing information-theoretic differential privacy under intrusion is incompatible with collecting telemetry information. We then show that this problem can be solved in a scalable way using standard cryptographic primitives.
Paperid:2213
Authors:Zhongming Yu · Hejia Zhang · Yujie Zhao · Hanxian Huang · Matrix Yao · Ke Ding · Jishen Zhao
Abstract:
Recent developments in Large Language Model (LLM) agents are revolutionizing Autonomous Software Engineering (ASE), enabling automated coding, problem fixes, and feature improvements. However, localization -- precisely identifying software problems by navigating to relevant code sections -- remains a significant challenge. Current approaches often yield suboptimal results due to a lack of effective integration between LLM agents and precise code search mechanisms. This paper introduces OrcaLoca, an LLM agent framework that improves accuracy for software issue localization by integrating priority-based scheduling for LLM-guided actions, action decomposition with relevance scoring, and distance-aware context pruning. Experimental results demonstrate that OrcaLoca becomes the new open-source state-of-the-art (SOTA) in function match rate (65.33%) on SWE-bench Lite. It also improves the final resolved rate of an open-source framework by 6.33 percentage points through its patch generation integration.
Paperid:2214
Authors:Tyler Ingebrand · Adam Thorpe · Ufuk Topcu
Abstract:
A central challenge in transfer learning is designing algorithms that can quickly adapt and generalize to new tasks without retraining. Yet, the conditions under which algorithms can effectively transfer to new tasks are poorly characterized. We introduce a geometric characterization of transfer in Hilbert spaces and define three types of inductive transfer: interpolation within the convex hull, extrapolation to the linear span, and extrapolation outside the span. We propose a method grounded in the theory of function encoders to achieve all three types of transfer. Specifically, we introduce a novel training scheme for function encoders using least-squares optimization, prove a universal approximation theorem for function encoders, and provide a comprehensive comparison with existing approaches such as transformers and meta-learning on four diverse benchmarks. Our experiments demonstrate that the function encoder outperforms state-of-the-art methods on four benchmark tasks and on all three types of transfer.
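The least-squares step at the heart of a function encoder can be sketched with a fixed polynomial basis: a new function is represented by the coefficients that best reproduce it from basis evaluations. The basis here is hand-picked for the demo; the method learns its basis functions, which this sketch does not show.

```python
import numpy as np

def encode(basis_values, target_values):
    """Least-squares coefficients representing a new function in a given
    basis -- the function-encoder idea, sketched with a fixed basis."""
    coeffs, *_ = np.linalg.lstsq(basis_values, target_values, rcond=None)
    return coeffs

# Hypothetical basis {1, x, x^2} evaluated on sample points.
x = np.linspace(-1.0, 1.0, 50)
basis = np.stack([np.ones_like(x), x, x**2], axis=1)

# A target function lying in the span: "interpolation/extrapolation to the
# linear span" in the abstract's terminology, so recovery is exact.
target = 2.0 - 3.0 * x + 0.5 * x**2
coeffs = encode(basis, target)
```

Functions inside the span are recovered exactly from their coefficients; functions outside the span (the third transfer type) would leave a nonzero residual, which is what the geometric characterization classifies.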
Paperid:2215
Authors:Benjamin Plaut · Hanlin Zhu · Stuart Russell
Abstract:
Most learning algorithms with formal regret guarantees assume that all mistakes are recoverable and essentially rely on trying all possible behaviors. This approach is problematic when some mistakes are catastrophic, i.e., irreparable. We propose an online learning problem where the goal is to minimize the chance of catastrophe. Specifically, we assume that the payoff in each round represents the chance of avoiding catastrophe that round and try to maximize the product of payoffs (the overall chance of avoiding catastrophe) while allowing a limited number of queries to a mentor. We first show that in general, any algorithm either constantly queries the mentor or is nearly guaranteed to cause catastrophe. However, in settings where the mentor policy class is learnable in the standard online model, we provide an algorithm whose regret and rate of querying the mentor both approach 0 as the time horizon grows. Conceptually, if a policy class is learnable in the absence of catastrophic risk, it is learnable in the presence of catastrophic risk if the agent can ask for help.
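The product-of-payoffs objective is what makes this setting unforgiving, and a tiny numeric example makes that concrete: one near-zero round destroys the overall survival probability no matter how good the other rounds are. The payoff values below are invented for illustration.

```python
import numpy as np

# Each round's payoff is the chance of avoiding catastrophe that round;
# the overall chance of survival is the PRODUCT of the payoffs, not a sum,
# so a single bad round is fatal to the objective.
payoffs_safe = np.array([0.99, 0.98, 0.99, 0.97])
payoffs_risky = np.array([0.99, 0.98, 0.05, 0.97])  # one catastrophic round

survival_safe = payoffs_safe.prod()    # high: every round was safe
survival_risky = payoffs_risky.prod()  # collapses due to round 3
```

This is why additive regret bounds (which tolerate a few large mistakes) are the wrong yardstick here, and why the paper instead bounds the multiplicative objective while budgeting mentor queries.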
Paperid:2216
Authors:Fabian Spaeh · Atsushi Miyauchi
Abstract:
Maximizing a single submodular set function subject to a cardinality constraint is a well-studied and central topic in combinatorial optimization. However, finding a set that maximizes multiple functions at the same time is much less understood, even though it is a formulation which naturally occurs in robust maximization or problems with fairness considerations such as fair influence maximization or fair allocation. In this work, we consider the problem of maximizing the minimum over many submodular functions subject to a cardinality constraint, which is known as multi-objective submodular maximization. All known polynomial-time approximation algorithms either obtain a weak approximation guarantee or rely on the evaluation of the multilinear extension. The latter is expensive to evaluate and renders such algorithms impractical. We bridge this gap and introduce the first scalable and practical algorithm that obtains the best-known approximation guarantee. We furthermore introduce a novel application, fair centrality maximization, and show how it can be addressed via multi-objective submodular maximization. In our experimental evaluation, we show that our algorithm outperforms known algorithms in terms of objective value and running time.
Paperid:2217
Authors:Chengmei Niu · Zhenyu Liao · Zenan Ling · Michael Mahoney
Abstract:
A substantial body of work in machine learning (ML) and randomized numerical linear algebra (RandNLA) has exploited various sorts of random sketching methodologies, including random sampling and random projection, with much of the analysis using Johnson-Lindenstrauss and subspace embedding techniques. Recent studies have identified the issue of inversion bias -- the phenomenon that inverses of random sketches are not unbiased, despite the unbiasedness of the sketches themselves. This bias presents challenges for the use of random sketches in various ML pipelines, such as fast stochastic optimization, scalable statistical estimators, and distributed optimization. In the context of random projection, the inversion bias can be easily corrected for dense Gaussian projections (which are, however, too expensive for many applications). Recent work has shown how the inversion bias can be corrected for sparse sub-gaussian projections. In this paper, we show how the inversion bias can be corrected for random sampling methods, both uniform and non-uniform leverage-based, as well as for structured random projections, including those based on the Hadamard transform. Using these results, we establish problem-independent local convergence rates for sub-sampled Newton methods.
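The inversion-bias phenomenon the abstract refers to already shows up in one dimension, by Jensen's inequality: a noisy estimate can be unbiased while its reciprocal is not. The one-dimensional setup below is a deliberately simplified stand-in for matrix sketches, not the paper's estimators or corrections.

```python
import numpy as np

rng = np.random.default_rng(0)

# A "sketch" s of the positive quantity a = 2 that is unbiased, E[s] = a,
# built from uniform noise. Its inverse is biased upward: E[1/s] > 1/a by
# Jensen's inequality. The same effect afflicts inverses of unbiased
# random sketches of matrices.
a = 2.0
s = a + rng.uniform(-1.0, 1.0, size=200_000)

mean_s = s.mean()            # close to a = 2
mean_inv = (1.0 / s).mean()  # noticeably above 1/a = 0.5
```

Here E[1/(2+U)] with U ~ Unif(-1, 1) equals ln(3)/2 ≈ 0.549, so the inverse overshoots 1/a = 0.5 by roughly 10%; the paper's contribution is correcting the analogous bias for sampling and structured-projection sketches.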
Paperid:2218
Authors:Francesco Cagnetta · Hyunmo Kang · Matthieu Wyart
Abstract:
Recent theories suggest that Neural Scaling Laws arise whenever the task is linearly decomposed into units that are power-law distributed. Alternatively, scaling laws also emerge when data exhibit a hierarchically compositional structure, as is thought to occur in language and images. To unify these views, we consider classification and next-token prediction tasks based on probabilistic context-free grammars—probabilistic models that generate data via a hierarchy of production rules. For classification, we show that having power-law distributed production rules results in a power-law learning curve with an exponent depending on the rules’ distribution and a large multiplicative constant that depends on the hierarchical structure. By contrast, for next-token prediction, the distribution of production rules controls the fine details of the learning curve, but not the exponent describing the large-scale behaviour.
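The "power-law-distributed units" picture can be made concrete with a bag-of-units toy model: if unit frequencies follow a power law, the expected error (the total frequency of units unseen in n samples) decays steadily with n. This simplification drops the grammatical hierarchy that the paper actually analyses; the exponent 1.5 and the unit count are arbitrary choices for the demo.

```python
import numpy as np

# K task units with power-law frequencies p_i ~ i^(-1.5), normalised.
K = 1000
p = 1.0 / np.arange(1, K + 1) ** 1.5
p /= p.sum()

def expected_error(n):
    """Expected probability mass of units still unseen after n i.i.d.
    training samples -- a toy proxy for test error in the units picture."""
    return float(np.sum(p * (1.0 - p) ** n))

errs = [expected_error(n) for n in (10, 100, 1000)]
```

Plotting `expected_error` against n on log-log axes would show the approximately straight line characteristic of a neural scaling law, with a slope tied to the frequency exponent.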
Paperid:2219
Authors:Guibin Zhang · Yanwei Yue · Xiangguo Sun · Guancheng Wan · Miao Yu · Junfeng Fang · Kun Wang · Tianlong Chen · Dawei Cheng
Abstract:
Recent advancements in large language model (LLM)-based agents have demonstrated that collective intelligence can significantly surpass the capabilities of individual agents, primarily due to well-crafted inter-agent communication topologies. Despite the diverse and high-performing designs available, practitioners often face confusion when selecting the most effective pipeline for their specific task: \textit{Which topology is the best choice for my task, avoiding unnecessary communication token overhead while ensuring high-quality solutions?} In response to this dilemma, we introduce G-Designer, an adaptive, efficient, and robust solution for multi-agent deployment, which dynamically designs task-aware, customized communication topologies. Specifically, G-Designer models the multi-agent system as a multi-agent network, leveraging a variational graph auto-encoder to encode both the nodes (agents) and a task-specific virtual node, and decodes a task-adaptive and high-performing communication topology. Extensive experiments on six benchmarks showcase that G-Designer is: \textbf{(1) high-performing}, achieving superior results on MMLU with accuracy at $84.50\%$ and on HumanEval with pass@1 at $89.90\%$; \textbf{(2) task-adaptive}, architecting communication protocols tailored to task difficulty, reducing token consumption by up to $95.33\%$ on HumanEval; and \textbf{(3) adversarially robust}, defending against agent adversarial attacks with merely a $0.3\%$ accuracy drop.
Paperid:2220
Authors:Tomoharu Iwata · Shinsaku Sakaue
Abstract:
We propose a data-driven method for reducing the dimensionality of linear programming problems (LPs) by generating instance-specific projection matrices using a neural network-based model. Once the model is trained using multiple LPs by maximizing the expected objective value, we can efficiently find high-quality feasible solutions of newly given LPs. Our method can shorten the computational time of any LP solver due to its solver-agnostic nature; it can provide feasible solutions by relying on a projection that reduces the number of variables; and it can handle LPs of different sizes using neural networks with permutation equivariance and invariance. We also provide a theoretical analysis of the generalization bound for learning a neural network to generate projection matrices that reduce the size of LPs. Our experimental results demonstrate that our method can obtain solutions with higher quality than the existing methods, while its computational time is significantly shorter than solving the original LPs.
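The reduction step behind this approach can be sketched without any solver: substituting x = P y with a nonnegative projection matrix P turns an LP in x into a smaller LP in y, and any feasible y lifts back to a feasible x with the same objective value. The data and the hand-built P below are illustrative; the paper learns instance-specific P with a neural network and hands the reduced LP to an off-the-shelf solver.

```python
import numpy as np

# Original LP: min c^T x  s.t.  A x <= b,  x >= 0  (4 variables).
A = np.array([[1.0, 2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0, 1.0]])
b = np.array([4.0, 6.0])
c = np.array([-1.0, -2.0, -1.0, -1.0])

# Nonnegative projection tying variables together: 4 variables -> 2.
P = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])

# Reduced LP data: min (P^T c)^T y  s.t.  (A P) y <= b,  y >= 0.
A_red, c_red = A @ P, P.T @ c

y = np.array([1.0, 1.0])  # a feasible point of the reduced LP
x = P @ y                 # lift back to the original variables

feasible = bool(np.all(A @ x <= b + 1e-9) and np.all(x >= 0))
objective_match = bool(np.isclose(c @ x, c_red @ y))
```

Because P is nonnegative, y >= 0 implies x = P y >= 0, and A x = (A P) y <= b carries over directly, which is why solving the small LP yields a genuine feasible solution of the original one.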
Paperid:2221
Authors:Aman Gupta · Shao Tang · Qingquan Song · Sirou Zhu · Jiwoo Hong · Ankan Saha · Viral Gupta · Noah Lee · Eunki Kim · Siyu Zhu · Parag Agrawal · Natesh Pillai · Sathiya Keerthi
Abstract:
Reinforcement Learning with Human Feedback (RLHF) and its variants have made huge strides toward the effective alignment of large language models (LLMs) to follow instructions and reflect human values. More recently, Direct Alignment Algorithms (DAAs) have emerged in which the reward modeling stage of RLHF is skipped by characterizing the reward directly as a function of the policy being learned. Examples include Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO). These methods often suffer from likelihood displacement, a phenomenon by which the probabilities of preferred responses are often reduced undesirably. In this paper, we argue that, for DAAs, the reward (function) shape matters. We introduce \textbf{AlphaPO}, a new DAA method that leverages an $\alpha$-parameter to help change the shape of the reward function beyond the standard log reward. AlphaPO helps maintain fine-grained control over likelihood displacement and over-optimization. Compared to SimPO, one of the best-performing DAAs, AlphaPO leads to about 7\% to 10\% relative improvement in alignment performance for the instruct versions of Mistral-7B and Llama3-8B. The analysis and results presented highlight the importance of the reward shape, and how one can systematically change it to affect training dynamics, as well as improve alignment performance.
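One natural way a single parameter can bend a reward curve beyond the standard log reward is the Box-Cox style family (x^α - 1)/α, which recovers log x in the limit α → 0. This is a sketch of the general idea of an α-parameterised reward shape, and is not claimed to be AlphaPO's exact parameterisation.

```python
import numpy as np

def alpha_reward(likelihood_ratio, alpha):
    """Box-Cox style alpha-family of reward shapes: (x^a - 1)/a, which
    tends to log(x) as a -> 0. Illustrates how one parameter reshapes the
    reward curve; not necessarily the paper's exact formula."""
    x = np.asarray(likelihood_ratio, dtype=float)
    if abs(alpha) < 1e-12:
        return np.log(x)       # the standard log reward as the limit
    return (x**alpha - 1.0) / alpha

x = np.array([0.5, 1.0, 2.0])
r_log = alpha_reward(x, 0.0)   # standard log reward
r_bent = alpha_reward(x, 0.5)  # same zero-crossing, different curvature
```

Every member of the family vanishes at a likelihood ratio of 1 and is increasing, but the curvature away from 1 (and hence the gradient pressure on preferred vs. dispreferred responses) changes with α, which is the kind of knob the abstract argues matters.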
Paperid:2222
Authors:Ilias Diakonikolas · Giannis Iakovidis · Daniel Kane · Thanasis Pittas
Abstract:
We study the algorithmic problem of robust mean estimation of an identity covariance Gaussian in the presence of mean-shift contamination. In this contamination model, we are given a set of points in $\mathbb{R}^d$ generated i.i.d. via the following process. For a parameter $\alpha<1/2$, the $i$-th sample $x_i$ is obtained as follows: with probability $1-\alpha$, $x_i$ is drawn from $\mathcal{N}(\mu, I)$, where $\mu \in \mathbb{R}^d$ is the target mean; and with probability $\alpha$, $x_i$ is drawn from $\mathcal{N}(z_i, I)$, where $z_i$ is unknown and potentially arbitrary. Prior work characterized the information-theoretic limits of this task. Specifically, it was shown that -- in contrast to Huber contamination -- in the presence of mean-shift contamination consistent estimation is possible. On the other hand, all known robust estimators in the mean-shift model have running times exponential in the dimension. Here we give the first computationally efficient algorithm for high-dimensional robust mean estimation with mean-shift contamination that can tolerate a constant fraction of outliers. In particular, our algorithm has near-optimal sample complexity, runs in sample-polynomial time, and approximates the target mean to any desired accuracy. Conceptually, our result contributes to a growing body of work that studies inference with respect to natural noise models lying in between fully adversarial and random settings.
Paperid:2223
Authors:Lennart Schneider · Martin Wistuba · Aaron Klein · Jacek Golebiowski · Giovanni Zappella · Felice Antonio Merra
Abstract:
Optimal prompt selection is crucial for maximizing large language model (LLM) performance on downstream tasks, especially in black-box settings where models are only accessible via APIs. Black-box prompt selection is challenging due to potentially large, combinatorial search spaces, lack of gradient information, and the high evaluation cost of prompts on a validation set. We propose HbBoPs, a novel method that combines a structure-aware deep kernel Gaussian Process with Hyperband as a multi-fidelity scheduler to efficiently select prompts. HbBoPs uses embeddings of instructions and few-shot exemplars, treating them as modular components within prompts. This enhances the surrogate model's ability to predict which prompt to evaluate next in a sample-efficient manner. Hyperband improves query-efficiency by adaptively allocating resources across different fidelity levels, reducing the number of validation instances required for evaluating prompts. Extensive experiments across ten diverse benchmarks and three LLMs demonstrate that HbBoPs outperforms state-of-the-art methods in both performance and efficiency.
Paperid:2224
Authors:Wen Lai · Alexander Fraser · Ivan Titov
Abstract:
Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, are commonly used to adapt LLMs. However, the effectiveness of standard PEFT methods is limited in low-resource scenarios with only a few hundred examples. Recent advances in interpretability research have inspired the emergence of activation editing techniques, which modify the activations of specific model components. These methods, due to their extremely small parameter counts, show promise for small datasets. However, their performance is highly dependent on identifying the correct modules to edit and often lacks stability across different datasets. In this paper, we propose Joint Localization and Activation Editing (JoLA), a method that jointly learns (1) which heads in the Transformer to edit, (2) whether the intervention should be additive, multiplicative, or both, and (3) the intervention parameters themselves -- the vectors applied as additive offsets or multiplicative scalings to the head output. Through evaluations on three benchmarks spanning commonsense reasoning, natural language understanding, and natural language generation, we demonstrate that JoLA consistently outperforms existing methods. The code for the method is released at https://anonymous.4open.science/r/jola-42C5
Paperid:2225
Authors:Elias Kempf · Simon Schrodi · Max Argus · Thomas Brox
Abstract:
The remarkable generalization performance of contrastive vision-language models like CLIP is often attributed to the diversity of their training distributions. However, key questions remain unanswered: Can CLIP generalize to an entirely unseen domain when trained on a diverse mixture of domains (domain generalization)? Can it generalize to unseen classes within partially seen domains (compositional generalization)? What factors affect such generalization? To answer these questions, we trained CLIP models on systematically constructed training distributions with controlled domain diversity and object class exposure. Our experiments show that domain diversity is essential for both domain and compositional generalization, yet compositional generalization can be surprisingly weaker than domain generalization when the training distribution contains a suboptimal subset of the test domain. Through data-centric and mechanistic analyses, we find that successful generalization requires learning of shared representations already in intermediate layers and shared circuitry.
Paperid:2226
Authors:Jintao Tong · Yixiong Zou · Guangyao Chen · Yuhua Li · Ruixuan Li
Abstract:
Cross-Domain Few-Shot Segmentation (CD-FSS) aims to transfer knowledge from a large-scale source-domain dataset to unseen target-domain datasets with limited annotated samples. Current methods typically compare the distance between training and testing samples for mask prediction. However, we find that an entanglement problem exists in this widely adopted method, which tends to bind source-domain patterns together and make each of them hard to transfer. In this paper, we aim to address this problem for the CD-FSS task. We first find a natural decomposition of the ViT structure, based on which we delve into the entanglement problem for an interpretation. We find the decomposed ViT components are crossly compared between images in distance calculation, where the rational comparisons are entangled with the meaningless ones by their equal importance, leading to the entanglement problem. Based on this interpretation, we further propose to address the entanglement problem by learning to weigh all comparisons of ViT components, which learns disentangled features and re-composes them for the CD-FSS task, benefiting both generalization and finetuning. Experiments show that our model outperforms the state-of-the-art CD-FSS method by 1.92% and 1.88% in average accuracy under 1-shot and 5-shot settings, respectively.
Paperid:2227
Authors:Shanshan Luo · Yu yixuan · Chunchen LIU · Feng Xie · zhi geng
Abstract:
Previous studies have extensively addressed the attribution problem for binary outcome variables. However, in many practical scenarios, the outcome variable is continuous, and simply binarizing it may result in information loss or biased conclusions. To address this issue, we propose a series of posterior causal estimands for retrospectively evaluating multiple correlated causes from a continuous outcome. These estimands include posterior intervention effects, posterior total causal effects, and posterior natural direct effects. Under assumptions of sequential ignorability, monotonicity, and perfect positive rank, we show that the posterior causal estimands of interest are identifiable and present the corresponding identification equations. We also provide a simple but effective estimation procedure and establish asymptotic properties of the proposed estimators. An artificial hypertension example and a real developmental toxicity dataset are employed to illustrate our method.
Paperid:2228
Authors:Vicente Balmaseda · Bokun Wang · Lin · Tianbao Yang
Abstract:
In self-supervised contrastive learning, negative pairs are typically constructed using an anchor image and a sample drawn from the entire dataset, excluding the anchor. However, this approach can result in the creation of negative pairs with similar semantics, referred to as "false negatives", leading to their embeddings being falsely pushed apart. To address this issue, we introduce GloFND, an optimization-based approach that automatically learns on the fly the threshold for each anchor to identify its false negatives during training. In contrast to previous methods for false negative discovery, our approach globally detects false negatives across the entire dataset rather than locally within the mini-batch. Moreover, its per-iteration computation cost remains independent of the dataset size. Experimental results on image and image-text data demonstrate the effectiveness of the proposed method.
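For illustration, the filtering step described in this abstract can be sketched as follows. This is a minimal sketch, not the paper's method: GloFND learns each anchor's threshold on the fly via optimization, whereas here `threshold` is a fixed hypothetical value and similarity is plain cosine.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def filter_false_negatives(anchor, candidates, threshold):
    """Split candidate negatives into kept negatives and suspected
    false negatives, using a per-anchor similarity threshold."""
    kept, false_neg = [], []
    for c in candidates:
        (false_neg if cosine(anchor, c) >= threshold else kept).append(c)
    return kept, false_neg

# toy embeddings: the first candidate is semantically close to the anchor
anchor = [1.0, 0.0]
cands = [[0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
kept, fn = filter_false_negatives(anchor, cands, threshold=0.9)
```

Candidates above the anchor's threshold would be excluded from (or down-weighted in) the contrastive loss rather than pushed apart.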
Paperid:2229
Authors:Yujun Kim · Jaeyoung Cha · Chulhee Yun
Abstract:
Recent theoretical results demonstrate that the convergence rates of permutation-based SGD (e.g., random reshuffling SGD) are faster than uniform-sampling SGD; however, these studies focus mainly on the large epoch regime, where the number of epochs $K$ exceeds the condition number $\kappa$. In contrast, little is known when $K$ is smaller than $\kappa$, and it is still a challenging open question whether permutation-based SGD can converge faster in this small epoch regime (Safran and Shamir, 2021). As a step toward understanding this gap, we study the naive deterministic variant, Incremental Gradient Descent (IGD), on smooth and strongly convex functions. Our lower bounds reveal that for the small epoch regime, IGD can exhibit surprisingly slow convergence even when all component functions are strongly convex. Furthermore, when some component functions are allowed to be nonconvex, we prove that the optimality gap of IGD can be significantly worse throughout the small epoch regime. Our analyses reveal that the convergence properties of permutation-based SGD in the small epoch regime may vary drastically depending on the assumptions on component functions. Lastly, we supplement the paper with tight upper and lower bounds for IGD in the large epoch regime.
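For reference, the IGD update studied here applies each component gradient once per epoch, in a fixed deterministic order. The quadratic components and step size below are illustrative, not from the paper; the example also exhibits the cyclic bias of a fixed ordering, since the fixed point of one epoch is not the minimizer of the average objective.

```python
def igd(grads, x0, lr, epochs):
    """Incremental Gradient Descent: each epoch sweeps the component
    gradients once, in a fixed deterministic order (no random sampling)."""
    x = x0
    for _ in range(epochs):
        for g in grads:
            x = x - lr * g(x)
    return x

# two strongly convex components f_i(x) = 0.5 * (x - c_i)^2 with c = +1, -1;
# the minimizer of the average objective is 0, yet IGD's one-epoch fixed
# point solves x = 0.81 x - 0.01, i.e. x* = -0.01 / 0.19
grads = [lambda x: x - 1.0, lambda x: x + 1.0]
x_final = igd(grads, x0=5.0, lr=0.1, epochs=50)
```

The bias vanishes as the step size shrinks, but it illustrates why deterministic orderings need an analysis separate from uniform sampling.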
Paperid:2230
Authors:Shuaicheng Niu · Guohao Chen · Peilin Zhao · Tianyi Wang · Pengcheng Wu · Zhiqi Shen
Abstract:
In this paper, we seek to develop a versatile test-time adaptation (TTA) objective for a variety of tasks — classification and regression across image-, object-, and pixel-level predictions. We achieve this through a self-bootstrapping scheme that optimizes prediction consistency between the test image (as target) and its deteriorated view. The key challenge lies in devising effective augmentations/deteriorations that: i) preserve the image’s geometric information, e.g., object sizes and locations, which is crucial for TTA on object/pixel-level tasks, and ii) provide sufficient learning signals for TTA. To this end, we analyze how common distribution shifts affect the image's information power across spatial frequencies in the Fourier domain, and reveal that low-frequency components carry high power, so masking these components supplies more learning signals, while masking high-frequency components does not. In light of this, we randomly mask the low-frequency amplitude of an image in its Fourier domain for augmentation. Meanwhile, we also augment the image with noise injection to compensate for missing learning signals at high frequencies, by enhancing the information power there. Experiments show that, either independently or as a plug-and-play module, our method achieves superior results across classification, segmentation, and 3D monocular detection tasks with both transformer and CNN models. Code will be released.
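A 1D sketch of the augmentation described above (the paper operates on 2D image spectra): mask low-frequency amplitudes in the Fourier domain, then inject noise to supply signal at high frequencies. The naive O(n^2) DFT keeps the sketch self-contained; `cutoff` and `noise_std` are illustrative parameters, not the paper's.

```python
import cmath, math, random

def dft(x):
    """Naive discrete Fourier transform of a real-valued sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse DFT; returns the real part of each reconstructed sample."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def mask_low_freq(x, cutoff, noise_std=0.0, seed=0):
    """Zero out low-frequency amplitudes (|freq| < cutoff, DC included),
    then optionally inject Gaussian noise in the signal domain."""
    n = len(x)
    X = dft(x)
    for k in range(n):
        freq = min(k, n - k)          # frequencies wrap around the spectrum
        if freq < cutoff:
            X[k] = 0.0
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, noise_std) for v in idft(X)]

x = [math.sin(2 * math.pi * t / 8) for t in range(8)]
y = mask_low_freq(x, cutoff=2, noise_std=0.3)
```

With `cutoff=2`, a pure frequency-1 sine is removed entirely; the noise term re-populates high frequencies, mirroring the paper's two complementary augmentations.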
Paperid:2231
Authors:Gwen Yidou-Weng · Benjie Wang · Guy Van den Broeck
Abstract:
As large language models (LLMs) advance, there is an increasing need to control their outputs to align with human values (e.g., detoxification) or desired attributes (e.g., personalization, topic). However, autoregressive models focus on next-token predictions and struggle with global properties that require looking ahead. Existing solutions either tune or post-train LMs for each new attribute—expensive and inflexible—or approximate the Expected Attribute Probability (EAP) of future sequences by sampling or training, which is slow and unreliable for rare attributes. We introduce TRACE (Tractable Probabilistic Reasoning for Adaptable Controllable Generation), a novel framework that efficiently predicts EAP and adapts to new attributes through tractable generation and lightweight control. TRACE distills a Hidden Markov Model (HMM) from a language model and pairs it with a small classifier to score EAP, reweighting next-token probabilities for globally compliant continuations. Empirically, TRACE achieves state-of-the-art results in detoxification with negligible decoding overhead, adapts to 76 low-resource personalized LLMs within seconds, and seamlessly extends to composite attributes. Our findings underscore TRACE’s strong performance, efficiency, flexibility, and compositionality for ensuring global attribute satisfaction.
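The control step can be illustrated as a simple reweighting. This sketch assumes per-token EAP scores are already available (in TRACE they come from the distilled HMM paired with a small classifier); the token names and numbers are invented for illustration.

```python
def reweight_next_token(probs, eap):
    """Reweight next-token probabilities by each continuation's expected
    attribute probability: p'(x) proportional to p(x) * EAP(x)."""
    scored = {tok: p * eap[tok] for tok, p in probs.items()}
    z = sum(scored.values())
    return {tok: s / z for tok, s in scored.items()}

probs = {"kind": 0.4, "rude": 0.6}   # hypothetical LM next-token distribution
eap = {"kind": 0.9, "rude": 0.1}     # hypothetical attribute scores
new = reweight_next_token(probs, eap)
```

Tokens whose continuations are likely to satisfy the global attribute are boosted, steering generation without retraining the LM.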
Paperid:2232
Authors:Yuta Tarumi · Keisuke Fukuda · Shin-ichi Maeda
Abstract:
Data assimilation for nonlinear state space models (SSMs) is inherently challenging due to non-Gaussian posteriors. We propose Deep Bayesian Filtering (DBF), a novel approach to data assimilation in nonlinear SSMs. DBF introduces latent variables $h_t$ in addition to physical variables $z_t$, ensuring Gaussian posteriors by (i) constraining state transitions in the latent space to be linear and (ii) learning a Gaussian inverse observation operator $r(h_t|o_t)$. This structured posterior design enables analytical recursive computation, avoiding the accumulation of Monte Carlo sampling errors over time steps. DBF optimizes these operators and other latent SSM parameters by maximizing the evidence lower bound. Experiments demonstrate that DBF outperforms existing methods in scenarios with highly non-Gaussian posteriors.
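The analytic recursion that DBF's linear-latent, Gaussian-observation design enables can be illustrated in the simplest scalar linear-Gaussian case, where it reduces to the Kalman filter; the transition and noise parameters below are illustrative.

```python
def kalman_step(mean, var, a, q, r, obs):
    """One linear-Gaussian recursion: predict through h' = a*h + noise(q),
    then update with observation obs = h' + noise(r). Closed form, no sampling."""
    # predict
    m_pred = a * mean
    v_pred = a * a * var + q
    # update (Kalman gain)
    k = v_pred / (v_pred + r)
    m_post = m_pred + k * (obs - m_pred)
    v_post = (1.0 - k) * v_pred
    return m_post, v_post

# filter three noisy observations of a roughly constant latent state
m, v = 0.0, 1.0
for obs in [1.2, 0.9, 1.1]:
    m, v = kalman_step(m, v, a=1.0, q=0.01, r=0.25, obs=obs)
```

Because every step is a closed-form Gaussian update, no Monte Carlo error accumulates over time, which is the property the abstract highlights.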
Paperid:2233
Authors:Nuojin Cheng · Leonard Papenmeier · Stephen Becker · Luigi Nardi
Abstract:
Bayesian optimization is a widely used method for optimizing expensive black-box functions, with Expected Improvement (EI) being one of the most commonly used acquisition functions. In contrast, information-theoretic acquisition functions aim to reduce uncertainty about the function’s optimum and are often considered fundamentally distinct from EI. In this work, we challenge this prevailing perspective by introducing a unified theoretical framework, Variational Entropy Search, which reveals that EI and information-theoretic acquisition functions are more closely related than previously recognized. We demonstrate that EI can be interpreted as a variational inference approximation of the popular information-theoretic acquisition function Max-value Entropy Search (MES). Building on this insight, we propose VES-Gamma, a novel acquisition function that balances the strengths of EI and MES. Extensive empirical evaluations across both low- and high-dimensional synthetic and real-world benchmarks demonstrate that VES-Gamma is competitive with state-of-the-art acquisition functions and in many cases outperforms EI and MES.
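For reference, the closed-form EI under a Gaussian posterior, which the paper reinterprets as a variational approximation of MES; a minimal stdlib implementation (maximization convention, optional exploration offset `xi`):

```python
import math

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best, xi=0.0):
    """Closed-form EI for a Gaussian posterior N(mu, sigma^2):
    EI = (mu - best - xi) * Phi(z) + sigma * phi(z), z = (mu - best - xi)/sigma."""
    if sigma <= 0.0:
        return max(mu - best - xi, 0.0)
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * norm_cdf(z) + sigma * norm_pdf(z)
```

At `mu == best` with unit variance, EI reduces to the normal density at zero; EI grows with the posterior mean, which is the exploitation term the framework relates to MES.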
Paperid:2234
Authors:Mingyu Xu · Bingning Wang · Weipeng Chen
Abstract:
Current large language models (LLMs) predominantly rely on decode-only transformer architectures, which exhibit exceptional in-context learning (ICL) capabilities. It is widely acknowledged that the cornerstone of their ICL ability lies in the induction heads mechanism, which necessitates at least two layers of attention. To more effectively harness the model's induction capabilities, we revisit the induction heads mechanism and provide theoretical proof that KV shifting attention reduces the model's dependency on the depth and width of the induction heads mechanism. Our experimental results confirm that KV shifting attention enhances the learning of induction heads and improves language modeling performance. This leads to superior performance or accelerated convergence, spanning from toy models to pre-trained models with over 10 billion parameters.
Paperid:2235
Authors:Noam Levi · Yaron Oz
Abstract:
We study universal properties in real-world complex and synthetically generated datasets. Our approach is to analogize data to a physical system and employ tools from statistical physics and Random Matrix Theory (RMT) to reveal their underlying structure. Examining the local and global eigenvalue statistics of feature-feature covariance matrices, we find: (i) bulk eigenvalue power-law scaling vastly differs between uncorrelated Gaussian and real-world data, (ii) this power-law behavior is reproducible using Gaussian data with long-range correlations, (iii) all dataset types exhibit chaotic RMT universality, (iv) RMT statistics emerge at smaller dataset sizes than typical training sets, correlating with power-law convergence, (v) Shannon entropy correlates with RMT structure and requires fewer samples in strongly correlated datasets. These results suggest natural image Gram matrices can be approximated by Wishart random matrices with simple covariance structure, enabling rigorous analysis of neural network behavior.
Paperid:2236
Authors:Quan Zhou · Changhua Pei · Fei Sun · Jianhui LI · haiming zhang · Gaogang Xie · Dan Pei · Zhengwei Gao · HanJing
Abstract:
Time series anomaly detection (TSAD) underpins real-time monitoring in cloud services and web systems, allowing rapid identification of anomalies to prevent costly failures. Most TSAD methods driven by forecasting models tend to overfit by emphasizing minor fluctuations. Our analysis reveals that effective TSAD should focus on modeling “normal” behavior through smooth local patterns. To achieve this, we reformulate time series modeling as approximating the series with smooth univariate functions. The local smoothness of each univariate function ensures that the fitted time series remains resilient against local disturbances. However, a direct Kolmogorov-Arnold Network (KAN) implementation proves susceptible to these disturbances due to the inherently localized characteristics of B-spline functions. We thus propose KAN-AD, replacing B-splines with truncated Fourier expansions and introducing a novel lightweight learning mechanism that emphasizes global patterns while staying robust to local disturbances. On four popular TSAD benchmarks, KAN-AD achieves an average 15% improvement in detection accuracy (with peaks exceeding 27%) over state-of-the-art baselines. Remarkably, it requires fewer than 1,000 trainable parameters, resulting in a 50% faster inference speed compared to the original KAN, demonstrating the approach’s efficiency and practical viability.
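The move from localized B-splines to truncated Fourier expansions can be illustrated with a discrete projection fit on a uniform grid, where the Fourier basis is orthogonal. This is a sketch of the representation only, not the paper's learning mechanism; the series and `order` are toy choices.

```python
import math

def fit_truncated_fourier(series, order):
    """Fit a0 + sum_k (a_k cos(k*w*t) + b_k sin(k*w*t)) by discrete
    projection on a uniform grid of length n, with w = 2*pi/n."""
    n = len(series)
    w = 2.0 * math.pi / n
    a0 = sum(series) / n
    coeffs = []
    for k in range(1, order + 1):
        a = 2.0 / n * sum(series[t] * math.cos(w * k * t) for t in range(n))
        b = 2.0 / n * sum(series[t] * math.sin(w * k * t) for t in range(n))
        coeffs.append((a, b))
    def f(t):
        return a0 + sum(a * math.cos(w * (k + 1) * t) + b * math.sin(w * (k + 1) * t)
                        for k, (a, b) in enumerate(coeffs))
    return f

# a series that lies exactly in the truncated basis is recovered exactly
series = [math.sin(2 * math.pi * t / 64) + 0.5 for t in range(64)]
f = fit_truncated_fourier(series, order=3)
```

Each coefficient depends on the whole series, so a single-point spike perturbs every coefficient only slightly, in contrast to a B-spline whose local support absorbs the spike.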
Paperid:2237
Authors:Bernal Jimenez Gutierrez · Yiheng Shu · Weijian Qi · Sizhe Zhou · Yu Su
Abstract:
Our ability to continuously acquire, organize, and leverage knowledge is a key feature of human intelligence that AI systems must approximate to unlock their full potential. Given the challenges in continual learning with large language models (LLMs), retrieval-augmented generation (RAG) has become the dominant way to introduce new information. However, its reliance on vector retrieval hinders its ability to mimic the dynamic and interconnected nature of human long-term memory. Recent RAG approaches augment vector embeddings with various structures like knowledge graphs to address some of these gaps, namely sense-making and associativity. However, their performance on more basic factual memory tasks drops considerably below standard RAG. We address this unintended deterioration and propose ContinualRAG, a framework that outperforms standard RAG comprehensively on factual, sense-making, and associative memory tasks. ContinualRAG builds upon the Personalized PageRank algorithm used in HippoRAG and enhances it with deeper passage integration and more effective online use of an LLM. This combination pushes this RAG system closer to the effectiveness of human long-term memory, achieving a 7% improvement in associative memory tasks over the state-of-the-art embedding model while also exhibiting superior factual knowledge and sense-making memory capabilities. This work paves the way for non-parametric continual learning for LLMs.
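The Personalized PageRank primitive that this line of work builds on (via HippoRAG) can be sketched with power iteration over an adjacency list; the graph and restart vector below are toy values, not from the paper.

```python
def personalized_pagerank(adj, restart, damping=0.85, iters=100):
    """Power iteration for Personalized PageRank: with probability
    (1 - damping) the walker restarts at query-relevant nodes (`restart`)."""
    n = len(adj)
    p = restart[:]
    for _ in range(iters):
        new = [(1.0 - damping) * restart[i] for i in range(n)]
        for i in range(n):
            if adj[i]:                      # spread mass along out-edges
                share = damping * p[i] / len(adj[i])
                for j in adj[i]:
                    new[j] += share
            else:                           # dangling node: spread uniformly
                for j in range(n):
                    new[j] += damping * p[i] / n
        p = new
    return p

# 3-node cycle 0 -> 1 -> 2 -> 0, restarting at node 0 (the "query" node)
scores = personalized_pagerank(adj=[[1], [2], [0]], restart=[1.0, 0.0, 0.0])
```

Mass concentrates near the restart node and decays along the graph, which is how such systems rank passages associated with a query through multi-hop links.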
Paperid:2238
Authors:Siqi Zhang · Xing Huang · Feihu Huang
Abstract:
Bilevel optimization is widely applied in many machine learning tasks such as hyperparameter learning and meta learning. Recently, many algorithms have been proposed to solve these bilevel optimization problems. However, these algorithms rely on the smoothness of the bilevel objective functions. In fact, some machine learning tasks, such as language model training, do not satisfy this smoothness condition. More recently, some methods have begun to study generalized-smooth bilevel optimization. However, these methods only consider the (strongly) convex lower-level objective function; moreover, they handle a generalized-smooth upper-level objective but still require a standard smooth lower-level objective. To fill this gap, in this paper we study generalized-smooth bilevel optimization with a nonconvex lower-level objective function, where both the upper-level and lower-level objectives are generalized-smooth. We propose an efficient Hessian/Jacobian-free penalty normalized gradient (i.e., PNGBiO) method. Moreover, we provide a convergence analysis of our PNGBiO method and prove that it obtains a gradient complexity of $O(\epsilon^{-4})$ for finding an $\epsilon$-stationary solution. Experimental results demonstrate the efficiency of our proposed method.
Paperid:2239
Authors:Maximilian Beck · Korbinian Pöppel · Phillip Lippe · Richard Kurle · Patrick Blies · Günter Klambauer · Sebastian Böck · Sepp Hochreiter
Abstract:
Recent breakthroughs in solving reasoning, math and coding problems with Large Language Models (LLMs) have been enabled by investing substantial computation budgets at inference time. Therefore, inference speed is one of the most critical properties of LLM architectures, and there is a growing need for LLMs that are efficient and fast at inference. Recently, LLMs built on the xLSTM architecture have emerged as a powerful alternative to Transformers, offering linear compute scaling with sequence length and constant memory usage, both highly desirable properties for efficient inference. However, such xLSTM-based LLMs have yet to be scaled to larger models and assessed and compared with respect to inference speed and efficiency. In this work, we introduce xLSTM 7B, a 7-billion-parameter LLM that combines xLSTM’s architectural benefits with targeted optimizations for fast and efficient inference. Our experiments demonstrate that xLSTM 7B achieves performance on downstream tasks comparable to other similar-sized LLMs, while providing significantly faster inference speeds and greater efficiency compared to Llama- and Mamba-based LLMs. These results establish xLSTM 7B as the fastest and most efficient 7B LLM, offering a solution for tasks that require large amounts of test-time computation. Our work highlights xLSTM's potential as a foundational architecture for methods building on heavy use of LLM inference.
Paperid:2240
Authors:Yuxiang Chen · Haocheng Xi · Jun Zhu · Jianfei Chen
Abstract:
Pretraining Transformers in FP4 precision is becoming a promising approach to gain substantial speedup, but it comes with a considerable loss of accuracy. The Microscaling (MX) data format provides a fine-grained per-group quantization method to improve the representation ability of the FP4 format and is supported by the next-generation Blackwell GPU architecture. However, training with the MXFP4 data format still results in significant degradation, and there is a lack of systematic research on the reason. In this work, we propose a novel training method, TetraJet, for more accurate FP4 training. We comprehensively evaluate all of the quantizers involved in the training, and identify the weight oscillation problem in the forward pass as the main source of the degradation in MXFP4 training. Therefore, we introduce two novel methods, EMA Quantizer (Q-EMA) and Adaptive Ramping Optimizer (Q-Ramping), to resolve the oscillation problem. Extensive experiments on Vision Transformers demonstrate that TetraJet consistently outperforms the existing 4-bit training methods, and Q-EMA & Q-Ramping can provide additional enhancement by effectively reducing oscillation. We decreased the accuracy degradation by more than 50% compared to the baseline, and can even achieve competitive performance compared to full-precision training.
Paperid:2241
Authors:Hechuan Wen · Tong Chen · Mingming Gong · Li Kheng Chai · Shazia Sadiq · Hongzhi Yin
Abstract:
Although numerous complex algorithms for treatment effect estimation have been developed in recent years, their effectiveness remains limited when handling insufficiently labeled training sets due to the high cost of labeling the effect after treatment, e.g., expensive tumor imaging or biopsy procedures needed to evaluate treatment effects. Therefore, it becomes essential to actively incorporate more high-quality labeled data, all while adhering to a constrained labeling budget. To enable data-efficient treatment effect estimation, we formalize the problem through rigorous theoretical analysis within the active learning context, where the derived key measures -- the factual and counterfactual covering radius -- determine the risk upper bound. To reduce the bound, we propose a greedy radius reduction algorithm, which excels under an idealized, balanced data distribution. To generalize to more realistic data distributions, we further propose FCCM, which transforms the optimization objective into the Factual and Counterfactual Coverage Maximization to ensure effective radius reduction during data acquisition. Furthermore, benchmarking FCCM against other baselines demonstrates its superiority across both fully synthetic and semi-synthetic datasets. Anonymous code is at: https://anonymous.4open.science/r/PaperID_2979.
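The idealized greedy radius-reduction step can be sketched as farthest-point (k-center) selection: repeatedly label the pool point farthest from the current labeled set, which shrinks the covering radius. FCCM itself optimizes factual and counterfactual coverage jointly, which this toy example does not capture; the points and budget are illustrative.

```python
def greedy_radius_reduction(pool, labeled, budget, dist):
    """Greedy k-center selection: each round picks the unlabeled pool point
    whose distance to the labeled set (plus earlier picks) is largest."""
    labeled = list(labeled)
    chosen = []
    for _ in range(budget):
        far = max((p for p in pool if p not in chosen and p not in labeled),
                  key=lambda p: min(dist(p, q) for q in labeled + chosen))
        chosen.append(far)
    return chosen

pool = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (5.0, 5.0)]
dist = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
picks = greedy_radius_reduction(pool, labeled=[(0.0, 0.0)], budget=2, dist=dist)
```

Each pick halves the worst-case distance from any pool point to a labeled one, which is exactly the covering radius the risk bound depends on.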
Paperid:2242
Authors:In Huh · Changwook Jeong · Muhammad Alam
Abstract:
Out-Of-Domain (OOD) generalization is a significant challenge in learning dynamical systems, especially when they exhibit bifurcation, a sudden topological transition triggered by a model parameter crossing a critical threshold. A prevailing belief is that machine learning models, unless equipped with strong priors, struggle to generalize across bifurcations due to the abrupt changes in data characteristics. Contrary to this belief, we demonstrate that context-dependent Neural Ordinary Differential Equations (NODEs), trained solely on localized, pre-bifurcation, symmetric data yet without physics-based priors, can identify post-bifurcation, symmetry-breaking behaviors, even in a zero-shot manner. We attribute this capability to the model's implicit utilization of topological invariants, particularly the Poincaré index, and offer a formal explanation based on the Poincaré-Hopf theorem. We derive the conditions under which NODEs can recover—or erroneously hallucinate—broken symmetries without explicit training. Building on this insight, we showcase a topological regularizer inspired by the Poincaré-Hopf theorem, providing empirical justification with the Landau-Khalatnikov system.
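The topological invariant invoked here, the Poincaré index of an isolated equilibrium, can be computed numerically as the winding number of the vector field along a small loop; a stdlib sketch (the sampling resolution is an illustrative choice):

```python
import math

def poincare_index(field, cx, cy, radius=1.0, samples=720):
    """Winding number of a planar vector field along a circle around (cx, cy):
    accumulates the wrapped change in the field's direction angle."""
    total = 0.0
    prev = None
    for s in range(samples + 1):
        th = 2.0 * math.pi * s / samples
        vx, vy = field(cx + radius * math.cos(th), cy + radius * math.sin(th))
        ang = math.atan2(vy, vx)
        if prev is not None:
            d = ang - prev
            while d > math.pi:
                d -= 2.0 * math.pi
            while d < -math.pi:
                d += 2.0 * math.pi
            total += d
        prev = ang
    return round(total / (2.0 * math.pi))
```

Nodes, foci, and centers have index +1 while saddles have index -1; the Poincaré-Hopf theorem constrains how these indices can change, which is the structure the paper's regularizer exploits.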
Paperid:2243
Authors:Yong Liu · Guo Qin · Zhiyuan Shi · Zhi Chen · Caiyin Yang · Xiangdong Huang · Jianmin Wang · Mingsheng Long
Abstract:
We introduce Sundial, a family of native, flexible, and scalable time series foundation models. To predict the next patch's distribution, we propose a TimeFlow Loss based on flow-matching, which facilitates native pre-training of Transformers on time series without discrete tokenization. Conditioned on arbitrary-length time series, our model is pre-trained without specifying any prior distribution and can generate multiple probable predictions, achieving flexibility in representation learning beyond using parametric densities. Towards time series foundation models, we leverage minimal but crucial adaptations of Transformers and curate TimeBench with 1 trillion time points, comprising mostly real-world datasets and synthetic data. By mitigating mode collapse through TimeFlow Loss, we pre-train a family of Sundial models on TimeBench, which exhibit unprecedented model capacity and generalization performance on zero-shot forecasting. In addition to presenting good scaling behavior, Sundial achieves new state-of-the-art on both point forecasting and probabilistic forecasting benchmarks. We believe that Sundial's pioneering generative paradigm will facilitate a wide variety of forecasting scenarios.
Paperid:2244
Authors:Yuheng Jing · Kai Li · Bingyun Liu · Ziwen Zhang · Haobo Fu · Qiang Fu · Junliang Xing · Jian Cheng
Abstract:
Offline Opponent Modeling (OOM) aims to learn an adaptive autonomous agent policy that dynamically adapts to opponents using an offline dataset from multi-agent games. Previous work assumes that the dataset is optimal. However, this assumption is difficult to satisfy in the real world. When the dataset is suboptimal, existing approaches struggle to work. To tackle this issue, we propose a simple and general algorithmic improvement framework, Truncated Q-driven Instant Policy Refinement (TIPR), to handle the suboptimality of OOM algorithms induced by datasets. The TIPR framework is plug-and-play in nature. Compared to original OOM algorithms, it requires only two extra steps: (1) Learn a horizon-truncated in-context action-value function, namely Truncated Q, using the offline dataset. The Truncated Q estimates the expected return within a fixed, truncated horizon and is conditioned on opponent information. (2) Use the learned Truncated Q to instantly decide whether to perform policy refinement and to generate policy after refinement during testing. Theoretically, we analyze the rationale of Truncated Q from the perspective of No Maximization Bias probability. Empirically, we conduct extensive comparison and ablation experiments in four representative competitive environments. TIPR effectively improves various OOM algorithms pretrained with suboptimal datasets.
Paperid:2245
Authors:Savelii Chezhegov · Klyukin Yaroslav · Andrei Semenov · Aleksandr Beznosikov · Alexander Gasnikov · Samuel Horváth · Martin Takac · Eduard Gorbunov
Abstract:
Methods with adaptive stepsizes, such as AdaGrad and Adam, are essential for training modern Deep Learning models, especially Large Language Models. Typically, the noise in the stochastic gradients is heavy-tailed for the latter. Gradient clipping provably helps to achieve good high-probability convergence under such noise. However, despite the similarity between AdaGrad/Adam and Clip-SGD, the current understanding of the high-probability convergence of AdaGrad/Adam-type methods is limited in this case. In this work, we prove that AdaGrad/Adam (and their delayed versions) can have provably bad high-probability convergence if the noise is heavy-tailed. We also show that gradient clipping fixes this issue, i.e., we derive new high-probability convergence bounds with polylogarithmic dependence on the confidence level for AdaGrad-Norm and Adam-Norm with clipping and with/without delay for smooth convex/non-convex stochastic optimization with heavy-tailed noise. Our empirical evaluations highlight the superiority of clipped versions of AdaGrad/Adam-Norm in handling heavy-tailed noise.
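A sketch of the clipped AdaGrad-Norm update discussed above, applied to a toy quadratic; the step-size and clipping constants are illustrative, and this is not the paper's exact algorithm or analysis.

```python
import math

def clip(g, max_norm):
    """Rescale the gradient so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in g))
    if norm <= max_norm or norm == 0.0:
        return g
    scale = max_norm / norm
    return [v * scale for v in g]

def adagrad_norm_clipped(grad, x0, eta=1.0, max_norm=1.0, steps=2000):
    """AdaGrad-Norm with clipping: a single scalar accumulator b^2 scales
    the step, and clipping bounds the effect of heavy-tailed gradient noise."""
    x, b2 = list(x0), 1e-8
    for _ in range(steps):
        g = clip(grad(x), max_norm)
        b2 += sum(v * v for v in g)
        step = eta / math.sqrt(b2)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

# toy objective f(x) = 0.5 * ||x||^2, whose gradient is x itself
x_star = adagrad_norm_clipped(lambda x: x, [5.0, -3.0])
```

Clipping caps how much any single (possibly heavy-tailed) gradient can inflate both the update and the accumulator, which is the mechanism behind the polylogarithmic high-probability bounds.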
Paperid:2246
Authors:Geon-Woo Kim · Junbo Li · Shashidhar Gandham · Omar Baldonado · Adithya Gangidi · Pavan Balaji · Zhangyang “Atlas” Wang · Aditya Akella
Abstract:
Training large language models (LLMs) increasingly relies on geographically distributed accelerators, causing prohibitive communication costs across regions and uneven utilization of heterogeneous hardware. We propose HALoS, a hierarchical asynchronous optimization framework that tackles these issues by introducing local parameter servers (LPSs) within each region and a global parameter server (GPS) that merges updates across regions. This hierarchical design minimizes expensive inter-region communication, reduces straggler effects, and leverages fast intra-region links. We provide a rigorous convergence analysis for HALoS under non-convex objectives, including theoretical guarantees on the role of hierarchical momentum in asynchronous training. Empirically, HALoS attains up to 7.5× faster convergence than synchronous baselines in geo-distributed LLM training and improves upon existing asynchronous methods by up to 2.1×. Crucially, HALoS preserves the model quality of fully synchronous SGD—matching or exceeding accuracy on standard language modeling and downstream benchmarks—while substantially lowering total training time. These results demonstrate that hierarchical, server-side update accumulation and global model merging are powerful tools for scalable, efficient training of new-era LLMs in heterogeneous, geo-distributed environments.
Paperid:2247
Authors:Abrar Majeedi · Viswanatha Reddy Gajjala · Satya Sai Srinath Namburi GNVV · Nada Elkordi · Yin Li
Abstract:
Real-world time series are often governed by complex and nonlinear underlying dynamics. Understanding these dynamics is therefore crucial for precise future prediction. While deep learning has achieved notable success in time series forecasting, many existing approaches do not explicitly model the dynamics. To bridge this gap, we introduce DeepEDM, a framework that seamlessly integrates nonlinear dynamical systems modeling with deep neural networks. Inspired by empirical dynamic modeling (EDM) and rooted in Takens' theorem, DeepEDM presents a novel neural network that leverages time-delayed embeddings within a latent space, and employs kernel regression to approximate the underlying dynamics, allowing for accurate prediction of future data points. To evaluate our method, we conduct comprehensive experiments on both synthetic and real-world data. Our results show that by incorporating dynamic modeling, DeepEDM significantly improves forecasting accuracy, outperforming state-of-the-art methods.
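The two ingredients named above, time-delay embedding and kernel regression, can be sketched in a few lines. This is a plain Nadaraya-Watson forecaster on a toy sine series, not the paper's neural architecture; `dim`, `tau`, and `bandwidth` are illustrative.

```python
import math

def delay_embed(series, dim, tau=1):
    """Takens-style time-delay embedding: time t maps to the vector
    (x_t, x_{t-tau}, ..., x_{t-(dim-1)*tau})."""
    start = (dim - 1) * tau
    return [[series[t - k * tau] for k in range(dim)]
            for t in range(start, len(series))]

def kernel_forecast(series, dim=3, tau=1, bandwidth=0.2):
    """Nadaraya-Watson regression in the embedded space: predict the next
    value as a Gaussian-kernel-weighted average of historical successors."""
    pts = delay_embed(series, dim, tau)
    query, history = pts[-1], pts[:-1]      # every history point has a successor
    start = (dim - 1) * tau
    num = den = 0.0
    for i, p in enumerate(history):
        d2 = sum((a - b) ** 2 for a, b in zip(p, query))
        w = math.exp(-d2 / (2.0 * bandwidth ** 2))
        num += w * series[start + i + 1]    # the value that followed point p
        den += w
    return num / den

series = [math.sin(0.3 * t) for t in range(200)]
pred = kernel_forecast(series)
```

States that were close in the delay-embedded space tend to evolve similarly, so averaging their successors approximates the underlying flow without an explicit model.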
Paperid:2248
Authors:Tuan Dam
Abstract:
Monte-Carlo Tree Search (MCTS) has demonstrated success in online planning for deterministic environments, yet significant challenges remain in adapting it to stochastic Markov Decision Processes (MDPs), particularly in continuous state-action spaces. Existing methods, such as HOOT, which combines MCTS with the Hierarchical Optimistic Optimization (HOO) bandit strategy, address continuous spaces but rely on a logarithmic exploration bonus that lacks theoretical guarantees in non-stationary, stochastic settings. Recent advancements, such as Poly-HOOT, introduced a polynomial bonus term to achieve convergence in deterministic MDPs, though a similar theory for stochastic MDPs remains undeveloped. In this paper, we propose a novel MCTS algorithm, Stochastic-Power-HOOT, designed for continuous, stochastic MDPs. Stochastic-Power-HOOT integrates a power mean as a value backup operator, alongside a polynomial exploration bonus to address the non-stationarity inherent in continuous action spaces. Our theoretical analysis establishes that Stochastic-Power-HOOT converges at a polynomial rate of $\mathcal{O}(n^{-1/2})$, where $n$ is the number of visited trajectories, thereby extending the non-asymptotic convergence guarantees of Poly-HOOT to stochastic environments. Experimental results on synthetic and stochastic tasks validate our theoretical findings, demonstrating the effectiveness of Stochastic-Power-HOOT in continuous, stochastic domains.
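The power-mean backup operator at the heart of such methods can be written down directly; visit counts act as weights, and the exponent p interpolates between the average backup (p = 1) and the max backup (p tending to infinity). A stdlib sketch assuming nonnegative values:

```python
def power_mean(values, counts, p):
    """Weighted power mean backup:
    ((sum_i c_i * v_i^p) / sum_i c_i)^(1/p), for nonnegative values v_i."""
    total = sum(counts)
    return (sum(c * v ** p for v, c in zip(values, counts)) / total) ** (1.0 / p)
```

Larger p weights high-value children more heavily, trading the stability of the average backup against the optimism of the max backup.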
Paperid:2249
Authors:Sharan Vaswani · Reza Babanezhad
Abstract:
Armijo line-search (Armijo-LS) is a standard method to set the step-size for gradient descent (GD). For smooth functions, Armijo-LS alleviates the need to know the global smoothness constant $L$ and adapts to the ``local'' smoothness, enabling GD to converge faster. However, existing theoretical analyses of GD with Armijo-LS ($\texttt{GD-LS}$) do not characterize this fast convergence. We show that if the objective function satisfies a certain non-uniform smoothness condition, $\texttt{GD-LS}$ converges provably faster than GD with a constant $1/L$ step-size (denoted as $\texttt{GD(1/L)}$). Our results imply that for convex losses corresponding to logistic regression and multi-class classification, $\texttt{GD-LS}$ can converge to the optimum at a linear rate, and hence, improve over the sublinear convergence of $\texttt{GD(1/L)}$. Furthermore, for non-convex losses satisfying gradient domination (for example, those corresponding to the softmax policy gradient in RL or generalized linear models with a logistic link function), $\texttt{GD-LS}$ can match the fast convergence of algorithms tailored for these specific settings. Finally, we prove that under the interpolation assumption, for convex losses, stochastic GD with a stochastic line-search can match the fast convergence of $\texttt{GD-LS}$.
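A minimal sketch of GD with Armijo backtracking on an ill-conditioned quadratic ($L = 100$): the accepted step adapts to local smoothness instead of being fixed at $1/L$. The constants (`eta_max`, `c`, `beta`) are illustrative.

```python
def armijo_gd(f, grad, x0, eta_max=10.0, c=0.5, beta=0.8, steps=1000):
    """Gradient descent with Armijo backtracking: start from a large trial
    step and shrink it geometrically until the sufficient-decrease condition
    f(x - eta*g) <= f(x) - c * eta * ||g||^2 holds."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        gnorm2 = sum(v * v for v in g)
        if gnorm2 == 0.0:
            break
        eta = eta_max
        while f([xi - eta * gi for xi, gi in zip(x, g)]) > f(x) - c * eta * gnorm2:
            eta *= beta
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x

# f(x) = 0.5 * x^T diag(1, 100) x : smoothness constant L = 100
f = lambda x: 0.5 * (x[0] ** 2 + 100.0 * x[1] ** 2)
g = lambda x: [x[0], 100.0 * x[1]]
sol = armijo_gd(f, g, [1.0, 1.0])
```

Once the ill-conditioned coordinate is small, the accepted steps grow well beyond $1/L$, which is the adaptivity to local smoothness the paper formalizes.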
Paperid:2250
Authors:Lorenzo Magnino · Yuchen Zhu · Mathieu Lauriere
Abstract:
Optimal stopping is a fundamental problem in optimization with applications in risk management, finance, robotics, and machine learning. We extend the standard framework to a multi-agent setting, named multi-agent optimal stopping (MAOS), where agents cooperate to make optimal stopping decisions in a finite-space, discrete-time environment. Since solving MAOS becomes computationally prohibitive as the number of agents grows large, we study the mean field optimal stopping (MFOS) problem, obtained as the number of agents tends to infinity. We establish that MFOS provides a good approximation to MAOS and prove a dynamic programming principle (DPP) based on mean-field control theory. We then propose two deep learning approaches: one that learns optimal stopping decisions by simulating full trajectories and another that leverages the DPP to compute the value function and to learn the optimal stopping rule using backward induction. Both methods train neural networks to approximate optimal stopping policies. We demonstrate the effectiveness and the scalability of our work through numerical experiments on 6 different problems in spatial dimension up to 300. To the best of our knowledge, this is the first work to formalize and computationally solve MFOS in discrete time and finite space, opening new directions for scalable multi-agent optimal stopping methods.
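The backward-induction idea behind the DPP can be illustrated in the simplest single-agent, finite-horizon special case (the MFOS setting additionally couples agents through the population distribution, which this sketch omits): work backward from the horizon, comparing the stop payoff with the expected continuation value.

```python
def stopping_values(offers, horizon):
    """Backward induction for optimal stopping with i.i.d. uniform offers:
    v[t] = E[max(X, v[t+1])], with v[horizon] = 0 (no offer remains). The
    optimal rule is 'accept offer x at time t iff x >= v[t+1]'."""
    v = [0.0] * (horizon + 1)
    for t in range(horizon - 1, -1, -1):
        v[t] = sum(max(x, v[t + 1]) for x in offers) / len(offers)
    return v

# offers drawn uniformly from {0, 1} over 3 rounds
v = stopping_values(offers=[0.0, 1.0], horizon=3)
```

The values grow as more rounds remain (here 0.5, 0.75, 0.875), quantifying the option value of waiting; the deep learning approaches in the paper approximate exactly this recursion at scale.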
Paperid:2251
Authors:Ze Cheng · Wang Xiaoqiang · Zhuoyu Li · Jianing Huang · Zhizhou Zhang · Zhongkai Hao · Hang Su
Abstract:
PDE-Constrained Optimization (PDECO) problems can be accelerated significantly by employing gradient-based methods with surrogate models like neural operators compared to traditional numerical solvers. However, this approach faces two key challenges: (1) \textbf{Data efficiency}: Efficiently sampling data and effectively training neural operators for the specific demands of optimization. (2) \textbf{Robustness}: Controlling the risk of optimization derailment due to inaccuracies in neural operator predictions and gradients. To address these challenges, we propose a novel framework: (1) \textbf{Data-driven training}: We leverage data from full steps of traditional optimization algorithms and employ a specialized training method for neural operators. (2) \textbf{Enhanced derivative learning}: We introduce a \textit{Virtual-Fourier} layer to enhance derivative learning within the neural operator, a crucial aspect for gradient-based optimization. (3) \textbf{Hybrid optimization}: We implement a hybrid approach that integrates neural operators with numerical solvers, providing robust regularization for the optimization process. Our extensive experimental results demonstrate the effectiveness of our model in accurately learning operators and their derivatives. Furthermore, our hybrid optimization approach exhibits robust convergence.
Paperid:2252
Authors:Yang Xu · Vaneet Aggarwal
Abstract:
We address the problem of quantum reinforcement learning (QRL) under model-free settings with quantum oracle access to the Markov Decision Process (MDP). This paper introduces a Quantum Natural Policy Gradient (QNPG) algorithm, which replaces the random sampling used in classical Natural Policy Gradient (NPG) estimators with a deterministic gradient estimation approach, enabling seamless integration into quantum systems. While this modification introduces a bounded bias in the estimator, the bias decays exponentially with increasing truncation levels. This paper demonstrates that the proposed QNPG algorithm achieves a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-1.5})$ for queries to the quantum oracle, significantly improving the classical lower bound of $\tilde{\mathcal{O}}(\epsilon^{-2})$ for queries to the MDP.
Paperid:2253
Authors:Ofir Nabati · Guy Tennenholtz · Chih-wei Hsu · Moonkyung Ryu · Deepak Ramachandran · Yinlam Chow · Xiang Li · Craig Boutilier
Abstract:
We address the problem of interactive text-to-image (T2I) generation, designing a reinforcement learning (RL) agent which iteratively improves a set of generated images for a user through a sequence of prompt expansions. Using human raters, we create a novel dataset of sequential preferences, which we leverage together with large-scale open-source (non-sequential) datasets. We construct user-preference and user-choice models using an EM strategy and identify varying user preference types. We then leverage a large multimodal language model (LMM) and a value-based RL approach to suggest an adaptive and diverse slate of prompt expansions to the user. Our Preference Adaptive and Sequential Text-to-image Agent (PASTA) extends T2I models with adaptive multi-turn capabilities, fostering collaborative co-creation and addressing uncertainty or underspecification in a user's intent. We evaluate PASTA using human raters, showing significant improvement compared to baseline methods. We also open-source our sequential rater dataset and simulated user-rater interactions to support future research in user-centric multi-turn T2I systems.
Paperid:2254
Authors:Brian Zhang · Ioannis Anagnostides · Emanuel Tewolde · Ratip Emin Berker · Gabriele Farina · Vincent Conitzer · Tuomas Sandholm
Abstract:
Variational inequalities (VIs) encompass many fundamental problems in diverse areas ranging from engineering to economics and machine learning. However, their considerable expressivity comes at the cost of computational intractability. In this paper, we introduce and analyze a natural relaxation---which we refer to as expected variational inequalities (EVIs)---where the goal is to find a distribution that satisfies the VI constraint in expectation. By adapting recent techniques from game theory, we show that, unlike VIs, EVIs can be solved in polynomial time under general (nonmonotone) operators. EVIs capture the seminal notion of correlated equilibria, but enjoy a greater reach beyond games. We also employ our framework to capture and generalize several existing disparate results, including from settings such as smooth games, and games with coupled constraints or nonconcave utilities.
Paperid:2255
Authors:Zhanke Zhou · Xiao Feng · Zhaocheng Zhu · Jiangchao Yao · Sanmi Koyejo · Bo Han
Abstract:
While numerous benchmarks evaluate the reasoning abilities of large language models (LLMs) across various domains, most focus on passive reasoning tasks—where models are provided with all the necessary information to solve a problem. In contrast, active reasoning tasks, which require models to interact with external systems to gather missing information, remain largely underexplored. To bridge this gap, we introduce AR-Bench, a benchmark specifically designed to systematically assess the active reasoning capabilities of LLMs. AR-Bench features three distinct tasks: detective cases, situation puzzles, and guessing numbers, which evaluate active reasoning across commonsense, logical, and symbolic reasoning scenarios. Empirical results reveal that modern LLMs struggle significantly with active reasoning, often failing to consistently acquire or utilize relevant information. This exposes a clear disparity between their passive and active reasoning abilities. Further analysis shows that even advanced techniques, including tree-of-thought and post-training methods, provide only marginal improvements and fail to achieve performance levels suitable for practical applications. These findings underscore the urgent need to improve LLMs' active reasoning capabilities to better align with real-world problem-solving requirements.
Paperid:2256
Authors:Binghui Peng · Aviad Rubinstein
Abstract:
We revisit the problem of selecting $k$-out-of-$n$ elements with the goal of optimizing an objective function, and ask whether it can be solved approximately with *sublinear query complexity*. - For objective functions that are monotone *submodular*, [Li, Feldman, Kazemi, Karbasi, NeurIPS'22; Kuhnle, AISTATS'21] gave an $\Omega(n/k)$ query lower bound for approximating to within any constant factor. We strengthen their lower bound to a nearly tight $\tilde{\Omega}(n)$. This lower bound holds even for estimating the value of the optimal subset. - When the objective function is *additive*, i.e. $f(S) = \sum_{i \in S} w_i$ for unknown $w_i\geq 0$, we prove that {\em finding} an approximately optimal subset still requires near-linear query complexity, but we can {\em estimate} the value of the optimal subset in $\tilde{O}(n/k)$ queries, and that this is tight up to polylog factors.
Paperid:2257
Authors:Jian Liu · Haohan Weng · Biwen Lei · Xianghui Yang · Zibo Zhao · Zhuo Chen · Chunchao Guo · Song Guo · Tao Han
Abstract:
The next-coordinate prediction paradigm has emerged as the de facto standard in current auto-regressive mesh generation methods. Despite their effectiveness, there is no efficient measurement for the various tokenizers that serialize meshes into sequences. In this paper, we introduce a new metric, Per-Token-Mesh-Entropy (PTME), to evaluate existing mesh tokenizers theoretically without any training. Building upon PTME, we propose a plug-and-play tokenization technique called coordinate merging. It further improves the compression ratios of existing tokenizers by rearranging and merging the most frequent patterns of coordinates. Through experiments on various tokenization methods like MeshXL, MeshAnything V2, and Edgerunner, we further validate the performance of our method. We hope that the proposed PTME and coordinate merging can enhance existing mesh tokenizers and guide the further development of native mesh generation.
Paperid:2258
Authors:Han Zhou · dr. Jordy Van Landeghem · Teodora Popordanoska · Matthew B Blaschko
Abstract:
The selective classifier (SC) has been proposed for rank-based uncertainty thresholding, which could have applications in safety-critical areas such as medical diagnostics, autonomous driving, and the justice system. The Area Under the Risk-Coverage Curve (AURC) has emerged as the foremost evaluation metric for assessing the performance of SC systems. In this work, we present a formal statistical formulation of the population AURC, presenting an equivalent expression that can be interpreted as a reweighted risk function. Through Monte Carlo methods, we derive empirical AURC plug-in estimators for finite-sample scenarios. The weight estimators associated with these plug-in estimators are shown to be consistent, with low bias and tightly bounded mean squared error (MSE). The plug-in estimators are proven to converge at a rate of $\mathcal{O}(\sqrt{\ln(n)/n})$, demonstrating statistical consistency. We empirically validate the effectiveness of our estimators through experiments across multiple datasets, model architectures, and confidence score functions (CSFs), demonstrating consistency and effectiveness in fine-tuning AURC performance.
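As a point of reference for the metric discussed in this abstract, the plain (unweighted) plug-in AURC can be computed by sorting predictions by confidence and averaging the cumulative risk over all coverage levels. The sketch below is that generic textbook estimator with made-up data; the function name and inputs are our illustrative assumptions, not the paper's reweighted estimator.

```python
import numpy as np

def empirical_aurc(confidence, loss):
    """Plain plug-in AURC: sort samples by confidence (descending),
    compute the selective risk at each coverage level 1/n, 2/n, ..., 1,
    and average. A generic sketch, not the paper's estimator."""
    order = np.argsort(-confidence)
    sorted_loss = loss[order]
    # Selective risk at coverage k/n = mean loss of the k most confident.
    cum_risk = np.cumsum(sorted_loss) / np.arange(1, len(loss) + 1)
    return cum_risk.mean()

# Hypothetical scores: the two most confident predictions are correct.
conf_scores = np.array([0.9, 0.8, 0.7, 0.6])
losses = np.array([0.0, 0.0, 1.0, 1.0])
print(empirical_aurc(conf_scores, losses))  # (0 + 0 + 1/3 + 1/2) / 4
```

A well-ranked confidence function pushes the losses to low-coverage regions, shrinking this area.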
Paperid:2259
Authors:Piyush Lalitkumar Tiwary · Kinjawl Bhattacharyya · Prathosh AP
Abstract:
Medical image segmentation models often struggle to generalize across different domains due to various reasons. Domain Generalization (DG) methods overcome this either through representation learning or data augmentation (DA). While representation learning methods seek domain-invariant features, they often rely on ad-hoc techniques and lack formal guarantees. DA methods, which enrich model representations through synthetic samples, have shown comparable or superior performance to representation learning approaches. We propose LangDAug, a novel Langevin Data Augmentation for multi-source domain generalization in 2D medical image segmentation. LangDAug leverages Energy-Based Models (EBMs) trained via contrastive divergence to traverse between source domains, generating intermediate samples through Langevin dynamics. Theoretical analysis shows that LangDAug induces a regularization effect, and for GLMs, it upper-bounds the Rademacher complexity by the intrinsic dimensionality of the data manifold. Through extensive experiments on Fundus segmentation and 2D MRI prostate segmentation benchmarks, we show that LangDAug outperforms state-of-the-art domain generalization methods and effectively complements existing domain-randomization approaches.
Paperid:2260
Authors:Jun-Qi Guo · Meng-Zhang Qian · Wei Gao · Zhi-Hua Zhou
Abstract:
Ensemble methods have attracted much attention in adversarial robustness learning during the past years, and diversity has always been one of the most crucial factors in the design of adversarial ensemble methods. This work focuses on the fundamental problems: how to define diversity for adversarial ensembles, and how to correlate it with algorithmic performance. We first show that it is an NP-hard problem to precisely calculate the diversity of two networks in adversarial ensemble learning, which makes it different from traditional diversity analysis. We present the first diversity decomposition in adversarial ensemble learning, following previous works with the first-order approximation. More precisely, the adversarial ensemble loss can be decomposed into the average of individual adversarial losses, prediction diversity, gradient diversity, and cross diversity. Hence, it is not sufficient to consider only gradient diversity in the characterization of diversity, as in previous adversarial ensemble methods. We present a similar diversity decomposition for classification with cross-entropy loss for neural networks. Based on the theoretical analysis, we develop a new ensemble method via orthogonal adversarial predictions to improve gradient and cross diversity simultaneously. We finally validate the effectiveness of our method empirically.
Paperid:2261
Authors:Chuheng Zhang · Wei Shen · Li Zhao · Xuyun Zhang · Xiaolong Xu · Wanchun Dou · Jiang Bian
Abstract:
While direct policy optimization methods exist, pioneering LLMs are fine-tuned with reinforcement learning from human feedback (RLHF) to generate better responses under the supervision of a reward model learned from preference data. One major challenge of RLHF is the inaccuracy of the intermediate reward model, especially in tasks that require complex reasoning for the reward model to score a response. We find that the reliability of the reward model varies across responses assigned different rewards. This motivates us to filter the samples whose rewards may be unreliable to improve the signal-to-noise ratio during policy learning, resulting in Policy Filtration for Proximal Policy Optimization (PF-PPO). To choose a proper policy filtering strategy, we use the coefficient of determination ($R^2$) between the rewards and actual scores on filtered samples as the metric to help us find promising strategies, since it measures how well the rewards filtered by PF-PPO indicate real performance. We provide extensive experiments to validate the effectiveness of PF-PPO in code generation and math reasoning tasks. In code generation, PF-PPO achieves state-of-the-art performance among 7-billion-parameter models on HumanEval (+7.9%), MBPP (+0.7%), and LeetCode Contest (+10.0%), a more challenging benchmark created by us. In math reasoning, PF-PPO yields performance increases using different reward models and benchmarks (Ape210K and CMATH).
Paperid:2262
Authors:Andrew Draganov · Sharvaree Vadgama · Sebastian Damrich · Jan Böhm · Lucas Maes · Dmitry Kobak · Erik Bekkers
Abstract:
Self-supervised learning (SSL) allows training data representations without a supervised signal and has become an important paradigm in machine learning. Most SSL methods employ the cosine similarity between embedding vectors and hence effectively embed data on a hypersphere. While this seemingly implies that embedding norms cannot play any role in SSL, a few recent works have suggested that embedding norms have properties related to network convergence and confidence. In this paper, we resolve this apparent contradiction and systematically establish the embedding norm's role in SSL training. Using theoretical analysis, simulations, and experiments, we show that embedding norms (i) govern SSL convergence rates and (ii) encode network confidence, with smaller norms corresponding to unexpected samples. Additionally, we show that manipulating embedding norms can have large effects on convergence speed. Our findings demonstrate that SSL embedding norms are integral to understanding and optimizing network behavior.
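The apparent contradiction this abstract resolves can be seen in a two-line check: the cosine similarity used by most SSL losses is invariant to rescaling an embedding, so the norm carries information the loss itself never constrains. A toy illustration (vectors and names are ours, not the paper's):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity, the quantity most SSL losses are built on."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([3.0, 4.0])
v = np.array([1.0, 0.0])

# Scale-invariance: the SSL loss cannot "see" the norm of u ...
assert np.isclose(cosine(u, v), cosine(10.0 * u, v))

# ... yet the norm is still a well-defined per-sample quantity,
# which the paper links to convergence speed and confidence.
print(np.linalg.norm(u))  # 5.0
```

This is why the norm can encode side information (e.g. sample familiarity) without ever being directly optimized.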
Paperid:2263
Authors:Han Shao · Shuo Xie · Kunhe Yang
Abstract:
Strategic classification addresses a learning problem where a decision-maker implements a classifier over agents who may manipulate their features in order to receive favorable predictions. In the standard model of online strategic classification, in each round, the decision-maker implements and publicly reveals a classifier, after which agents perfectly best respond based on this knowledge. However, in practice, whether to disclose the classifier is often debated---some decision-makers believe that hiding the classifier can prevent misclassification errors caused by manipulation. In this paper, we formally examine how limiting the agents' access to the current classifier affects the decision-maker's performance. Specifically, we consider an extended online strategic classification setting where agents lack direct knowledge about the current classifier and instead manipulate based on a weighted average of historically implemented classifiers. This model captures a wide range of agent behaviors with varying degrees of memory of past classifiers. Our main result shows that, contrary to the conventional wisdom, restricting agents' access to the current classifier can *significantly increase* the decision-maker's mistakes compared to the full-knowledge setting. In particular, we prove that the decision-maker incurs $\Omega(\max\{k_{\text{in}}, 1/\log(\frac{1}{\gamma})\})$ times more mistakes, where $k_{\text{in}}$ is the maximum in-degree of the manipulation graph (representing how many distinct feature vectors can be manipulated to appear as a single one), and $\gamma$ is the discount factor indicating agents' memory of past classifiers. Our results demonstrate how withholding access to the classifier can backfire and degrade the decision-maker's performance in online strategic classification.
Paperid:2264
Authors:Lan-Cuong Nguyen · Quan Nguyen-Tri · Bang Khanh · Dung D. Le · Long Tran-Thanh · Khoat Than
Abstract:
Few-shot image classification remains challenging due to the scarcity of labeled training examples. Augmenting them with synthetic data has emerged as a promising approach to alleviate this issue, but models trained on synthetic samples often face performance degradation due to the inherent gap between real and synthetic distributions. To address this limitation, we develop a theoretical framework that quantifies the impact of such distribution discrepancies on supervised learning, specifically in the context of image classification. Building upon this framework, we propose a novel theoretically grounded algorithm that integrates prototype learning to optimize both data partitioning and model training, effectively bridging the gap between real few-shot data and synthetic data. Extensive experimental results show that our approach outperforms state-of-the-art methods across multiple datasets.
Paperid:2265
Authors:Kanghee Park · Timothy Zhou · Loris D'Antoni
Abstract:
Large Language Models (LLMs) are often asked to generate structured outputs that obey precise syntactic rules, such as code snippets or formatted data. Grammar-constrained decoding (GCD) can guarantee that LLM outputs match such rules by masking out tokens that will provably lead to outputs that do not belong to a specified context-free grammar (CFG). To guarantee soundness, GCD algorithms have to compute how a given LLM subword tokenizer can ``align'' with the tokens used by a given context-free grammar and compute token masks based on this information. Doing so efficiently is challenging, and existing GCD algorithms require tens of minutes to preprocess common grammars. We present a new GCD algorithm, together with an implementation, that offers 17.71x faster offline preprocessing than existing approaches while preserving state-of-the-art efficiency in online mask computation.
Paperid:2266
Authors:Junyi Liao · Zihan Zhu · Ethan Fang · Zhuoran Yang · Vahid Tarokh
Abstract:
Estimating the unknown reward functions driving agents' behavior is a central challenge in inverse games and reinforcement learning. This paper introduces a unified framework for reward function recovery in two-player zero-sum matrix games and Markov games with entropy regularization. Given observed player strategies and actions, we aim to reconstruct the underlying reward functions. This task is challenging due to the inherent ambiguity of inverse problems, the non-uniqueness of feasible rewards, and limited observational data coverage. To address these challenges, we establish reward function identifiability using the quantal response equilibrium (QRE) under linear assumptions. Building on this theoretical foundation, we propose an algorithm to learn reward functions from observed actions, designed to capture all plausible reward parameters by constructing confidence sets. Our algorithm works in both static and dynamic settings and is adaptable to incorporate other methods, such as Maximum Likelihood Estimation (MLE). We provide strong theoretical guarantees for the reliability and sample efficiency of our algorithm. Empirical results demonstrate the framework's effectiveness in accurately recovering reward functions across various scenarios, offering new insights into decision-making in competitive environments.
Paperid:2267
Authors:Gül Sena Altıntaş · Devin Kwok · Colin Raffel · David Rolnick
Abstract:
Neural network training is inherently sensitive to initialization and the randomness induced by stochastic gradient descent. However, it is unclear to what extent such effects lead to meaningfully different networks, in terms of either the models' weights or the underlying functions that were learned. In this work, we show that during an initial ``chaotic'' phase, an extremely small perturbation reliably causes otherwise identical training trajectories to diverge, an effect that diminishes rapidly over training time. We quantify this divergence through (i) $L^2$ distance between parameters, (ii) the similarity of parameter vectors as measured by permutation alignment, and, most importantly, (iii) the loss barrier when interpolating between networks, revealing how perturbations across different hyperparameter or fine-tuning settings drive training trajectories toward distinct loss minima. Our findings provide insights into neural network training stability, with practical implications for fine-tuning and model merging techniques.
Paperid:2268
Authors:Joonkyu Kim · Yejin Kim · Jy-yong Sohn
Abstract:
In continual learning scenarios, catastrophic forgetting of previously learned tasks is a critical issue, making it essential to effectively measure such forgetting. Recently, there has been growing interest in representation forgetting, the forgetting measured at the hidden layer. In this paper, we provide the first theoretical analysis of representation forgetting and use this analysis to better understand the behavior of continual learning. First, we introduce a new metric called representation discrepancy, which measures the difference between the representation spaces constructed by two snapshots of a model trained through continual learning. We demonstrate that our proposed metric serves as an effective surrogate for representation forgetting while remaining analytically tractable. Second, through mathematical analysis of our metric, we derive several key findings about the dynamics of representation forgetting: forgetting occurs more rapidly and to a higher degree as the layer index increases, while increasing the width of the network slows down the forgetting process. Third, we support our theoretical findings through experiments on real image datasets, including Split-CIFAR100 and ImageNet-1K.
Paperid:2269
Authors:Yujia Yin · Tianyi Qu · Zihao Wang · Yifan Chen
Abstract:
Through recognizing causal subgraphs, causal graph learning (CGL) has risen to be a promising approach for improving the generalizability of graph neural networks under out-of-distribution (OOD) scenarios. However, the empirical successes of CGL techniques are mostly exemplified in classification settings, while regression tasks, a more challenging setting in graph learning, are overlooked. We thus devote this work to tackling causal graph regression (CGR); to this end we reshape the processing of confounding effects in existing CGL studies, which mainly deal with classification. Specifically, we reflect on the predictive power of confounders in graph-level regression, and generalize classification-specific causal intervention techniques to regression through a lens of contrastive learning. Extensive experiments on graph OOD benchmarks validate the efficacy of our proposals for CGR. The model implementation and the code are provided in the supplementary materials and scheduled to be open-sourced upon publication.
Paperid:2270
Authors:Yuanhe Zhang · Fanghui Liu · Yudong Chen
Abstract:
This paper studies how to improve the performance of Low-Rank Adaptation (LoRA) as guided by our theoretical analysis. Our first set of theoretical results shows that for random initialization and linear models, (i) LoRA will align with a certain singular subspace of the one-step gradient of full fine-tuning; (ii) preconditioners improve convergence in the high-rank case. These insights motivate us to focus on preconditioned LoRA using a specific spectral initialization strategy for aligning with certain subspaces. For both linear and nonlinear models, we prove that alignment and generalization guarantees can be directly achieved at initialization, and the subsequent linear convergence can also be established. Our analysis leads to the LoRA-One algorithm (using One-step gradient and preconditioning), a theoretically grounded algorithm that achieves significant empirical improvement over vanilla LoRA and its variants on several benchmarks. Our theoretical analysis, based on decoupling the learning dynamics and characterizing how spectral initialization contributes to feature learning, may be of independent interest for understanding matrix sensing and deep learning theory. The source code can be found in the supplemental materials.
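For intuition only, spectral initialization from a one-step gradient can be sketched on a toy linear model: take the gradient of full fine-tuning at the starting point, and initialize the LoRA factors from its top singular subspace. The notation (W0, B, A) and the square-root scaling below are our illustrative assumptions, not the LoRA-One implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2

# Toy linear model: loss = ||(W0 + B @ A) X - Y||^2 / 2. The gradient of
# full fine-tuning evaluated at B @ A = 0 is G = (W0 X - Y) X^T.
W0 = rng.normal(size=(d_out, d_in))
X = rng.normal(size=(d_in, 32))
Y = rng.normal(size=(d_out, 32))
G = (W0 @ X - Y) @ X.T

# Spectral initialization: align the rank-r factors B (d_out x r) and
# A (r x d_in) with the top-r singular subspace of the one-step gradient.
U, S, Vt = np.linalg.svd(G, full_matrices=False)
B = U[:, :r] * np.sqrt(S[:r])
A = np.sqrt(S[:r])[:, None] * Vt[:r]

# By construction, B @ A is the best rank-r approximation of G.
err = np.linalg.norm(G - B @ A)
best = np.linalg.norm(G - U[:, :r] @ np.diag(S[:r]) @ Vt[:r])
print(np.isclose(err, best))
```

Starting the factors in this subspace is what yields the alignment-at-initialization property the abstract refers to.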
Paperid:2271
Authors:Felipe Areces · Christopher Mohri · Tatsunori Hashimoto · John Duchi
Abstract:
We introduce a family of algorithms for online conformal prediction with coverage guarantees for both adversarial and stochastic data. In the adversarial setting, we establish the standard guarantee: over time, a pre-specified target fraction of confidence sets cover the ground truth. For stochastic data, we provide a guarantee at every time instead of just on average over time: the probability that a confidence set covers the ground truth—conditioned on past observations—converges to a pre-specified target when the conditional quantiles of the errors are a linear function of past data. Complementary to our theory, our experiments spanning over $15$ datasets suggest that the performance improvement of our methods over baselines grows with the magnitude of the data's dependence, even when baselines are tuned on the test set. We put these findings to the test by pre-registering an experiment for electricity demand forecasting in Texas, where our algorithms achieve over a $10$\% reduction in confidence set sizes, a more than $30$\% improvement in quantile and absolute losses with respect to the observed errors, and significant outcomes on all $78$ out of $78$ pre-registered hypotheses.
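For context, the simplest baseline in this online-conformal literature is a quantile-tracking update that nudges the conformal threshold up after each miss and down after each cover, so long-run miscoverage tracks a target rate. The sketch below shows that classic baseline (in the spirit of adaptive conformal inference) on synthetic scores; it is a reference point, not the paper's proposed algorithms.

```python
import numpy as np

def quantile_track(scores, alpha=0.1, lr=0.1):
    """Online quantile tracking for conformal thresholds: after each
    round, move the threshold q up by lr*(1-alpha) on a miss and down
    by lr*alpha on a cover, so the long-run miss rate approaches alpha.
    A generic baseline, not the paper's algorithms."""
    q, errs = 0.0, []
    for s in scores:
        err = float(s > q)        # 1 if the conformal set failed to cover
        errs.append(err)
        q += lr * (err - alpha)   # raise q after a miss, shrink otherwise
    return q, float(np.mean(errs))

rng = np.random.default_rng(1)
scores = rng.exponential(size=5000)   # synthetic nonconformity scores
q, miscoverage = quantile_track(scores, alpha=0.1)
print(q, miscoverage)  # q near the 90th percentile, miscoverage near 0.1
```

The update needs no distributional assumptions; the paper's contribution is the stronger every-time (rather than on-average) guarantee under stochastic data.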
Paperid:2272
Authors:Leshem Choshen · Yang Zhang · Jacob Andreas
Abstract:
Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets. This provides an efficient way for practitioners and researchers alike to compare pre-training decisions involving optimizers, datasets, and model architectures. Despite the widespread use of scaling laws to model the dynamics of language model training, there has been little work on understanding how to best estimate and interpret them. We collect (and release) a large-scale dataset containing losses and downstream evaluations for 485 previously published pretrained models. We use these to estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families. We find that fitting scaling laws to intermediate checkpoints of training runs (and not just their final losses) substantially improves accuracy, and that—all else equal—estimates of performance are generally most accurate when derived from other models of similar sizes. However, because there is a significant degree of variability across model seeds, training multiple small models is sometimes more useful than training a single large one. Moreover, while different model families differ in scaling behavior, they are often similar enough that a target model's behavior can be predicted from a single model with the same architecture, along with scaling parameter estimates derived from other model families.
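To make the estimation task concrete: fitting the basic power-law form $L(N) = a N^{-b}$ reduces to linear regression in log-log space. The data below are synthetic and this two-parameter form is a simplification of the saturating laws (with an irreducible-loss term) that such studies typically fit.

```python
import numpy as np

# Hypothetical loss measurements at several model sizes N (parameters).
N = np.array([1e6, 1e7, 1e8, 1e9])
L = 5.0 * N ** -0.1   # synthetic data: true a = 5.0, b = 0.1, no noise

# log L = log a - b * log N, so a straight-line fit recovers (a, b).
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
a_hat, b_hat = np.exp(intercept), -slope

# Extrapolate to a larger target model, the core use of a scaling law.
L_pred = a_hat * (1e10) ** -b_hat
print(a_hat, b_hat, L_pred)
```

With noisy real measurements, the fit quality and the choice of checkpoints to include are exactly the methodological questions the paper studies.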
Paperid:2273
Authors:Kang-Jun Liu · Masanori Suganuma · Takayuki Okatani
Abstract:
We present a novel self-supervised feature learning method using Vision Transformers (ViT) as the backbone, specifically designed for object detection and instance segmentation. Our approach addresses the challenge of extracting features that capture both class and positional information, which are crucial for these tasks. The method introduces two key components: (1) a positional encoding tied to the cropping process in contrastive learning, which utilizes a novel vector field representation for positional embeddings; and (2) masking and prediction, similar to conventional Masked Image Modeling (MIM), applied in parallel to both content and positional embeddings of image patches. These components enable the effective learning of intertwined content and positional features. We evaluate our method against state-of-the-art approaches, pre-training on ImageNet-1K and fine-tuning on downstream tasks. Our method outperforms the state-of-the-art SSL methods on the COCO object detection benchmark, achieving significant improvements with fewer pre-training epochs. These results suggest that better integration of positional information into self-supervised learning can improve performance on dense prediction tasks.
Paperid:2274
Authors:Akhil Jalan · Yassir Jedra · Arya Mazumdar · Soumendu Sundar Mukherjee · Purnamrita Sarkar
Abstract:
We study transfer learning for matrix completion in a Missing Not-at-Random (MNAR) setting that is motivated by biological problems. The target matrix $Q$ has entire rows and columns missing, making estimation impossible without side information. To address this, we use a noisy and incomplete source matrix $P$, which relates to $Q$ via a feature shift in latent space. We consider both the *active* and *passive* sampling of rows and columns. We establish minimax lower bounds for entrywise estimation error in each setting. Our computationally efficient estimation framework achieves this lower bound for the active setting, which leverages the source data to query the most informative rows and columns of $Q$. This avoids the need for *incoherence* assumptions required for rate optimality in the passive sampling setting. We demonstrate the effectiveness of our approach through comparisons with existing algorithms on real-world biological datasets.
Paperid:2275
Authors:Jake Tuero · Michael Buro · Levi Lelis
Abstract:
Policy tree search is a family of tree search algorithms that use a policy to guide the search. The novelty of these algorithms is that they provide guarantees on the number of expansions required to solve a given problem that are related to the quality of the policy. While these algorithms have shown promising results, the process by which they are trained requires complete solution trajectories to train the policy. Search trajectories are obtained during a trial-and-error search process. When the training problem instances are hard, learning can be prohibitively costly when starting from a randomly initialized policy. As a result, search samples are wasted in failed attempts to solve these hard instances. In this paper, we introduce a novel method for learning subgoal-based policies for policy tree search algorithms. The subgoals and policies conditioned on subgoals are learned from the trees the search expands while attempting to solve problems, including the search trees of failed attempts. We empirically show that our policy formulation and training method improve the sample efficiency of learning a policy and heuristic function in this online setting.
Paperid:2276
Authors:Yuyang Hu · Albert Peng · Weijie Gan · Peyman Milanfar · Mauricio Delbracio · Ulugbek Kamilov
Abstract:
Deep neural networks trained as image denoisers are widely used as priors for solving imaging inverse problems. We introduce Stochastic deep Restoration Priors (ShaRP), a novel framework that stochastically leverages an ensemble of deep restoration models beyond denoisers to regularize inverse problems. By using generalized restoration models trained on a broad range of degradations beyond simple Gaussian noise, ShaRP effectively addresses structured artifacts and enables self-supervised training without fully sampled data. We prove that ShaRP minimizes an objective function involving a regularizer derived from the score functions of minimum mean square error (MMSE) restoration operators. We also provide theoretical guarantees for learning restoration operators from incomplete measurements. ShaRP achieves state-of-the-art performance on tasks such as magnetic resonance imaging reconstruction and single-image super-resolution, surpassing both denoiser- and diffusion-model-based methods without requiring retraining.
Paperid:2277
Authors:Wenwen He · Wenke Huang · Bin Yang · ShuKan Liu · Mang Ye
Abstract:
Federated Learning (FL) is a distributed learning paradigm that facilitates collaborative user updates while ensuring the privacy of individual data. Despite its advantages, FL is susceptible to backdoor attacks, wherein malicious participants can inject harmful updates into the target model, resulting in degraded performance of the global model on specific inputs. Existing defense mechanisms are predicated on two assumptions: the isolated nature of individual behaviors and passive purification through well-designed defense rules. However, malicious clients can adapt their adversarial strategies to circumvent these established defense protocols. To address the individual isolation protocol, we propose a marginal collaboration defense mechanism, called SPMC, which leverages intrinsic consistency across clients to determine inter-client margin contributions, adaptively reducing the impact of changes in the number of attackers. As for the proxy-dependent protocol, we self-purify the parameter distribution of potential malicious actors, enhancing the performance of the model in the direction of benign updates. Specifically, we assert that even if an attacker is reassigned within the aggregation model, they can still exert a local influence on the global model through harmful gradients. In response, we locally update the gradients to align with the marginal knowledge, thereby mitigating the risk of local poisoning. With the above two modules, our approach enhances the flexibility of defense mechanisms on the server side compared to the client side, thereby improving the robustness of the federated system. Experimental results across various image classification tasks demonstrate the effectiveness of SPMC.
Paperid:2278
Authors:Xuandong Zhao · Will Cai · David Huang · Tianneng Shi · Licong Lin · Song Mei · Dawn Song
Abstract:
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. Direct preference optimization (DPO), a widely deployed alignment method, exhibits limitations in both experimental and theoretical contexts as its loss function proves suboptimal for refusal learning. Through gradient-based analysis, we identify these shortcomings and propose an improved safety alignment method that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge. This approach significantly increases LLM robustness against a wide range of jailbreak attacks, including prefilling, suffix, and multi-turn attacks across both in-distribution and out-of-distribution scenarios. Furthermore, we introduce a method to emphasize critical refusal tokens by incorporating a reward-based token-level weighting mechanism for refusal learning, which further improves the robustness against adversarial exploits. Our research also suggests that robustness to jailbreak attacks is correlated with token distribution shifts in the training process and internal representations of refusal and harmful tokens, offering valuable directions for future research in LLM safety alignment.
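As a reference point for the DPO shortcomings discussed above, here is a minimal sketch of the standard DPO loss (the baseline the abstract critiques, not the paper's improved variant); the response log-probabilities are illustrative numbers:

```python
import numpy as np

# Standard DPO loss: negative log-sigmoid of the beta-scaled margin between
# the policy/reference log-probability gaps of a preferred response (here, a
# refusal) and a dispreferred (harmful) response. All values are illustrative.
def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)

# A model that prefers the refusal (first call) incurs a lower loss than one
# that prefers the harmful response (second call).
print(dpo_loss(-5.0, -9.0, -6.0, -8.0) < dpo_loss(-9.0, -5.0, -8.0, -6.0))
```

The paper's gradient-based analysis operates on this objective's two log-probability terms, which is what makes the disentangling into refusal training and unlearning possible.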
Paperid:2279
Authors:Velibor Bojkovic · Xiaofeng Wu · Bin Gu
Abstract:
Spiking Neural Networks (SNNs) offer a more energy-efficient alternative to Artificial Neural Networks (ANNs) by mimicking biological neural principles, establishing them as a promising approach to mitigate the increasing energy demands of large-scale neural models. However, fully harnessing the capabilities of SNNs remains challenging due to their discrete signal processing and temporal dynamics. ANN-SNN conversion has emerged as a practical approach, enabling SNNs to achieve competitive performance on complex machine learning tasks. In this work, we identify a phenomenon in the ANN-SNN conversion framework, termed temporal misalignment, in which random spike rearrangement across SNN layers leads to performance improvements. Based on this observation, we introduce biologically plausible two-phase probabilistic (TPP) spiking neurons, further enhancing the conversion process. We demonstrate the advantages of our proposed method both theoretically and empirically through comprehensive experiments on CIFAR-10/100, CIFAR10-DVS, and ImageNet across a variety of architectures, achieving state-of-the-art results.
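To see why spike rearrangement across time can be harmless, consider this toy sketch (our construction, not the paper's code): randomly permuting the time axis of a spike train leaves each neuron's firing rate, and hence rate-coded information, unchanged:

```python
import numpy as np

# Toy "temporal misalignment": randomly rearrange spikes across time steps.
# A permutation of the time axis preserves each neuron's firing rate, so
# rate-coded information survives the rearrangement.
rng = np.random.default_rng(0)
T, N = 16, 8                                      # time steps, neurons
spikes = (rng.random((T, N)) < 0.3).astype(int)   # binary spike trains

perm = rng.permutation(T)                         # random time rearrangement
shuffled = spikes[perm]

rates_before = spikes.mean(axis=0)
rates_after = shuffled.mean(axis=0)
print(np.allclose(rates_before, rates_after))     # True: rates unchanged
```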
Paperid:2280
Authors:Zhuo He · Shuang Li · Wenze Song · Longhui Yuan · Jian Liang · Han Li · Kun Gai
Abstract:
Endowing deep models with the ability to generalize in dynamic scenarios is of vital significance for real-world deployment, given the continuous and complex changes in data distribution. Recently, evolving domain generalization (EDG) has emerged to address distribution shifts over time, aiming to capture evolving patterns for improved model generalization. However, existing EDG methods may suffer from spurious correlations by modeling only the dependence between data and targets across domains, creating a shortcut between task-irrelevant factors and the target, which hinders generalization. To this end, we design a time-aware structural causal model (SCM) that incorporates dynamic causal factors and causal mechanism drifts, and propose Static-DYNamic Causal Representation Learning (SYNC), an approach that effectively learns time-aware causal representations. Specifically, it integrates specially designed information-theoretic objectives into a sequential VAE framework which captures evolving patterns, and produces the desired representations by preserving intra-class compactness of causal factors both across and within domains. Moreover, we theoretically show that our method can yield the optimal causal predictor for each time domain. Results on both synthetic and real-world datasets show that SYNC achieves superior temporal generalization performance.
Paperid:2281
Authors:Freya Behrens · Luca Biggio · Lenka Zdeborova
Abstract:
Next to scaling considerations, architectural design choices profoundly shape the solution space of transformers. In this work, we analyze the solutions simple transformer blocks implement when tackling the histogram task: counting items in sequences. Despite its simplicity, this task reveals a complex interplay between predictive performance, vocabulary and embedding sizes, token-mixing mechanisms, and feed-forward layer capacity. We identify two theoretical counting strategies transformers adopt, relation-based and inventory-based counting, each defining distinct learning regimes for the task. These strategies dictate how functionality is distributed between attention and feed-forward layers. We further show that adding softmax and beginning-of-sequence tokens allows for more robustness when embedding dimensions are comparatively small. Empirical introspection of trained models closely confirms both the learning regimes of the various architectures and the formation of these strategies during training. We demonstrate how a basic task that requires only aggregation and selection is significantly impacted by minor design changes.
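A toy sketch (our illustration, with hypothetical token values) of the relation-based strategy: if each position attends uniformly to the tokens equal to its own, the histogram count is simply the row sum of a token-equality "attention" matrix:

```python
import numpy as np

# Relation-based counting in miniature: the count of each item equals the
# number of positions whose token matches the query position's token.
seq = np.array([2, 7, 2, 2, 5, 7])                    # hypothetical tokens
match = (seq[:, None] == seq[None, :]).astype(float)  # equality "attention"
counts = match.sum(axis=1)                            # histogram per position
print(counts)  # [3. 2. 3. 3. 1. 2.]
```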
Paperid:2282
Authors:Tianle Li · Wei-Lin Chiang · Evan Frick · Lisa Dunlap · Tianhao Wu · Banghua Zhu · Joseph E Gonzalez · Ion Stoica
Abstract:
The rapid evolution of Large Language Models (LLMs) has outpaced the development of model evaluation, highlighting the need for continuous curation of new, challenging benchmarks. However, manual curation of high-quality, human-aligned benchmarks is expensive and time-consuming. To address this, we introduce Bench-O-Matic, an automated pipeline that leverages LLMs to curate high-quality, open-ended prompts from large, crowd-sourced datasets, enabling continuous benchmark updates without a human in the loop. We apply Bench-O-Matic to datasets such as Chatbot Arena and WildChat-1M, extracting challenging prompts and utilizing LLM-as-a-Judge for automatic model evaluation. To validate benchmark quality, we propose new metrics to measure a benchmark's alignment with human preferences and ability to separate models. We release Eval-O-Matic, a benchmark consisting of 500 challenging prompts curated by Bench-O-Matic. Eval-O-Matic provides 3x higher separation of model performances compared to MT-Bench and achieves 98.6% correlation with human preference rankings, all at a cost of $20. Our work sets a new framework for the scalable curation of automated benchmarks from extensive data.
Paperid:2283
Authors:Shuqi Gu · Chuyue Li · Baoyu Jing · Kan Ren
Abstract:
Time series synthesis has become a foundational task in modern society, underpinning decision-making across various scenarios. Recent approaches primarily generate time series from structured conditions, such as attribute-based metadata. However, these methods struggle to capture the full complexity of time series, as the predefined structures often fail to reflect intricate temporal dynamics or other nuanced characteristics. Moreover, constructing structured metadata requires expert knowledge, making large-scale data labeling costly and impractical. In this paper, we introduce VerbalTS, a novel framework for generating time series from unstructured textual descriptions, offering a more expressive and flexible solution to time series synthesis. To bridge the gap between unstructured text and time series data, VerbalTS employs a multi-focal alignment and generation framework, effectively modeling their complex relationships. Experiments on two synthetic and four real-world datasets demonstrate that VerbalTS outperforms existing methods in both generation quality and semantic alignment with textual conditions.
Paperid:2284
Authors:Xujie Song · Liangfa Chen · Tong Liu · Wenxuan Wang · Yinuo Wang · Shentao Qin · Yinsong Ma · Jingliang Duan · Shengbo Li
Abstract:
Deep reinforcement learning (RL) is an effective approach for decision-making and control tasks. However, RL-trained policy networks often suffer from the action fluctuation problem in real-world applications, resulting in severe actuator wear, safety risks, and performance degradation. In this paper, we identify the two fundamental causes of action fluctuation: observation noise and policy non-smoothness. We then propose a novel policy network, LipsNet++, integrating a Fourier filter layer and a Lipschitz controller layer to mitigate these two causes in a decoupled manner. The filter layer incorporates a trainable filter matrix that automatically extracts important frequencies while suppressing noise frequencies in the observations. The controller layer introduces a Jacobian regularization technique to achieve a low Lipschitz constant, ensuring smooth fitting of the policy function. These two layers function analogously to the filter and controller in classical control theory, suggesting that filtering and control capabilities can be seamlessly integrated into a single policy network. Both simulated and real-world experiments demonstrate that LipsNet++ achieves state-of-the-art noise robustness and action smoothness.
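For intuition about the Lipschitz controller layer, consider a toy linear policy (our sketch, not the paper's implementation): its Lipschitz constant is the spectral norm of its weight matrix, and a subgradient step on that norm shrinks it, smoothing the policy's response to observation noise:

```python
import numpy as np

# Toy illustration: for a linear policy a = W @ s, the Lipschitz constant is
# the spectral norm (largest singular value) of W. Penalising it directly
# reduces the policy's sensitivity to perturbations of the observation.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))          # 4-dim observation -> 2-dim action

def lipschitz(W):
    return np.linalg.norm(W, 2)      # ord=2 gives the largest singular value

# Subgradient of the spectral norm is u1 @ v1^T (top singular vectors);
# a small step along its negative shrinks the top singular value.
U, s, Vt = np.linalg.svd(W)
W_new = W - 0.1 * np.outer(U[:, 0], Vt[0])

print(lipschitz(W_new) < lipschitz(W))
```

LipsNet++ applies the analogous idea to a deep network by regularizing the policy's Jacobian rather than a single weight matrix.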
Paperid:2285
Authors:Deepak Ravikumar · Efstathia Soufleri · Abolfazl Hashemi · Kaushik Roy
Abstract:
Deep learning has become the de facto approach in nearly all learning tasks. It has been observed that deep models tend to memorize and sometimes overfit data, which can lead to compromises in performance, privacy, and other critical metrics. In this paper, we explore the theoretical foundations that connect memorization to sample loss, focusing on learning dynamics to understand what and how deep models memorize. To this end, we introduce a novel proxy for memorization: Cumulative Sample Loss (CSL). CSL represents the accumulated loss of a sample throughout the training process. CSL exhibits remarkable similarity to stability-based memorization, as evidenced by considerably high cosine similarity scores. We delve into the theory behind these results, demonstrating that low CSL leads to nontrivial bounds on the extent of stability-based memorization and learning time. The proposed proxy, CSL, is four orders of magnitude less computationally expensive than the stability-based method and can be obtained with zero additional overhead during training. We demonstrate the practical utility of the proposed proxy in identifying mislabeled samples and detecting duplicates, where our metric achieves state-of-the-art performance.
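The proxy itself is cheap to compute; a toy sketch (our construction, with simulated per-sample losses) shows how a hard-to-fit, e.g. mislabeled, sample accumulates the largest CSL:

```python
import numpy as np

# Toy CSL: sum each sample's loss over training epochs (zero extra cost
# during training), then rank samples. Mislabeled samples keep a high loss
# for longer and therefore accumulate more.
n_samples, n_epochs = 6, 50

# Simulated loss curves: clean samples decay fast; sample 0 barely learns.
decay = np.full(n_samples, 0.9)
decay[0] = 0.999
losses = np.array([decay ** t for t in range(n_epochs)])  # (epochs, samples)

csl = losses.sum(axis=0)          # cumulative sample loss per sample
suspect = int(np.argmax(csl))
print(suspect)                    # sample 0 accumulates the most loss
```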
Paperid:2286
Authors:Quansong He · Xiangde Min · Kaishen Wang · Tao He
Abstract:
Medical image segmentation is a critical task in computer vision, with UNet serving as a milestone architecture. However, UNet's skip connections face two significant limitations: (1) they lack effective interaction between features at different scales, and (2) they rely on simple concatenation or addition operations, which constrain efficient information integration. While recent improvements to UNet have focused on enhancing encoder and decoder capabilities, these limitations remain overlooked. To overcome these challenges, we propose a novel multi-scale feature fusion method that reimagines the UNet decoding process as solving an initial value problem (IVP), treating skip connections as discrete nodes. By leveraging principles from the linear multi-step method, we propose an adaptive ordinary differential equation method to enable effective multi-scale feature fusion. Our approach is independent of encoder and decoder designs, making it applicable to any U-like network. Experiments on ACDC, KiTS2023, MSD brain tumor, and ISIC2017/2018 skin lesion segmentation datasets demonstrate improved feature utilization, reduced network parameters, and maintained high performance. The code will be made available on GitHub.
Paperid:2287
Authors:Thore Gerlach · Loong Kuan Lee · Frederic BARBARESCO · Nico Piatkowski
Abstract:
Multi-Agent Path Finding (MAPF) focuses on determining conflict-free paths for multiple agents navigating through a shared space to reach specified goal locations. This problem becomes computationally challenging, particularly when handling large numbers of agents, as frequently encountered in practical applications like coordinating autonomous vehicles. Quantum computing (QC) is a promising candidate for overcoming such limits. However, current quantum hardware is still in its infancy and thus limited in terms of computing power and error robustness. In this work, we present the first optimal hybrid quantum-classical MAPF algorithms, which are based on branch-and-cut-and-price. QC is integrated by iteratively solving QUBO problems based on conflict graphs. Experiments on actual quantum hardware and results on benchmark data suggest that our approach dominates previous QUBO formulations and baseline MAPF solvers.
Paperid:2288
Authors:Sebastian Towers · Aleksandra Kalisz · Philippe Robert · Alicia Higueruelo · Francesca Vianello · Chloe Tsai · Harrison Steel · Jakob Foerster
Abstract:
Antiviral therapies are typically designed to target only the current strains of a virus, a myopic response. However, therapy-induced selective pressures drive the emergence of new viral strains, against which the original myopic therapies are no longer effective. This evolutionary response presents an opportunity: our therapies could both defend against and actively influence viral evolution. This motivates our method ADIOS: Antibody Development vIa Opponent Shaping. ADIOS is a meta-learning framework where the process of antibody therapy design, the outer loop, accounts for the virus's adaptive response, the inner loop. With ADIOS, antibodies are not only robust against potential future variants, they also influence, i.e. shape, which future variants emerge. In line with the opponent shaping literature, we refer to our optimised antibodies as shapers. To demonstrate the value of ADIOS, we build a viral evolution simulator using the Absolut! framework, in which shapers successfully target both current and future viral variants, outperforming myopic antibodies. Furthermore, we show that shapers modify the distribution over viral evolutionary trajectories to result in weaker variants. We believe that our ADIOS paradigm will facilitate the discovery of long-lived vaccines and antibody therapies while also generalising to other domains with evolutionarily adaptive opponents, such as antimicrobial resistance and cancer treatment. Our code is available at https://anonymous.4open.science/r/adios-2AFE/.
Paperid:2289
Authors:Seohong Park · Qiyang Li · Sergey Levine
Abstract:
We present flow Q-learning (FQL), a simple and performant offline reinforcement learning (RL) method that leverages an expressive flow-matching policy to model arbitrarily complex action distributions in data. Training a flow policy with RL is a tricky problem, due to the iterative nature of the action generation process. We address this challenge by training an expressive one-step policy with RL, rather than directly guiding an iterative flow policy to maximize values. This way, we can completely avoid unstable recursive backpropagation, eliminate costly iterative action generation at test time, yet still mostly maintain expressivity. We experimentally show that FQL leads to strong performance across 73 challenging state- and pixel-based OGBench and D4RL tasks in offline RL and offline-to-online RL.
Paperid:2290
Authors:Ngoc Bui · Menglin Yang · Runjin Chen · Leonardo Neves · Mingxuan Ju · ZHITAO YING · Neil Shah · Tong Zhao
Abstract:
Backward compatible representation learning enables updated models to integrate seamlessly with existing ones, avoiding the need to reprocess stored data. Despite recent advances, existing compatibility approaches in Euclidean space neglect the uncertainty in the old embedding models and force the new model to replicate outdated representations regardless of their quality, thereby hindering the learning process. In this paper, we switch perspectives to hyperbolic geometry, where we treat time as a natural axis for capturing a model's confidence and evolution. By lifting embeddings into hyperbolic space and constraining updated embeddings to lie within the entailment cone of the old ones, we maintain generational consistency across models while accounting for uncertainties in the representations. To further enhance compatibility, we introduce a robust contrastive alignment loss that dynamically adjusts alignment weights based on the uncertainty of the old embeddings. Experiments validate the superiority of the proposed method in achieving compatibility, paving the way for more resilient and adaptable machine learning systems.
Paperid:2291
Authors:Xiangzhe Kong · Zishen Zhang · Ziting Zhang · Rui Jiao · Jianzhu Ma · Kai Liu · Wenbing Huang · Yang Liu
Abstract:
The design of target-specific molecules such as small molecules, peptides, and antibodies is vital for biological research and drug discovery. Existing generative methods are restricted to single-domain molecules, failing to address versatile therapeutic needs or utilize cross-domain transferability to enhance model performance. In this paper, we introduce Unified generative Modeling of 3D Molecules (UniMoMo), the first framework capable of designing binders of multiple molecular domains using a single model. In particular, UniMoMo unifies the representations of different molecules as graphs of blocks, where each block corresponds to either a standard amino acid or a molecular fragment. Based on these unified representations, UniMoMo utilizes a geometric latent diffusion model for 3D molecular generation, featuring an iterative full-atom autoencoder to compress blocks into latent space points, followed by an E(3)-equivariant diffusion process. Extensive benchmarks across peptides, antibodies, and small molecules demonstrate the superiority of our unified framework over existing domain-specific models, highlighting the benefits of multi-domain training.
Paperid:2292
Authors:Chuan Liu · Chunshu Wu · Ruibing Song · Ang Li · Ying Nian Wu · Tong Geng
Abstract:
Learning equations is fundamental to human understanding of the world. While modern machine learning (ML) methods are powerful equation learners, their escalating complexity and high operational costs hinder sustainable development. In contrast, natural dynamical systems effortlessly learn complex equations and solve them by instinctively evolving to low-energy states. However, existing dynamical system approaches are limited by low expressivity and a lack of training support. To this end, we propose EADS, an Expressive and self-Adaptive Dynamical System capable of accurately learning a wide spectrum of equations with extraordinary efficiency. Specifically, (1) drawing inspiration from biological dynamical systems, we integrate hierarchical architectures and heterogeneous dynamics into EADS, significantly enhancing its capacity to model complex equations. (2) We propose an efficient on-device learning method that leverages intrinsic electrical signals to update parameters, making EADS self-adaptive at a negligible cost. Experimental results across key equations from diverse domains demonstrate that EADS achieves higher accuracy than existing works, while offering orders-of-magnitude speedups and energy efficiency over traditional neural network solutions on GPUs for both inference and training, showcasing its broader impact in overcoming computational bottlenecks across various fields.
Paperid:2293
Authors:Ali Bahri · Moslem Yazdanpanah · Sahar Dastani · Mehrdad Noori · Gustavo Vargas Hakim · David OSOWIECHI · Farzad Beizaee · Ismail Ben Ayed · Christian Desrosiers
Abstract:
Test-Time Training has emerged as a promising solution to address distribution shifts in 3D point cloud classification. However, existing methods often rely on computationally expensive backpropagation during adaptation, limiting their applicability in real-world, time-sensitive scenarios. In this paper, we introduce SMART-PC, a skeleton-based framework that enhances resilience to corruptions by leveraging the geometric structure of 3D point clouds. During pre-training, our method predicts skeletal representations, enabling the model to extract robust and meaningful geometric features that are less sensitive to corruptions, thereby improving adaptability to test-time distribution shifts. Unlike prior approaches, SMART-PC achieves real-time adaptation by eliminating backpropagation and updating only BatchNorm statistics, resulting in a lightweight and efficient framework capable of achieving high frame-per-second rates while maintaining superior classification performance. Extensive experiments on benchmark datasets, including ModelNet40-C, ShapeNet-C, and ScanObjectNN-C, demonstrate that SMART-PC achieves state-of-the-art results, outperforming existing methods such as MATE in terms of both accuracy and computational efficiency.
Paperid:2294
Authors:Yuwei Liu · Jiaming Zhuo · Di Jin · Chuan Wang · Zhen Wang · Xiaochun Cao
Abstract:
Brain network analysis plays a critical role in brain disease prediction and diagnosis, and graph mining tools have made remarkable progress. Graph neural networks (GNNs) and Transformers, which rely on the message-passing scheme, recently dominated this field due to their powerful expressive ability on graph data. Unfortunately, by considering brain network construction using pairwise Pearson's coefficients between any pairs of ROIs, model analysis and experimental verification reveal that \textit{the message-passing under both GNNs and Transformers can NOT be fully explored and exploited}. Surprisingly, this paper observes significant performance enhancements of the Hadamard product compared to the matrix product, which is the matrix form of message passing, in processing the brain network. Inspired by this finding, a novel Brain Quadratic Network (BQN) is proposed by incorporating quadratic networks, which possess better universal approximation properties. Theoretical analysis demonstrates that BQN implicitly performs community detection along with representation learning. Extensive evaluations verify the superiority of the proposed BQN compared to message-passing-based brain network modeling.
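A small sketch (our illustration) of the contrast the abstract draws, on a toy Pearson-correlation brain network: the matrix product mixes features across ROIs (message passing), while the Hadamard product acts purely element-wise:

```python
import numpy as np

# Toy brain network: A is a Pearson-correlation connectivity matrix over
# ROIs; X holds node features. A @ X is the matrix form of message passing
# (each row aggregates features from all ROIs); A * X is the element-wise
# Hadamard product, which performs no cross-ROI mixing. Shapes are chosen
# square so both operations are defined on the same operands.
rng = np.random.default_rng(0)
n_roi, d = 5, 5
A = np.corrcoef(rng.normal(size=(n_roi, 20)))   # Pearson connectivity
X = rng.normal(size=(n_roi, d))                 # node features

message_passing = A @ X     # neighbourhood aggregation across ROIs
hadamard = A * X            # element-wise gating
print(message_passing.shape, hadamard.shape)
```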
Paperid:2295
Authors:Jian Gao · Weidong Cao · Xuan Zhang
Abstract:
The sustainable performance improvements of integrated circuits (ICs) drive the continuous advancement of nearly all transformative technologies. Since its invention, IC performance enhancements have been dominated by scaling the semiconductor technology. Yet, as Moore's law tapers off, a crucial question arises: ***How can we sustain IC performance in the post-Moore era?*** Creating new circuit topologies has emerged as a promising pathway to address this fundamental need. This work proposes **AnalogGenie-Lite**, a decoder-only **Gen**erat**i**ve **e**ngine to discover novel circuit topologies for **analog** ICs with significantly enhanced scalability and precision through **li**ghtw**e**igh**t** graph modeling. AnalogGenie-Lite makes several unique contributions, including concise device-pin representations (i.e., advancing the best prior art from $O\left(n^2\right)$ to $O\left(n\right)$), frequent sub-graph mining, and optimal sequence modeling. Thus, it remarkably pushes the frontier of generative-AI-based methods for analog circuit topology discovery, i.e., $5.15\times$ to $71.11\times$ in scalability and 23.5\% to 33.6\% improvement in validity, compared to the state-of-the-art. Case studies on social networks and protein graphs are also provided to show the broader applicability of the proposed graph modeling approach. Source code: https://github.com.
Paperid:2296
Authors:Yiming Qin · Manuel Madeira · Dorina Thanou · Pascal Frossard
Abstract:
Graph generative models are essential across diverse scientific domains by capturing complex distributions over relational data. Among them, graph diffusion models achieve superior performance but face inefficient sampling and limited flexibility due to the tight coupling between training and sampling stages. We introduce DeFoG, a novel graph generative framework that disentangles sampling from training, enabling a broader design space for more effective and efficient model optimization. DeFoG employs a discrete flow-matching formulation that respects the inherent symmetries of graphs. We theoretically ground this disentangled formulation by explicitly relating the training loss to the sampling algorithm and showing that DeFoG faithfully replicates the ground truth graph distribution. Building on these foundations, we thoroughly investigate DeFoG's design space and propose novel sampling methods that significantly enhance performance and reduce the required number of refinement steps. Extensive experiments demonstrate state-of-the-art performance across synthetic, molecular, and digital pathology datasets, covering both unconditional and conditional generation settings. It also outperforms most diffusion-based models with just 5–10\% of their sampling steps.
Paperid:2297
Authors:Bowen Deng · Lele Fu · Jialong Chen · Sheng Huang · Tianchi Liao · Zhang Tao · Chuan Chen
Abstract:
Generalized Category Discovery (GCD) aims to identify both known and novel categories in unlabeled data by leveraging knowledge from old classes. However, existing methods are limited to non-graph data and lack theoretical foundations to answer when and how known classes can help GCD. We introduce the Graph GCD task and provide the first rigorous theoretical analysis of parametric GCD. By quantifying the relationship between old and new classes in the embedding space using the Wasserstein distance W, we derive the first provable GCD loss bound based on W. This analysis highlights two necessary conditions for effective GCD. However, we uncover, through a Pairwise Markov Random Field perspective, that popular graph contrastive learning (GCL) methods inherently violate these conditions. To address this limitation, we propose SWIRL, a novel GCL method for GCD. Experimental results validate our (theoretical) findings and demonstrate SWIRL's effectiveness.
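For intuition, a toy sketch (our construction, not the paper's analysis) of the quantity the bound is built on: the Wasserstein distance between old-class and novel-class embedding samples, which in 1-D with equal sample sizes reduces to the mean absolute difference of the sorted values:

```python
import numpy as np

# 1-D Wasserstein-1 distance between two equal-size samples: match the
# sorted samples and average the absolute differences.
def wasserstein_1d(a, b):
    return np.abs(np.sort(a) - np.sort(b)).mean()

rng = np.random.default_rng(0)
old_class = rng.normal(loc=0.0, size=1000)   # embeddings of a known class
near_new = rng.normal(loc=0.5, size=1000)    # novel class close to it
far_new = rng.normal(loc=3.0, size=1000)     # novel class far from it

# A novel class whose distribution overlaps an old class yields a smaller W,
# which is where conditions on "when known classes help" come from.
print(wasserstein_1d(old_class, near_new) < wasserstein_1d(old_class, far_new))
```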
Paperid:2298
Authors:Jiale Ma · Wenzheng Pan · Yang Li · Junchi Yan
Abstract:
Despite rapid progress in neural combinatorial optimization (NCO) for solving CO problems (COPs), as the problem scale grows, several bottlenecks persist: 1) solvers in the Global Prediction (GP) paradigm struggle with long-range decisions, where overly smooth intermediate heatmaps impede effective decoding, and 2) solvers in the Local Construction (LC) paradigm are time-consuming and incapable of tackling large instances due to the onerous auto-regressive process. Observing these challenges, we propose a new paradigm named Adaptive Expansion (AE) with its instantiation COExpander, positioned to leverage the advantages of both GP and LC. COExpander utilizes informative heatmaps generated by a global predictor, which is learned under the guidance of locally determined partial solutions, to in turn direct the expansion of determined decision variables with adaptive step-sizes. To ensure transparent evaluation, we further take the lead in canonicalizing 29 benchmarks spanning 6 popular COPs (MIS, MCl, MVC, MCut, TSP, ATSP) and various scales (50-10K nodes), upon which experiments demonstrate concrete SOTA performance of COExpander over these tasks. Source code and our standardized datasets will be made public.
Paperid:2299
Authors:Seul Lee · Karsten Kreis · Srimukh Veccham · Meng Liu · Danny Reidenbach · Yuxing Peng · Saee Paliwal · Weili Nie · Arash Vahdat
Abstract:
Drug discovery is a complex process that involves multiple stages and tasks. However, existing molecular generative models can only tackle some of these tasks. We present Generalist Molecular generative model (GenMol), a versatile framework that uses only a single discrete diffusion model to handle diverse drug discovery scenarios. GenMol generates Sequential Attachment-based Fragment Embedding (SAFE) sequences through non-autoregressive bidirectional parallel decoding, thereby allowing the utilization of a molecular context that does not rely on a specific token ordering while having better sampling efficiency. GenMol uses fragments as basic building blocks for molecules and introduces fragment remasking, a strategy that optimizes molecules by regenerating masked fragments, enabling effective exploration of chemical space. We further propose molecular context guidance (MCG), a guidance method tailored for the masked discrete diffusion of GenMol. GenMol significantly outperforms the previous GPT-based model in de novo generation and fragment-constrained generation, and achieves state-of-the-art performance in goal-directed hit generation and lead optimization. These results demonstrate that GenMol can tackle a wide range of drug discovery tasks, providing a unified and versatile approach for molecular design.
Paperid:2300
Authors:John Schultz · Jakub Adamek · Matej Jusup · Marc Lanctot · Michael Kaisers · Sarah Perrin · Daniel Hennes · Jeremy Shar · Cannada Lewis · Anian Ruoss · Tom Zahavy · Petar Veličković · Laurel Prince · Satinder Singh · Eric Malmi · Nenad Tomasev
Abstract:
Advancing the planning and reasoning capabilities of Large Language Models (LLMs) is one of the key prerequisites to unlocking their potential for performing reliably in complex and impactful domains. In this paper, we aim to demonstrate this across board games (Chess, Fischer Random / Chess960, Connect Four, and Hex), and we show that search-based planning can yield significant improvements in LLM game-playing strength. We introduce, compare and contrast two major approaches: In external search, the model guides Monte Carlo Tree Search (MCTS) rollouts and evaluations without calls to an external game engine, and in internal search, the model is trained to generate in-context a linearized tree of search and a resulting final choice. Both build on a language model pre-trained on relevant domain knowledge, reliably capturing the transition and value functions in the respective environments, with minimal hallucinations. We evaluate our LLM search implementations against game-specific state-of-the-art engines, showcasing substantial improvements in strength over the base model, and reaching Grandmaster-level performance in chess while operating closer to the human search budget. Our proposed approach, combining search with domain knowledge, is not specific to board games, hinting at more general future applications.
Paperid:2301
Authors:Alexandra Proca · Clémentine Dominé · Murray Shanahan · Pedro Mediano
Abstract:
Recurrent neural networks (RNNs) are powerful models used widely in both machine learning and neuroscience to learn tasks with temporal dependencies and to model neural dynamics. However, despite significant advancements in the theory of RNNs, there is still limited understanding of their learning process and the impact of the temporal structure of data. Here, we aim to bridge this gap by analyzing the learning dynamics in linear RNNs (LRNNs) analytically. We derive an analytical solution to the learning dynamics of LRNNs under certain conditions, enabled by a novel framework that accounts for task dynamics. Our mathematical result further reveals three key properties of LRNNs: (1) Task dynamics impact solution stability and extrapolation. (2) The tradeoff between recurrent and feedforward computation is governed by a phase transition that leads to low-rank solutions. (3) Recurrence facilitates rich learning, as shown through a novel derivation of the neural tangent kernel for finite-width LRNNs. As a final proof-of-concept, we apply our theoretical framework to explain the behavior of LRNNs performing sensory integration tasks. Our work provides a first analytical treatment of the relationship between the temporal dependencies in tasks and learning dynamics in LRNNs, building a foundation for understanding how complex dynamic behavior emerges in cognitive models.
Paperid:2302
Authors:Ruyi Xu · Guangxuan Xiao · Haofeng Huang · Junxian Guo · Song Han
Abstract:
Long-Context Transformer Models (LCTMs) are vital for real-world applications but suffer high computational costs due to attention's quadratic complexity. Block-sparse attention mitigates this by focusing computation on critical regions, yet existing methods struggle with balancing accuracy and efficiency due to costly block importance measurements. In this paper, we introduce XAttention, a plug-and-play framework that dramatically accelerates long-context inference in Transformer models using sparse attention. XAttention's key innovation is the insight that the sum of antidiagonal values (i.e., from the lower-left to upper-right) in the attention matrix provides a powerful proxy for block importance. This allows for precise identification and pruning of non-essential blocks, resulting in high sparsity and dramatically accelerated inference. Across comprehensive evaluations on demanding long-context benchmarks—including RULER and LongBench for language, VideoMME for video understanding, and VBench for video generation—XAttention achieves accuracy comparable to full attention while delivering substantial computational gains. We demonstrate up to 13.5x acceleration in attention computation. These results underscore XAttention's ability to unlock the practical potential of block-sparse attention, paving the way for scalable and efficient deployment of LCTMs in real-world applications. Code will be made publicly available upon publication.
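The antidiagonal-sum scoring idea from this abstract can be illustrated with a minimal NumPy sketch. The matrix size, block size, and median thresholding rule below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def antidiagonal_block_scores(attn, block=4):
    """Score each (block x block) tile of an attention matrix by the
    sum of its antidiagonal (lower-left to upper-right) entries."""
    n = attn.shape[0]
    assert n % block == 0
    scores = np.zeros((n // block, n // block))
    for i in range(0, n, block):
        for j in range(0, n, block):
            tile = attn[i:i + block, j:j + block]
            # np.fliplr turns the antidiagonal into the main diagonal
            scores[i // block, j // block] = np.trace(np.fliplr(tile))
    return scores

rng = np.random.default_rng(0)
attn = rng.random((8, 8))        # stand-in attention matrix
scores = antidiagonal_block_scores(attn)
# keep only blocks whose score exceeds the median -> a sparsity mask
mask = scores > np.median(scores)
print(scores.shape, int(mask.sum()))
```

Blocks with a `False` mask entry would be skipped entirely during the attention computation, which is where the claimed speedup comes from.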
Paperid:2303
Authors:Haoyuan Qin · Zhengzhu Liu · Chenxing Lin · Chennan Ma · Songzhu Mei · Siqi Shen · Cheng Wang
Abstract:
Parameter-sharing (PS) techniques have been widely adopted in cooperative Multi-Agent Reinforcement Learning (MARL). In PS, all the agents share a policy network with identical parameters, which enjoys good sample efficiency. However, PS could lead to homogeneous policies that limit MARL performance. We tackle this problem from the angle of gradient conflict among agents. We find that the existence of futile neurons whose update is canceled out by gradient conflicts among agents leads to poor learning efficiency and diversity. To address this deficiency, we propose GradPS, a gradient-based PS method. It dynamically creates multiple clones for each futile neuron. For each clone, a group of agents with low gradient-conflict shares the neuron's parameters, which are updated according to the gradients of each agent group. Our method can enjoy good sample efficiency by sharing the gradients among agents of the same clone neuron. Moreover, it can encourage diverse behaviors through independently updating an exclusive clone neuron, without gradient conflict. Through extensive experiments, we show that GradPS can learn diverse policies with promising performance.
Paperid:2304
Authors:Robbert Reijnen · Yaoxin Wu · Zaharah Bukhsh · Yingqian Zhang
Abstract:
Deep reinforcement learning (DRL) has been widely used for dynamic algorithm configuration, particularly in evolutionary computation, which benefits from the adaptive update of parameters during the algorithmic execution. However, applying DRL to algorithm configuration for multi-objective combinatorial optimization (MOCO) problems remains relatively unexplored. This paper presents a novel graph neural network (GNN) based DRL to configure multi-objective evolutionary algorithms. We model the dynamic algorithm configuration as a Markov decision process, representing the convergence of solutions in the objective space by a graph, with their embeddings learned by a GNN to enhance the state representation. Experiments on diverse MOCO challenges indicate that our method outperforms traditional and DRL-based algorithm configuration methods in terms of efficacy and adaptability. It also exhibits advantageous generalizability across objective types and problem sizes, and applicability to different evolutionary computation methods.
Paperid:2305
Authors:Donghwa Kim · Jaewook Lee · Chulhee Yun
Abstract:
We analyze the convergence rates of two popular variants of coordinate descent (CD): random CD (RCD), in which the coordinates are sampled uniformly at random, and random-permutation CD (RPCD), in which random permutations are used to select the update indices. Despite abundant empirical evidence that RPCD outperforms RCD in various tasks, the theoretical gap between the two algorithms’ performance has remained elusive. Even for the benign case of positive-definite quadratic functions with permutation-invariant Hessians, previous efforts have failed to demonstrate a provable performance gap between RCD and RPCD. To this end, we present novel results showing that, for a class of quadratics with permutation-invariant structures, the contraction rate upper bound for RPCD is always strictly smaller than the contraction rate lower bound for RCD for every individual problem instance. Furthermore, we conjecture that this function class contains the worst-case examples of RPCD among all positive-definite quadratics. Combined with our RCD lower bound, this conjecture extends our results to the general class of positive-definite quadratic functions.
Paperid:2306
Authors:Junlin Wang · Roy Xie · Shang Zhu · Jue Wang · Ben Athiwaratkun · Bhuwan Dhingra · Shuaiwen Song · Ce Zhang · James Zou
Abstract:
Building helpful and harmless large language models (LLMs) requires an effective model alignment approach based on human instructions and feedback, which necessitates high-quality human-labeled data. Constructing such datasets is often expensive and hard to scale, and may face potential limitations on diversity and generalization. To address these challenges, we introduce Mixture of Agents Alignment (MoAA), which leverages the collective strengths of various language models to provide high-quality data for model alignment. By employing MoAA, we enhance both supervised fine-tuning and preference optimization, leading to improved performance compared to using a single model alone to generate alignment data (e.g. using GPT-4o alone). Evaluation results show that our approach can improve the win rate of LLaMA-3.1-8B-Instruct from 19.5 to 48.3 on Arena-Hard and from 22.33 to 57.23 on AlpacaEval2, highlighting a promising direction for model alignment through this new scalable and diverse synthetic data recipe. Furthermore, we demonstrate that MoAA enables a self-improvement pipeline, where models fine-tuned on MoA-generated data surpass their own initial capabilities, providing evidence that our approach can push the frontier of open-source LLMs without reliance on stronger external supervision. Data and code will be released.
Paperid:2307
Authors:Sung June Kim · Gyeongrok Oh · Heeju Ko · Daehyun Ji · Dongwook Lee · Byung-Jun Lee · Sujin Jang · Sangpil Kim
Abstract:
Navigating in an unfamiliar environment during deployment poses a critical challenge for a vision-language navigation (VLN) agent. Yet, test-time adaptation (TTA) remains relatively underexplored in robotic navigation, leading us to the fundamental question: what are the key properties of TTA for online VLN? In our view, effective adaptation requires three qualities: 1) flexibility in handling different navigation outcomes, 2) interactivity with the external environment, and 3) maintaining harmony between plasticity and stability. To address this, we introduce FeedTTA, a novel TTA framework for online VLN utilizing feedback-based reinforcement learning. Specifically, FeedTTA learns by maximizing binary episodic feedback, a practical setup in which the agent receives a binary scalar after each episode that indicates the success or failure of the navigation. Additionally, we propose a gradient regularization technique that leverages the binary structure of FeedTTA to achieve a balance between plasticity and stability during adaptation. Our extensive experiments on challenging VLN benchmarks demonstrate the superior adaptability of FeedTTA, even outperforming state-of-the-art offline training methods on the REVERIE benchmark with a single stream of learning.
Paperid:2308
Authors:Erpai Luo · Xinran Wei · Lin Huang · Yunyang Li · Han Yang · wang · Chang Liu · Zaishuo Xia · Jia Zhang · Bin Shao
Abstract:
Hamiltonian matrix prediction is pivotal in computational chemistry, serving as the foundation for determining a wide range of molecular properties. While SE(3) equivariant graph neural networks have achieved remarkable success in this domain, their substantial computational cost—driven by high-order tensor product (TP) operations—restricts their scalability to large molecular systems with extensive basis sets. To address this challenge, we introduce SPHNet, an efficient and scalable equivariant network that incorporates adaptive SParsity into Hamiltonian prediction. SPHNet employs two innovative sparse gates to selectively constrain non-critical interaction combinations, significantly reducing tensor product computations while maintaining accuracy. To optimize the sparse representation, we develop a Three-phase Sparsity Scheduler, ensuring stable convergence and achieving high performance at sparsity rates of up to 70\%. Extensive evaluations on QH9 and PubchemQH datasets demonstrate that SPHNet achieves state-of-the-art accuracy while providing up to a 7x speedup over existing models. Beyond Hamiltonian prediction, the proposed sparsification techniques also hold significant potential for improving the efficiency and scalability of other SE(3) equivariant networks, further broadening their applicability and impact.
Paperid:2309
Authors:Hang Guo · Yawei Li · Tao Dai · Shutao Xia · Luca Benini
Abstract:
Fine-tuning pre-trained diffusion models under limited training budgets has achieved great success. In particular, the recent advances that directly fine-tune the quantized weights using Low-rank Adaptation (LoRA) further push the training efficiency. Despite the reduced training cost, we point out that the existing adaptation recipe is not inference-efficient. Specifically, additional post-training quantization (PTQ) on the tuned weights is needed for accelerated inference, which can result in a noticeable performance drop when the bit-width is low. Based on this observation, we introduce IntLoRA, which adapts the quantized diffusion models with integer-type low-rank parameters, to include inference efficiency during tuning. Specifically, our IntLoRA enables the pre-trained weights to be quantized during training, facilitating fine-tuning on consumer-level GPUs. During inference, our IntLoRA weights can be seamlessly merged into pre-trained weights to obtain the quantized downstream weights without PTQ. Extensive experiments show our IntLoRA achieves significant speedup on both training and inference without losing performance.
Paperid:2310
Authors:Zhengxuan Wu · Aryaman Arora · Atticus Geiger · Zheng Wang · Jing Huang · Dan Jurafsky · Christopher Manning · Christopher Potts
Abstract:
Fine-grained steering of language model outputs is essential for safety and reliability. Prompting and finetuning are widely used to achieve these goals, but interpretability researchers have proposed a variety of representation-based techniques as well, including sparse autoencoders (SAEs), linear artificial tomography, supervised steering vectors, linear probes, and representation finetuning. At present, there is no benchmark for making direct comparisons between these proposals. Therefore, we introduce AxBench, a large-scale benchmark for steering and concept detection, and report experiments on Gemma-2-2B and 9B. For steering, we find that prompting outperforms all existing methods, followed by finetuning. For concept detection, representation-based methods such as difference-in-means perform the best. On both evaluations, SAEs are not competitive. We introduce a novel weakly-supervised representational method (Rank-1 Representation Finetuning; ReFT-r1), which is competitive on both tasks while providing the interpretability advantages that prompting lacks. Along with AxBench, we train and publicly release SAE-scale feature dictionaries for ReFT-r1 and DiffMean.
Paperid:2311
Authors:Bowen Jin · Jinsung Yoon · Zhen Qin · Ziqi Wang · Wei Xiong · Yu Meng · Jiawei Han · Sercan Arik
Abstract:
Large Language Models (LLMs) have revolutionized artificial intelligence with capabilities in reasoning, coding, and communication, driving innovation across industries. Their true potential depends on effective alignment to ensure correct, trustworthy and ethical behavior, addressing challenges like misinformation, hallucinations, bias and misuse. While existing Reinforcement Learning (RL)-based alignment methods are notoriously complex, direct optimization approaches offer a simpler alternative. In this work, we introduce a novel direct optimization approach for LLM alignment by drawing on established Information Retrieval (IR) principles. We present a systematic framework that bridges LLM alignment and IR methodologies, mapping LLM generation and reward models to IR's retriever-reranker paradigm. Building on this foundation, we propose LLM Alignment as Retriever Preference Optimization (LarPO), a new alignment method that enhances overall alignment quality. Extensive experiments validate LarPO's effectiveness with 38.9% and 13.7% average improvements on AlpacaEval2 and MixEval-Hard, respectively. Our work opens new avenues for advancing LLM alignment by integrating IR foundations, offering a promising direction for future research.
Paperid:2312
Authors:Chinmay Savadikar · Xi Song · Tianfu Wu
Abstract:
Fine-tuning large pretrained Transformer models can focus on either introducing a small number of new learnable parameters (parameter efficiency) or editing representations of a small number of tokens using lightweight modules (representation efficiency). While the pioneering method LoRA (Low-Rank Adaptation) inherently balances parameter, compute, and memory efficiency, many subsequent variants trade off compute and memory efficiency and/or performance to further reduce fine-tuning parameters. To address this limitation and unify parameter-efficient and representation-efficient fine-tuning, we propose Weight-Aware Fine-Tuning (WAFT), a novel approach that learns to generate fine-tuning weights directly from the pretrained weights. WAFT employs a simple low-rank formulation consisting of two linear layers, shared across multiple layers of the pretrained model. This design achieves multi-faceted efficiency in parameters, representations, compute, and memory, while maintaining or exceeding the performance of LoRA and its variants. Extensive experiments on commonsense reasoning, arithmetic reasoning, instruction following, code generation, and visual recognition verify the effectiveness of our proposed WAFT.
Paperid:2313
Authors:Jing Qiao · Yu Liu · Zengzhe Chen · Mingyi Li · YUAN YUAN · Xiao Zhang · Dongxiao Yu
Abstract:
This paper investigates decentralized unlearning, aiming to eliminate the impact of a specific client on the whole decentralized system. However, the characteristics of decentralized communication pose new challenges for effective unlearning: the indirect connections make it difficult to trace the specific client's impact, while the dynamic topology limits the scalability of retraining-based unlearning methods. In this paper, we propose the first **P**rovable **D**ecentralized **U**nlearning algorithm under **D**ynamic **T**opologies, called PDUDT. It allows clients to eliminate the influence of a specific client without additional communication or retraining. We provide rigorous theoretical guarantees for PDUDT, showing it is statistically indistinguishable from perturbed retraining. Additionally, it achieves an efficient convergence rate of $\mathcal{O}(\frac{1}{T})$ in subsequent learning, where $T$ is the total number of communication rounds. This rate matches state-of-the-art results. Experimental results show that compared with the Retrain method, PDUDT saves more than 99\% of unlearning time while achieving comparable unlearning performance.
Paperid:2314
Authors:Xinyuan Fan · Bufan Li · Chenlei Leng · Weichi Wu
Abstract:
This paper introduces Graphon Attachment Network Models (GAN-M), a novel framework for modeling evolving networks with rich structural dependencies, grounded in graphon theory. GAN-M provides a flexible and interpretable foundation for studying network formation by leveraging graphon functions to define attachment probabilities, thereby combining the strengths of graphons with a temporal perspective. A key contribution of this work is a methodology for learning structural changes in these networks over time. Our approach uses graph counts---frequencies of substructures such as triangles and stars---to capture shifts in network topology. We propose a new statistic designed to learn changes in the resulting piecewise polynomial signals and develop an efficient method for change detection, supported by theoretical guarantees. Numerical experiments demonstrate the effectiveness of our approach across various network settings, highlighting its potential for dynamic network analysis.
Paperid:2315
Authors:Matteo Zecchin · Sangwoo Park · Osvaldo Simeone
Abstract:
We introduce adaptive learn-then-test (aLTT), an efficient hyperparameter selection procedure that provides finite-sample statistical guarantees on the population risk of AI models. Unlike the existing learn-then-test (LTT) technique, which relies on conventional p-value-based multiple hypothesis testing (MHT), aLTT implements sequential data-dependent MHT with early termination by leveraging e-processes. As a result, aLTT can reduce the number of testing rounds, making it particularly well-suited for scenarios in which testing is costly or presents safety risks. Apart from maintaining statistical validity, in applications such as online policy selection for offline reinforcement learning and prompt engineering, aLTT is shown to achieve the same performance as LTT while requiring only a fraction of the testing rounds.
Paperid:2316
Authors:Mingjun Pan · Guanquan Lin · You-Wei Luo · Zhien Dai · Bin Zhu · Lijun Sun · Chun Yuan
Abstract:
Reinforcement Learning (RL) has emerged as a powerful tool for neural combinatorial optimization, enabling models to learn heuristics that solve complex problems without requiring expert knowledge. Despite significant progress, existing RL approaches face challenges such as diminishing reward signals and inefficient exploration in vast combinatorial action spaces, leading to inefficiency. In this paper, we propose Preference Optimization, a novel method that transforms quantitative reward signals into qualitative preference signals via statistical comparison modeling, emphasizing the superiority among sampled solutions. Methodologically, by reparameterizing the reward function in terms of policy and utilizing preference models, we formulate an entropy-regularized RL objective that aligns the policy directly with preferences while avoiding intractable computations. Furthermore, we integrate local search techniques into fine-tuning rather than post-processing to generate high-quality preference pairs, helping the policy escape local optima. Empirical results on various benchmarks, such as the Traveling Salesman Problem (TSP), the Capacitated Vehicle Routing Problem (CVRP) and the Flexible Flow Shop Problem (FFSP), demonstrate that our method significantly outperforms existing RL algorithms, achieving superior convergence efficiency and solution quality. Our work offers a simple yet efficient algorithmic advancement in RL for combinatorial optimization.
Paperid:2317
Authors:Xinyu Yang · Tom Zollo · Benjamin Eyre · Amogh Inamdar · David Madras · Richard Zemel
Abstract:
As machine learning models grow increasingly competent, their predictions are being used to supplement scarce or expensive data for estimating important quantities. By combining a small set of high-fidelity observed data (i.e., genuine measurements) with a larger set of imputed data (i.e., model predictions), practitioners can enhance estimation quality beyond what either source provides in isolation. Though this paradigm shows promise, existing frameworks focus narrowly on estimating means or single quantiles, limiting their applicability for many critical domains and use cases. To address this challenge, we introduce QuEst, a framework that incorporates both observed and imputed data to estimate, and provide rigorous confidence intervals for, quantile-based distributional measures. Such quantile-based measures include tail measures such as CVaR, population segments like quartiles, and other key quantities of interest in fields including economics, sociology, education, and medicine. As part of QuEst, we also introduce an algorithm to estimate these statistics for multidimensional measurements and metrics. Further, we offer a novel spline-function-based method for optimizing our method (as well as other existing methods for such hybrid estimation). We demonstrate the utility of our framework through experiments in economic modeling, opinion polling, and language model auto-evaluation.
Paperid:2318
Authors:Ning Liu · Yue Yu
Abstract:
Attention mechanisms have emerged as transformative tools in core AI domains such as natural language processing and computer vision. Yet, their largely untapped potential for modeling intricate physical systems presents a compelling frontier. Learning such systems often entails discovering operators that map between functional spaces using limited instances of function pairs, a task commonly framed as a severely ill-posed inverse PDE problem. In this work, we introduce Neural Interpretable PDEs (NIPS), a novel neural operator architecture that builds upon and enhances Nonlocal Attention Operators (NAO) in both predictive accuracy and computational efficiency. NIPS employs a linear attention mechanism to enable scalable learning and integrates a learnable kernel network that acts as a channel-independent convolution in Fourier space. As a consequence, NIPS eliminates the need to explicitly compute and store large pairwise interactions, effectively amortizing the cost of handling spatial interactions into the Fourier transform. Empirical evaluations demonstrate that NIPS consistently surpasses NAO and other baselines across diverse benchmarks, heralding a substantial leap in scalable, interpretable, and efficient physics learning.
Paperid:2319
Authors:Zhengyu Fang · Zhimeng Jiang · Huiyuan Chen · Xiao Li · Jing Li
Abstract:
Tabular data generation has attracted significant research interest in recent years, with the tabular diffusion models greatly improving the quality of synthetic data. However, while memorization—where models inadvertently replicate exact or near-identical training data—has been thoroughly investigated in image and text generation, its effects on tabular data remain largely unexplored. In this paper, we conduct the first comprehensive investigation of memorization phenomena in diffusion models for tabular data. Our empirical analysis reveals that memorization appears in tabular diffusion models and increases with larger training epochs. We further examine the influence of factors such as dataset sizes, feature dimensions, and different diffusion models on memorization. Additionally, we provide a theoretical explanation for why memorization occurs in tabular diffusion models. To address this issue, we propose TabCutMix, a simple yet effective data augmentation technique that exchanges randomly selected feature segments between random same-class training sample pairs. Building upon this, we introduce TabCutMixPlus, an enhanced method that clusters features based on feature correlations and ensures that features within the same cluster are exchanged together during augmentation. This clustering mechanism mitigates out-of-distribution (OOD) generation issues by maintaining feature coherence. Experimental results across various datasets and diffusion models demonstrate that TabCutMix effectively mitigates memorization while maintaining high-quality data generation. Our code is available at https://anonymous.4open.science/r/TabCutMix-3F7B.
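The feature-segment exchange described in this abstract can be sketched in a few lines of NumPy. The pairing scheme and fraction of swapped columns below are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def tabcutmix(X, y, frac=0.5, rng=None):
    """Swap a random subset of feature columns between pairs of
    training samples that share the same class label."""
    rng = np.random.default_rng(rng)
    X_aug = X.copy()
    n_feat = X.shape[1]
    k = max(1, int(frac * n_feat))
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        partner = rng.permutation(idx)            # random same-class pairing
        cols = rng.choice(n_feat, size=k, replace=False)
        # exchange the selected feature segment within the class
        X_aug[np.ix_(idx, cols)] = X[np.ix_(partner, cols)]
    return X_aug

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))
y = np.array([0, 0, 0, 1, 1, 1])
X_aug = tabcutmix(X, y, rng=2)
print(X_aug.shape)
```

Because values are only permuted within a class, each column's per-class marginal distribution is preserved, which is what makes the augmentation cheap to reason about.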
Paperid:2320
Authors:Suyu Liu · Zhiguang Cao · Shanshan Feng · Yew Soon ONG
Abstract:
Solving different kinds of vehicle routing problems (VRPs) with a unified neural solver has gained huge attention over the past years. Despite their effectiveness, existing neural multi-task solvers often overlook the geometric structures inherent in different tasks, which can lead to sub-optimal performance. To address this potential limitation, in this paper, we propose a curvature-aware pre-training framework that enables the model to fully exploit the rich geometric information embedded in the tasks. Specifically, we introduce a multi-task vehicle routing solver that incorporates mixed-curvature spaces during the feature fusion stage, encouraging the model to capture the underlying geometric properties of each instance. Through extensive experiments, we evaluate the proposed strategy on existing neural solver architectures and various testing scenarios. The results demonstrate that the curvature-aware pre-training approach not only brings benefits to the generalization abilities of existing neural VRP solvers on synthetic datasets, but also improves solution quality on real-world benchmarks.
Paperid:2321
Authors:Ruilin Wang · Huixia Li · Yuexiao Ma · Xiawu Zheng · Fei Chao · Xuefeng Xiao · Rongrong Ji
Abstract:
Inference latency stands as a critical bottleneck in the large-scale deployment of Large Language Models (LLMs). Speculative decoding methods have recently shown promise in accelerating inference without compromising the output distribution. However, existing work typically relies on a dualistic draft-verify framework and lacks rigorous theoretical grounding. In this paper, we introduce a novel \emph{polybasic} speculative decoding framework, underpinned by a comprehensive theoretical analysis. Specifically, we prove a fundamental theorem that characterizes the optimal inference time for multi-model speculative decoding systems, shedding light on how to extend beyond the dualistic approach to a more general polybasic paradigm. Through our theoretical investigation of multi-model token generation, we expose and optimize the interplay between model capabilities, acceptance lengths, and overall computational cost. Our framework supports both standalone implementation and integration with existing speculative techniques, leading to accelerated performance in practice. Experimental results across multiple model families demonstrate that our approach yields speedup ratios ranging from $3.31\times$ to $4.01\times$ for LLaMA2-Chat 7B, up to $3.87\times$ for LLaMA3-8B, up to $4.43\times$ for Vicuna-7B and up to $3.85\times$ for Qwen2-7B---all while preserving the original output distribution. We release our theoretical proofs and implementation code to facilitate further investigation into polybasic speculative decoding.
Paperid:2322
Authors:Ruitao Pu · Yang Qin · Xiaomin Song · Dezhong Peng · Zhenwen Ren · Yuan Sun
Abstract:
Recently, numerous cross-modal hashing (CMH) methods have been proposed, yielding remarkable progress. As a static learning paradigm, existing CMH methods often implicitly assume that all modalities are prepared before processing. However, in practical applications (such as multi-modal medical diagnosis), it is very challenging to collect paired multi-modal data simultaneously. Specifically, they are collected chronologically, forming streaming-media data (SMA). To handle this, all previous CMH methods require retraining on data from all modalities, which inevitably limits the scalability and flexibility of the model. In this paper, we propose a novel CMH paradigm named Streaming-media Hashing rEtrieval (SHE) that enables parallel training of each modality. Specifically, we first propose a knowledge library mining module (KLM) that extracts a prototype knowledge library for each modality, thereby revealing the commonality distribution of the instances from each modality. Then, we propose a knowledge library transfer module (KLT) that updates and aligns the new knowledge by utilizing the historical knowledge library, ensuring semantic consistency. Finally, to enhance intra-class semantic relevance and inter-class semantic disparity, we develop a discriminative hashing learning module (DHL). Comprehensive experiments on four benchmark datasets demonstrate the superiority of our SHE compared to 14 competitors.
Paperid:2323
Authors:Runsheng Bai · Bo Liu · qiang liu
Abstract:
Large Language Models (LLMs) exhibit impressive performance across various tasks, but deploying them for inference poses challenges. Their high resource demands often necessitate complex, costly multi-GPU pipelines, or the use of smaller, less capable models. While quantization offers a promising solution utilizing lower precision for model storage, existing methods frequently experience significant performance drops at lower precision levels. Additionally, they typically provide only a limited set of solutions at specific bit levels, many of which are extensively manually tuned. To address these challenges, we propose a new method called \textbf{SKIM}: Scaled K-means clustering wIth Mixed precision. Our approach introduces two novel techniques: 1. A \textit{greedy algorithm} to solve approximately optimal bit allocation across weight channels, and 2. A \textit{trainable scaling vector} for non-differentiable K-means clustering. These techniques substantially improve the model performance and can be adapted to any given bit. Notably, in terms of perplexity, our method narrows the gap between quantized LLaMA models and their full precision counterparts by around \textbf{14\%} on average.
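The building block SKIM starts from, per-channel K-means weight quantization, can be illustrated with a toy NumPy sketch. The greedy bit allocation and the trainable scaling vector are omitted here, and the shapes and bit width are arbitrary assumptions:

```python
import numpy as np

def kmeans_1d(w, k, iters=20, seed=0):
    """Plain 1-D k-means: returns centroids and per-weight codes."""
    rng = np.random.default_rng(seed)
    centroids = rng.choice(w, size=k, replace=False)
    for _ in range(iters):
        # assign each weight to its nearest centroid
        codes = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            if np.any(codes == c):
                centroids[c] = w[codes == c].mean()
    return centroids, codes

def quantize_channel(w, bits):
    """Replace each weight with its nearest centroid (2**bits levels)."""
    centroids, codes = kmeans_1d(w, 2 ** bits)
    return centroids[codes]

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 64))                       # toy weight matrix
W_q = np.stack([quantize_channel(row, bits=3) for row in W])
print(W_q.shape)
```

Only the codebook (8 centroids per channel at 3 bits) and the integer codes need to be stored, which is where the memory savings of clustering-based quantization come from.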
Paperid:2324
Authors:Zhiwei Tang · Jiangweizhi Peng · Jiasheng Tang · Mingyi Hong · Fan Wang · Tsung-Hui Chang
Abstract:
In this work, we focus on the alignment problem of diffusion models with a continuous reward function, which represents specific objectives for downstream tasks, such as increasing darkness or improving the aesthetics of images. The central goal of the alignment problem is to adjust the distribution learned by diffusion models such that the generated samples maximize the target reward function. We propose a novel alignment approach, named Direct Noise Optimization (DNO), that optimizes the injected noise during the sampling process of diffusion models. By design, DNO operates at inference time, and thus is tuning-free and prompt-agnostic, with the alignment occurring in an online fashion during generation. We rigorously study the theoretical properties of DNO and also propose variants to deal with non-differentiable reward functions. Furthermore, we identify that naive implementation of DNO occasionally suffers from the out-of-distribution reward hacking problem, where optimized samples have high rewards but are no longer in the support of the pretrained distribution. To remedy this issue, we leverage classical high-dimensional statistics theory to derive an effective probability regularization technique. We conduct extensive experiments on several important reward functions and demonstrate that the proposed DNO approach can achieve state-of-the-art reward scores within a reasonable time budget for generation.
Paperid:2325
Authors:raghav singhal · Zachary Horvitz · Ryan Teehan · Mengye Ren · Zhou Yu · Kathleen McKeown · Rajesh Ranganath
Abstract:
Diffusion models are a flexible class of generative models. However, generating samples with user-specified properties remains a challenge. One solution is to fine-tune models to maximize reward functions which capture desired properties. However, fine-tuning can be expensive. In this work, we propose Feynman-Kac (FK) steering, a framework for inference-time steering of diffusion models with reward functions. FK steering works by generating multiple trajectories, called particles, and resampling particles at intermediate steps based on scores computed using functions called potentials. Potentials are defined using rewards for intermediate states and are chosen such that a high score indicates the particle will yield a high-reward sample. We explore various choices of potentials, rewards, and samplers. Steering text-to-image models with a human preference reward, we find that FK steering outperforms fine-tuned models with just 2 particles. Moreover, FK steering a 0.8B parameter model outperforms a 2.6B model, achieving state-of-the-art performance on prompt fidelity. We also steer text diffusion models with rewards for text quality and rare attributes such as toxicity, and find that FK steering generates lower perplexity text and enables gradient-free control. Overall, inference-time scaling and steering of diffusion models – even training-free – provides significant quality and controllability benefits.
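The resampling step at the heart of FK steering can be shown schematically. The "diffusion" below is just a stand-in random walk, and the Gaussian potential around a scalar target is a toy assumption, not one of the paper's potentials:

```python
import numpy as np

def fk_resample(particles, potential, rng):
    """Resample particles in proportion to their potential scores."""
    w = potential(particles)
    w = w / w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]

rng = np.random.default_rng(0)
target = 3.0
particles = rng.normal(size=256)                  # initial noisy states
for _ in range(10):                               # toy "denoising" steps
    particles = particles + 0.1 * rng.normal(size=particles.size)
    # potential: higher score for particles closer to the target
    particles = fk_resample(
        particles, lambda x: np.exp(-(x - target) ** 2), rng)
print(round(float(particles.mean()), 2))
```

Even with this crude setup, repeated reweighting concentrates the particle population in the high-reward region without any gradient information, which is the gradient-free control the abstract refers to.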
Paperid:2326
Authors:Parikshit Pareek · Abhijith Jayakumar · Kaarthik Sundar · Deepjyoti Deka · Sidhant Misra
Abstract:
Constrained optimization problems arise in various engineering systems such as inventory management and power grids. Standard deep neural network (DNN) based machine learning proxies are ineffective in practical settings where labeled data is scarce and training times are limited. We propose a semi-supervised Bayesian Neural Network (BNN) based optimization proxy for this complex regime, wherein training proceeds in a sandwiched fashion, alternating between a supervised learning step for minimizing cost and an unsupervised learning step for enforcing constraint feasibility. We show that the proposed semi-supervised BNN outperforms DNN architectures on important non-convex constrained optimization problems from energy network operations, achieving up to a tenfold reduction in expected maximum equality gap and halving the inequality gaps. Further, the BNN's ability to provide posterior samples is leveraged to construct practically meaningful probabilistic confidence bounds on performance using limited validation data, unlike prior methods.
Paperid:2327
Authors:Siddharth Gollapudi · Ravishankar Krishnaswamy · Kirankumar Shiragur · Harsh Wardhan
Abstract:
Graph-based data structures have become powerful and ubiquitous tools for scalable approximate nearest-neighbor (ANN) search over the past decade. In spite of their apparent practical performance, there has only recently been progress on the **worst-case** performance of these data structures. Indeed, the influential work of \citeauthor{IndykXu} introduced the key concept of $\alpha$-reachable graphs, showing that graphs constructed by the DiskANN algorithm \cite{DiskANN19} produce an $\left(\frac{\alpha+1}{\alpha-1}\right)$-approximate solution with a simple best-first search that runs in poly-logarithmic query time. In our work, we improve and generalize this analysis as follows: (1) we introduce **sorted** $\alpha$-reachable graphs, and use this notion to obtain a stronger approximation factor of $\frac{\alpha}{\alpha-1}$ for the DiskANN algorithm on Euclidean metrics; (2) we present the **first** worst-case theoretical analysis for the popular **beam-search** algorithm, which is used in practice to search these graphs for $k > 1$ candidate nearest neighbors. We also present empirical results validating the significance of sorted $\alpha$-reachable graphs, which align with our theoretical findings.
Paperid:2328
Authors:Yanbo Wang · Justin Dauwels · Yilun Du
Abstract:
Generative models have demonstrated remarkable abilities in generating high-fidelity visual content. In this work, we explore how generative models can further be used not only to synthesize visual content but also to understand the properties of a scene given a natural image. We formulate scene understanding as an inverse generative modeling problem, where we seek to find conditional parameters of a visual generative model that best fit a given natural image. To enable this procedure to infer scene structure from images substantially different from those seen during training, we further propose to build this visual generative model compositionally from smaller models over pieces of a scene. We illustrate how this procedure enables us to infer the set of objects in a scene, enabling robust generalization to new test scenes with an increased number of objects of new shapes. We further illustrate how this enables us to infer global scene factors, likewise enabling robust generalization to new scenes. Finally, we illustrate how this approach can be directly applied to existing pretrained text-to-image generative models for zero-shot multi-object perception.
Paperid:2329
Authors:Da Kuang · GuanWen Qiu · Junhyong Kim
Abstract:
How a single fertilized cell gives rise to a complex array of specialized cell types in development is a central question in biology. The cells replicate, generating cell lineages, and acquire differentiated characteristics through poorly understood molecular processes. A key approach to studying developmental processes is to infer the tree graph of cell-lineage histories, providing an analytical framework for dissecting individual cells' molecular decisions during replication and differentiation (i.e., acquisition of specialized traits). Although genetically engineered lineage-tracing methods have advanced the field, they are either infeasible or ethically constrained in many organisms. By contrast, modern single-cell technologies can measure high-content molecular profiles (\textit{e.g.}, transcriptomes) in a wide range of biological systems. Here, we introduce \emph{CellTreeQM}, a novel deep learning method based on transformer architectures that learns an embedding space with geometric properties optimized for tree-graph inference. By formulating the lineage reconstruction problem as tree-metric learning, we systematically explore supervised, weakly supervised, and unsupervised training settings and present the \emph{Cell Lineage Reconstruction Benchmark} to facilitate comprehensive evaluation. This benchmark includes (1) synthetic data modeled via Brownian motion with independent noise and spurious signals, and (2) lineage-resolved single-cell RNA sequencing datasets. Experimental results show that \emph{CellTreeQM} recovers lineage structures with minimal supervision and limited data, offering a scalable framework for uncovering cell lineage relationships. To our knowledge, this is the first method to cast cell lineage inference explicitly as a metric learning task, paving the way for future computational models aimed at uncovering the molecular dynamics of cell lineage.
Paperid:2330
Authors:Adibvafa Fallahpour · Jun Ma · Alif Munim · Hongwei Lyu · BO WANG
Abstract:
As a cornerstone of diagnostic imaging, chest X-rays (CXRs) play an integral role in driving critical decisions in disease management and patient care. While recent innovations have led to specialized models for various CXR interpretation tasks, these solutions often operate in isolation, limiting their practical utility in clinical practice. We present MedRAX, the first versatile AI agent that seamlessly integrates state-of-the-art CXR analysis tools and multimodal large language models into a unified framework. MedRAX dynamically leverages these models to address complex medical queries without requiring additional training. To rigorously evaluate its capabilities, we introduce ChestAgentBench, a comprehensive benchmark containing 2,500 complex medical queries across 7 diverse categories. Our experiments demonstrate that MedRAX achieves state-of-the-art performance compared to both open-source and proprietary models, representing a significant step toward the practical deployment of automated CXR interpretation systems. Data and code will be publicly available at https://medrax25.github.io.
Paperid:2331
Authors:Hao Dai · Jagmohan Chauhan
Abstract:
Continual Generalized Category Discovery (C-GCD) faces a critical challenge: incrementally learning new classes from unlabeled data streams while preserving knowledge of old classes. Existing methods struggle with catastrophic forgetting, especially when unlabeled data mixes known and novel categories. We address this by analyzing C-GCD's forgetting dynamics through a Bayesian lens, revealing that covariance misalignment between old and new classes drives performance degradation. Building on this insight, we propose Variational Bayes C-GCD (VB-CGCD), a novel framework that integrates variational inference with covariance-aware nearest-class-mean classification. VB-CGCD adaptively aligns class distributions while suppressing pseudo-label noise via stochastic variational updates. Experiments show VB-CGCD surpasses prior art by +15.21% in overall accuracy in the final session on standard benchmarks. We also introduce a new challenging benchmark with only 10% labeled data and extended online phases, on which VB-CGCD achieves a 67.86% final accuracy, significantly higher than the state of the art (38.55%), demonstrating its robust applicability across diverse scenarios.
Paperid:2332
Authors:Dheeraj Baby · Yifei Tang · Hieu Nguyen · Yu-Xiang Wang · Rohit Pyati
Abstract:
In this paper, we study the problem of estimation and learning under temporal distribution shift. Consider an observation sequence of length $n$, which is a noisy realization of a time-varying ground-truth sequence. Our focus is to develop methods to estimate the ground-truth at the final time-step while providing sharp point-wise estimation error rates. We show that, *without prior knowledge* of the level of temporal shift, a wavelet soft-thresholding estimator provides an *optimal* estimation error bound for the ground-truth. Our proposed estimation method generalizes existing research (Mazetto and Upfal, 2023) by establishing a connection between the sequence's non-stationarity level and the sparsity in the wavelet-transformed domain. Our theoretical findings are validated by numerical experiments. Additionally, we apply the estimator to derive sparsity-aware excess risk bounds for binary classification under distribution shift and to develop computationally efficient training objectives. As a final contribution, we draw parallels between our results and the classical signal processing problem of total-variation denoising (Mammen and van de Geer, 1997; Tibshirani, 2014), uncovering *novel optimal* algorithms for this task.
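The wavelet soft-thresholding idea can be illustrated with a single-level Haar transform: when the ground truth drifts sparsely, the detail coefficients are mostly noise, so shrinking them toward zero denoises the sequence. This is a minimal numpy sketch; the paper's estimator, threshold choice, and guarantees are more refined.

```python
import numpy as np

def haar(x):
    """Single-level Haar transform of an even-length sequence."""
    a = (x[0::2] + x[1::2]) / np.sqrt(2)   # approximation coefficients
    d = (x[0::2] - x[1::2]) / np.sqrt(2)   # detail coefficients
    return a, d

def ihaar(a, d):
    """Inverse single-level Haar transform."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2)
    x[1::2] = (a - d) / np.sqrt(2)
    return x

def soft(c, lam):
    return np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)

def denoise(y, lam):
    a, d = haar(y)
    return ihaar(a, soft(d, lam))   # keep the trend, shrink detail noise

rng = np.random.default_rng(1)
truth = np.repeat([0.0, 0.0, 5.0, 5.0], 32)   # piecewise-constant temporal shift
y = truth + rng.normal(0, 1, truth.size)
est = denoise(y, lam=1.0)
```

Practical estimators apply multi-level transforms with level-dependent thresholds; one level already shows the sparsity-exploiting mechanism.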
Paperid:2333
Authors:Francesco Tonin · Alex Lambert · Johan Suykens · Volkan Cevher
Abstract:
Fairness of decision-making algorithms is an increasingly important issue. In this paper, we focus on spectral clustering with group fairness constraints, where every demographic group is represented in each cluster proportionally as in the general population. We present a new efficient method for fair spectral clustering (Fair SC) by casting the Fair SC problem within the difference of convex functions (DC) framework. To this end, we introduce a novel variable augmentation strategy and employ an alternating direction method of multipliers type of algorithm adapted to DC problems. We show that each associated subproblem can be solved efficiently, resulting in higher computational efficiency compared to prior work, which required a computationally expensive eigendecomposition. Numerical experiments demonstrate the effectiveness of our approach on both synthetic and real-world benchmarks, showing significant speedups in computation time over prior art, especially as the problem size grows. This work thus represents a considerable step towards the adoption of fair clustering in real-world applications.
Paperid:2334
Authors:Sucheng Ren · Qihang Yu · Ju He · Xiaohui Shen · Alan Yuille · Liang-Chieh Chen
Abstract:
Autoregressive (AR) modeling has achieved remarkable success in natural language processing by enabling models to generate text with coherence and contextual understanding through next token prediction. Recently, in image generation, VAR proposes scale-wise autoregressive modeling, which extends next token prediction to next scale prediction, preserving the 2D structure of images. However, VAR encounters two primary challenges: (1) its complex and rigid scale design limits generalization in next scale prediction, and (2) the generator's dependence on a discrete tokenizer with the same complex scale structure restricts modularity and flexibility in updating the tokenizer. To address these limitations, we introduce FlowAR, a general next scale prediction method featuring a streamlined scale design, where each subsequent scale is simply double the previous one. This eliminates the need for VAR's intricate multi-scale residual tokenizer and enables the use of any off-the-shelf Variational AutoEncoder (VAE). Our simplified design enhances generalization in next scale prediction and facilitates the integration of Flow Matching for high-quality image synthesis. We validate the effectiveness of FlowAR on the challenging ImageNet-256 benchmark, demonstrating superior generation performance compared to previous methods.
Paperid:2335
Authors:Sivan Sabato · Eran Treister · Elad Yom-Tov
Abstract:
We propose methods for auditing multiclass classifiers for fairness under multiclass equalized odds, by estimating the deviation from equalized odds when the classifier is not completely fair. We generalize to multiclass classifiers the measure of Disparate Conditional Prediction (DCP), originally suggested by Sabato & Yom-Tov (2020) for binary classifiers. DCP is defined as the fraction of the population for which the classifier predicts with conditional prediction probabilities that differ from the closest common baseline. We provide new local-optimization methods for estimating the multiclass DCP under two different regimes, one in which the conditional confusion matrices for each protected sub-population are known, and one in which these cannot be estimated, for instance, because the classifier is inaccessible or because good-quality individual-level data is not available. These methods can be used to detect classifiers that likely treat a significant fraction of the population unfairly. Experiments demonstrate the accuracy of the methods. The code for the experiments is provided as supplementary material.
Paperid:2336
Authors:Rihong Qiu · Xinke Jiang · Yuchen Fang · Hongbin Lai · Hao Miao · Xu Chu · Junfeng Zhao · Yasha Wang
Abstract:
With the success of deep learning, Graph Neural Networks (GNNs) have become indispensable for modeling intricate graph structures. However, pretrained GNNs are difficult to apply directly to diverse downstream tasks without finetuning-based continual learning, which incurs high computational costs, hindering the development of the upcoming Large Graph Models (LGMs). In this paper, we study efficient and universal Graph Continual Learning (GCL) on various downstream tasks via dataset distillation, achieved by a novel Lightweight Graph Neural Tangent Kernel framework (LIGHTGNTK). Specifically, LIGHTGNTK utilizes the low-rank approximation of the Laplacian matrix in GNTK to efficiently capture both structure and feature relationships, enabling gradient-based dataset distillation. Additionally, benefiting from the proposed unified subgraph anchoring strategy, LIGHTGNTK can support not only graph-level but also node- and edge-level downstream tasks under different input structures. Extensive experiments on multiple datasets show that LIGHTGNTK achieves state-of-the-art performance in GCL scenarios, facilitating adaptive and scalable LGMs. The associated code and all datasets can be found at https://anonymous.4open.science/r/LightGNTK.
Paperid:2337
Authors:Yu Wang · Shiwei Chen
Abstract:
Weakly supervised video anomaly detection (WS-VAD) is tasked with pinpointing temporal intervals containing anomalous events within untrimmed videos, utilizing only video-level annotations. However, a significant challenge arises due to the absence of dense frame-level annotations, often leading to incomplete localization in existing WS-VAD methods. To address this issue, we present a novel method, LEC-VAD (Learning Event Completeness for Weakly Supervised Video Anomaly Detection), which features a dual structure designed to encode both category-aware and category-agnostic semantics between vision and language. Within LEC-VAD, we devise semantic regularities that leverage an anomaly-aware Gaussian mixture to learn precise event boundaries, thereby yielding more complete event instances. Besides, we develop a novel memory bank-based prototype learning mechanism to enrich the concise text descriptions associated with anomaly-event categories. This innovation bolsters the text's expressiveness, which is crucial for advancing WS-VAD. Our LEC-VAD demonstrates remarkable advancements over the current state-of-the-art methods on two benchmark datasets, XD-Violence and UCF-Crime.
Paperid:2338
Authors:Hung-Chieh Fang · Po-Yi Lu · Hsuan-Tien (Tien) Lin
Abstract:
Universal Domain Adaptation (UniDA) addresses unsupervised domain adaptation where target classes may differ arbitrarily from source ones, except for a shared subset. An important approach, partial domain matching (PDM), aligns only shared classes but struggles in extreme cases where many source classes are absent in the target domain, underperforming the most naive baseline that trains on only source data. In this work, we identify that the failure of PDM for extreme UniDA stems from dimensional collapse (DC) in target representations. To address target DC, we propose to jointly leverage the alignment and uniformity techniques in modern self-supervised learning (SSL) on the unlabeled target data to preserve the intrinsic structure of the learned representations. Our experimental results confirm that SSL consistently advances PDM and delivers new state-of-the-art results across a broader benchmark of UniDA scenarios with different portions of shared classes, representing a crucial step toward truly comprehensive UniDA.
Paperid:2339
Authors:xiaokun Feng · Dailing Zhang · Shiyu Hu · Xuchen Li · Meiqi Wu · Jing Zhang · Xiaotang Chen · Kaiqi Huang
Abstract:
Effectively modeling and utilizing spatiotemporal features from RGB and other modalities (e.g., depth, thermal, and event data, denoted as X) is the core of RGB-X tracker design. Existing methods often employ two parallel branches to separately process the RGB and X input streams, requiring the model to simultaneously handle two dispersed feature spaces, which complicates both the model structure and the computation process. More critically, intra-modality spatial modeling within each dispersed space incurs substantial computational overhead, limiting resources for inter-modality spatial modeling and temporal modeling. To address this, we propose a novel tracker, CSTrack, which focuses on modeling Compact Spatiotemporal features to achieve simple yet effective tracking. Specifically, we first introduce an innovative Spatial Compact Module that integrates the RGB-X dual input streams into a compact spatial feature, enabling thorough intra- and inter-modality spatial modeling. Additionally, we design an efficient Temporal Compact Module that compactly represents temporal features by constructing a refined target distribution heatmap. Extensive experiments validate the effectiveness of our compact spatiotemporal modeling method, with CSTrack achieving new SOTA results on mainstream RGB-X benchmarks.
Paperid:2340
Authors:Ibrahim Fayad · Max Zimmer · Martin Schwartz · Philippe CIAIS · Fabian Gieseke · Aurélien de Truchis · Sarah Brood · Gabriel Belouze · Alexandre d'Aspremont
Abstract:
In recent years, significant efforts have been directed towards adapting self-supervised multimodal learning for Earth observation. However, existing methods produce coarse patch-sized embeddings, limiting their effectiveness and integration with other modalities like LiDAR. To close this gap, we present DUNIA, an approach to learn pixel-sized embeddings through cross-modal alignment between images and full-waveform LiDAR data. As the model is trained in a contrastive manner, the embeddings can be directly leveraged in the context of a variety of environmental monitoring tasks in a zero-shot setting. In our experimental evaluation, we demonstrate the effectiveness of the embeddings for seven such tasks (canopy height mapping, fractional canopy cover, land cover mapping, tree species identification, plant area index, crop type classification, and per-pixel waveform-based vertical structure mapping). The results show that the embeddings along with zero-shot classifiers often outperform specialized supervised models both quantitatively and qualitatively, even in low-labeled data regimes. In the fine-tuning setting, we show strong low-shot capabilities with performance near or better than the state of the art on five out of six tasks.
Paperid:2341
Authors:Dongliang Guo · Mengxuan Hu · Zihan Guan · Thomas Hartvigsen · Sheng Li
Abstract:
Large multimodal models inevitably decay over time as facts change and previously learned information becomes outdated. Traditional approaches such as fine-tuning are often impractical for updating these models due to their size and complexity. Instead, direct knowledge editing within the models presents a more viable solution. Current model editing techniques, however, typically overlook the unique influence ranges of different facts, leading to compromised model performance in terms of both generality and locality. To address this issue, we introduce the concept of the generality-locality trade-off in multi-modal model editing. We develop a new model editing dataset named OKEDIT, specifically designed to effectively evaluate this trade-off. Building on this foundation, we propose \textbf{BalancEdit}, a novel method for balanced model editing that dynamically achieves an optimal balance between generality and locality. BalancEdit utilizes a unique mechanism that generates both positive and negative samples for each fact to accurately determine its influence scope and incorporates these insights into the model's latent space using a discrete, localized codebook of edits, without modifying the underlying model weights. To our knowledge, this is the first approach explicitly addressing the generality-locality trade-off in multi-modal model editing. Our comprehensive results confirm the effectiveness of BalancEdit, demonstrating minimal trade-offs while maintaining robust editing capabilities. Our code and dataset will be available.
Paperid:2342
Authors:Kunjie Ren · Luo Luo
Abstract:
This paper considers zeroth-order optimization for stochastic convex minimization problems. We propose a parameter-free stochastic zeroth-order method (POEM) by introducing a step-size scheme based on the distance over finite differences and an adaptive smoothing parameter. We provide a theoretical analysis showing that POEM achieves near-optimal stochastic zeroth-order oracle complexity. We further conduct numerical experiments demonstrating that POEM outperforms existing zeroth-order methods in practice.
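As background, a basic single-sample zeroth-order gradient estimator with a smoothing parameter can be sketched as below. This is illustrative only: POEM's parameter-free step-size scheme and adaptive smoothing are the paper's contribution and are not reproduced here.

```python
import numpy as np

def zo_grad(f, x, rng, mu=1e-4):
    """Single-sample zeroth-order gradient estimate via a random direction u:
    g = (f(x + mu*u) - f(x)) / mu * u, an unbiased estimate of grad f as mu -> 0."""
    u = rng.normal(size=x.shape)
    return (f(x + mu * u) - f(x)) / mu * u

def zo_minimize(f, x0, lr=0.05, steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(steps):
        x -= lr * zo_grad(f, x, rng)   # fixed step size (unlike POEM's scheme)
    return x

f = lambda x: np.sum((x - 1.0) ** 2)   # convex test objective
x_star = zo_minimize(f, np.zeros(3))
```

Only function evaluations are used, never gradients; the choice of `lr` and `mu` is exactly what parameter-free methods aim to automate.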
Paperid:2343
Authors:Xiancheng Sun · Senmao Ma · Shengxi Li · Mai Xu · Jingyuan Xia · Lai Jiang · Xin Deng · Jiali Wang
Abstract:
Panoramic image outpainting plays a pivotal role in immersive content generation, allowing for seamless restoration and completion of panoramic content. Given that the majority of generative outpainting solutions operate on planar images, existing panoramic methods address the spherical nature only by soft regularisation during end-to-end learning, which still fails to fully exploit the spherical content. In this paper, we make the first attempt to impose the spherical nature in the design of the diffusion model, such that the panoramic format is intrinsically ensured during the learning procedure, termed the spherical-nested diffusion (SpND) model. This is achieved by employing spherical noise in the diffusion process to address the structural prior, together with a newly proposed spherical deformable convolution (SDC) module to intrinsically learn the panoramic knowledge. Upon this, the proposed SpND method is effectively integrated into a pre-trained diffusion model, outperforming existing state-of-the-art methods for panoramic image outpainting. Noticeably, our SpND method reduces the FID values by more than 50\% against the state-of-the-art PanoDiffusion method. Our code will be made publicly available upon acceptance.
Paperid:2344
Authors:Jinyang Liu · Tessa Steensgaard · Marvin N. Wright · Niklas Pfister · Munir Hiabu
Abstract:
Many existing interpretation methods are based on Partial Dependence (PD) functions that, for a pretrained machine learning model, capture how a subset of the features affects the predictions by averaging over the remaining features. Notable methods include Shapley additive explanations (SHAP), which computes feature contributions based on a game theoretical interpretation, and PD plots (i.e., 1-dim PD functions), which capture average marginal main effects. Recent work has connected these approaches using a functional decomposition and argues that SHAP values can be misleading since they merge main and interaction effects into a single local effect. However, a major advantage of SHAP compared to other PD-based interpretations has been the availability of fast estimation techniques, such as TreeSHAP. In this paper, we propose a new tree-based estimator, FastPD, which efficiently estimates arbitrary PD functions. We show that FastPD consistently estimates the desired population quantity -- in contrast to path-dependent TreeSHAP, which is inconsistent when features are correlated. For moderately deep trees, FastPD improves the complexity of existing methods from quadratic to linear in the number of observations. By estimating PD functions for arbitrary feature subsets, FastPD can be used to extract PD-based interpretations such as SHAP, PD plots and higher-order interaction effects.
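For reference, the 1-dim PD function that such estimators target can be computed by brute force: fix feature j at a grid value and average the model's predictions over the empirical distribution of the remaining features. Tree-based estimators like FastPD avoid this quadratic cost; the toy model and names below are illustrative.

```python
import numpy as np

def pd_curve(model, X, j, grid):
    """Brute-force 1-dim partial dependence of feature j on a value grid."""
    curve = []
    for v in grid:
        Xv = X.copy()
        Xv[:, j] = v                    # intervene on feature j
        curve.append(model(Xv).mean())  # average over remaining features
    return np.array(curve)

# toy additive model: the PD of feature 0 recovers its main effect (slope 2)
model = lambda X: 2.0 * X[:, 0] + X[:, 1]
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
grid = np.array([-1.0, 0.0, 1.0])
curve = pd_curve(model, X, 0, grid)
```

For an additive model the curve is exactly linear in the grid value, which makes brute-force PD a useful correctness check for faster estimators.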
Paperid:2345
Authors:Marcel Hedman · Desi Ivanova · Cong Guan · Tom Rainforth
Abstract:
We develop a semi-amortized, policy-based approach to Bayesian experimental design (BED) called Stepwise Deep Adaptive Design (Step-DAD). Like existing fully amortized, policy-based BED approaches, Step-DAD trains a design policy upfront before the experiment. However, rather than keeping this policy fixed, Step-DAD periodically updates it as data is gathered, refining it to the particular experimental instance. This test-time adaptation improves both the flexibility and the robustness of the design strategy compared with existing approaches. Empirically, Step-DAD consistently demonstrates superior decision-making and robustness compared with current state-of-the-art BED methods.
Paperid:2346
Authors:Hanshen Xiao · Zhen Yang · G. Edward Suh
Abstract:
This paper studies a range of AI/ML trust concepts, including memorization, data poisoning, and copyright, which can be modeled as constraints on the influence of data on a trained model, characterized by the outcome difference from a processing function (training algorithm). In this realm, we show that provable trustworthy guarantees can be efficiently provided through a new framework termed Data-Specific Indistinguishability (DSI), a relaxation of the classic Input-Independent Indistinguishability (III), which selects trust-preserving randomization tightly aligned with target outcome differences. Technically, we establish both the theoretical and algorithmic foundations of DSI with the optimal high-dimensional Gaussian mechanism. We further show its applications to develop trustworthy deep learning. Experimental results on memorization mitigation for large language models and on backdoor defense show both the efficiency and effectiveness of the DSI noise mechanism. We release our code at an anonymous GitHub link: https://anonymous.4open.science/r/DT-DSI-D5E8.
Paperid:2347
Authors:Kian Ming Chai · Edwin V. Bonilla
Abstract:
We introduce a novel one-parameter variational objective that lower bounds the data evidence and enables the estimation of approximate fractional posteriors. We extend this framework to hierarchical and Bayes posteriors, offering a versatile tool for probabilistic modelling. We demonstrate two cases where gradients can be obtained analytically, and a simulation study on mixture models shows that our fractional posteriors can be used to achieve better calibration compared to posteriors from the conventional variational bound. When applied to variational autoencoders (VAEs), our approach attains tighter evidence bounds and enables learning of high-performing approximate Bayes posteriors jointly with fractional posteriors. We show that VAEs trained with fractional posteriors produce decoders that are better aligned for generation from the prior.
Paperid:2348
Authors:yueheng li · Guangming Xie · Zongqing Lu
Abstract:
Cooperative Multi-Agent Reinforcement Learning (MARL) has become a critical tool for addressing complex real-world problems. However, off-policy MARL methods, which rely on joint Q-functions, face significant scalability challenges due to the exponentially growing joint action space. In this work, we highlight a critical yet often overlooked issue: erroneous Q-target estimation, primarily caused by extrapolation error. Our analysis reveals that this error becomes increasingly severe as the number of agents grows, leading to unique challenges in MARL due to its expansive joint action space and the decentralized execution paradigm. To address these challenges, we propose a suite of techniques tailored for off-policy MARL, including annealed multi-step bootstrapping, averaged Q-targets, and restricted action representation. Experimental results demonstrate that these methods effectively mitigate erroneous estimations, yielding substantial performance improvements in challenging benchmarks such as SMAC, SMACv2, and Google Research Football.
Paperid:2349
Authors:Yunpeng Qu · Kun Yuan · Jinhua Hao · Kai Zhao · Qizhi Xie · Ming Sun · Chao Zhou
Abstract:
Image Super-Resolution (ISR) has seen significant progress with the introduction of remarkable generative models. However, challenges such as the trade-off between fidelity and realism, as well as computational complexity, have also posed limitations on their application. Building upon the tremendous success of autoregressive models in the language domain, we propose \textbf{VARSR}, a novel visual autoregressive modeling framework for ISR with the form of next-scale prediction. To effectively integrate and preserve semantic information in low-resolution images, we propose using prefix tokens to incorporate the condition. Scale-aligned Rotary Positional Encodings are introduced to capture spatial structures, and a diffusion refiner is utilized for modeling the quantization residual loss to achieve pixel-level fidelity. Image-based Classifier-free Guidance is proposed to guide the generation of more realistic images. Furthermore, we collect large-scale data and design a training process to obtain robust generative priors. Quantitative and qualitative results show that VARSR is capable of generating high-fidelity and high-realism images with more efficiency than diffusion-based methods. Our codes and models will be publicly available.
Paperid:2350
Authors:Niccolò Ajroldi · Antonio Orvieto · Jonas Geiping
Abstract:
Averaging checkpoints along the training trajectory is a simple yet powerful approach to improve the generalization performance of Machine Learning models and reduce training time. Motivated by these potential gains, and in an effort to fairly and thoroughly compare these techniques, we present an extensive evaluation of averaging techniques using AlgoPerf \citep{dahlbenchmarking2023}, a large-scale Deep Learning benchmark for optimization algorithms. We investigate whether weight averaging can reduce training time, improve generalization, and replace learning rate decay. Our evaluation across seven architectures and datasets reveals that averaging significantly accelerates training and yields considerable efficiency gains, at the price of a minimal implementation and memory cost, while improving generalization across all considered workloads. Finally, we explore the relationship between averaging and learning rate annealing and show how to optimally combine the two to achieve better performance.
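A uniform running average of checkpoints, the simplest flavor of weight averaging, can be sketched as follows. This is illustrative only, in the spirit of Polyak/SWA-style averaging; the evaluated techniques also include EMA and schedule-aware variants.

```python
import numpy as np

class RunningAverage:
    """Uniform running average of flattened model weights, updated
    incrementally so no checkpoint history needs to be stored."""

    def __init__(self):
        self.avg, self.n = None, 0

    def update(self, weights):
        self.n += 1
        if self.avg is None:
            self.avg = weights.copy()
        else:
            # incremental mean: avg += (w - avg) / n
            self.avg += (weights - self.avg) / self.n

# noisy "checkpoints" oscillating around an optimum at 1.0
rng = np.random.default_rng(0)
ra = RunningAverage()
for t in range(100):
    ra.update(np.array([1.0]) + rng.normal(0, 0.3, size=1))
```

The averaged weights sit much closer to the optimum than any single noisy checkpoint, which is the intuition behind the generalization gains.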
Paperid:2351
Authors:Utkarsh Singhal · Ryan Feng · Stella Yu · Atul Prakash
Abstract:
Real-world perception requires invariance to transformations like viewpoint shifts and lighting changes. While data augmentation and equivariant networks address known transformations during training, they struggle with novel out-of-distribution scenarios. We propose FoCAL, a zero-shot framework that canonicalizes inputs at test time using foundation model priors. FoCAL generates candidate transformations (e.g., 3D novel views) and selects canonical versions by optimizing CLIP and Stable Diffusion energies, requiring no training or architectural changes. Experiments show FoCAL improves robustness for CLIP and SAM across 2D/3D rotations, illumination shifts (contrast and color), and day-night changes in a zero-shot manner. We further demonstrate applications in active vision. Our work challenges the common assumption that invariance requires transform-specific training or architectural compromises.
Paperid:2352
Authors:Yuanjie Shi · Hooman Shahrokhi · Xuesong Jia · Xiongzhi Chen · Jana Doppa · Yan Yan
Abstract:
Conformal prediction (CP) is a promising uncertainty quantification framework which works as a wrapper around a black-box classifier to construct prediction sets (i.e., subsets of candidate classes) with provable guarantees. However, standard calibration methods for CP tend to produce large prediction sets, which makes them less useful in practice. This paper considers the problem of integrating conformal principles into the training process of deep classifiers to directly minimize the size of prediction sets. We formulate conformal training as a bilevel optimization problem and propose the {\em Direct Prediction Set Minimization (DPSM)} algorithm to solve it. The key insight behind DPSM is to minimize a measure of the prediction set size (upper level) that is conditioned on the learned quantile of conformity scores (lower level). We show that DPSM has a learning bound of $O(1/\sqrt{n})$ (with $n$ training samples), while prior conformal training methods based on stochastic approximation of the quantile have a bound of $\Omega(1/s)$ (with batch size $s$, and typically $s \ll \sqrt{n}$). Experiments on various benchmark datasets and deep models show that DPSM significantly outperforms the best prior conformal training baseline, reducing the prediction set size by $20.46\%$ and validating our theory.
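For context, the standard CP wrapper that such training methods build on can be sketched in a few lines. This is the generic split-conformal recipe (with, e.g., one minus the softmax probability of a class as its nonconformity score), not DPSM itself; the score convention (higher = less conforming) is an assumption of this sketch:

```python
import math

def conformal_prediction_set(cal_scores, test_scores, alpha=0.1):
    """Split conformal prediction around a black-box classifier.

    cal_scores: nonconformity scores of the *true* class on calibration data
    (higher = less conforming). test_scores: dict mapping each candidate class
    to its nonconformity score for one test input. Returns a set of classes
    that covers the true class with probability >= 1 - alpha.
    """
    n = len(cal_scores)
    # Finite-sample-corrected (1 - alpha) quantile rank of calibration scores.
    k = math.ceil((n + 1) * (1 - alpha))
    threshold = sorted(cal_scores)[min(k, n) - 1]
    return {c for c, s in test_scores.items() if s <= threshold}
```

Conformal training methods like the one above shrink these sets by shaping the classifier's scores during training rather than only calibrating afterwards.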
Paperid:2353
Authors:Yangyang Shen · Xiao Tan · Dian Shen · Meng Wang · Beilun Wang
Abstract:
Backdoor attacks seriously threaten deep neural networks (DNNs) by embedding concealed vulnerabilities through data poisoning. To counteract these attacks, training benign models from poisoned data has garnered considerable interest from researchers. High-performing defenses often rely on additional clean subsets, which is untenable due to increasing privacy concerns and data scarcity. In the absence of clean subsets, defenders resort to complex feature extraction and analysis, resulting in excessive overhead and compromised performance. To address these challenges, we identify that the key lies in sufficiently utilizing both the easier-to-obtain target labels and clean hard samples. In this work, we propose a Bi-perspective Splitting Defense (BSD). BSD distinguishes clean samples using both semantic and loss-statistic characteristics, through open set recognition-based splitting (OSS) and altruistic model-based data splitting (ALS) respectively. Through extensive experiments on benchmark datasets and against representative attacks, we empirically demonstrate that BSD improves the Defense Effectiveness Rating (DER) by an average of 16.29% compared to 5 state-of-the-art defenses, achieving clean-data-free backdoor security.
Paperid:2354
Authors:Sho Sonoda · Yuka Hashimoto · Isao Ishikawa · Masahiro Ikeda
Abstract:
We present a constructive universal approximation theorem for learning machines equipped with joint-group-equivariant feature maps, called joint-equivariant machines, based on group representation theory. ``Constructive'' here indicates that the distribution of parameters is given in a closed-form expression known as the ridgelet transform. Joint-group-equivariance encompasses a broad class of feature maps that generalize classical group-equivariance. In particular, fully-connected networks are *not* group-equivariant *but* are joint-group-equivariant. Our main theorem also unifies the universal approximation theorems for shallow and deep networks. Until this study, the universality of deep networks had been shown in a different manner from that of shallow networks; our results discuss them on common ground, so the approximation schemes of various learning machines can now be understood in a unified manner. As applications, we show the constructive universal approximation properties of four examples: the depth-$n$ joint-equivariant machine, the depth-$n$ fully-connected network, the depth-$n$ group-convolutional network, and a new depth-$2$ network with quadratic forms whose universality was previously unknown.
Paperid:2355
Authors:Yingying Deng · Xiangyu He · Changwang Mei · Peisong Wang · Fan Tang
Abstract:
Though Rectified Flows (ReFlows) with distillation offer a promising way for fast sampling, fast inversion, which transforms images back to structured noise for reconstruction and subsequent editing, remains unsolved. This paper introduces FireFlow, a simple yet effective zero-shot approach that inherits the startling capacity of ReFlow-based models (such as FLUX) in generation while extending their capabilities to accurate inversion and editing in \textbf{8} steps. We first demonstrate that a carefully designed numerical solver is pivotal for ReFlow inversion, enabling accurate inversion and reconstruction with the precision of a second-order solver while maintaining the practical efficiency of a first-order Euler method. This solver achieves a $3\times$ runtime speedup compared to state-of-the-art ReFlow inversion and editing techniques, while delivering smaller reconstruction errors and superior editing results in a training-free mode.
Paperid:2356
Authors:Hohyun Kim · Seunggeun Lee · Min-hwan Oh
Abstract:
Generative Flow Networks (GFlowNets) offer a powerful framework for sampling graphs in proportion to their rewards. However, existing approaches suffer from systematic biases due to inaccuracies in state transition probability computations. These biases, rooted in the inherent symmetries of graphs, impact both atom-based and fragment-based generation schemes. Symmetry is particularly crucial in graph-based applications such as drug discovery, where molecular structure and function are tightly linked. To address this challenge, we introduce Symmetry-Aware GFlowNets (SA-GFN), a method that incorporates symmetry corrections into the learning process through reward scaling. By integrating bias correction directly into the reward structure, SA-GFN eliminates the need for explicit state transition computations. Empirical results show that SA-GFN enables unbiased sampling while enhancing diversity and consistently generating high-reward graphs that closely match the target distribution.
Paperid:2357
Authors:Fangming Cui · Yonggang Zhang · Xuan Wang · Xinmei Tian · Jun Yu
Abstract:
Recent developments in parameter optimization have significantly improved performance in target-specific tasks. However, these partial parameter optimizing methods often struggle to tackle target-unspecific tasks effectively. This may be attributed to the fact that overfitting during training causes the model to forget general knowledge that strongly benefits target-unspecific tasks. To alleviate this issue, we propose a novel Features Matrix (FM) regularization approach designed to enhance these models on target-unspecific tasks. Our method extracts and leverages general knowledge, shaping a Features Matrix (FM). Specifically, the FM captures the semantics of diverse inputs from a deep and fine-grained perspective, preserving essential general knowledge and mitigating the risk of overfitting. Representative evaluations demonstrate that: 1) the FM is compatible with existing frameworks as a generic and flexible module, and 2) the FM is highly effective in enhancing target-unspecific tasks, achieving state-of-the-art performance.
Paperid:2358
Authors:Kun Cheng · Xiao He · Lei Yu · Zhijun Tu · Mingrui Zhu · Nannan Wang · Xinbo Gao · Jie Hu
Abstract:
Diffusion models have transformed generative modeling but suffer from scalability limitations due to computational overhead and inflexible architectures that process all generative stages and tokens uniformly. In this work, we introduce Diff-MoE, a novel framework that combines Diffusion Transformers with Mixture-of-Experts to exploit both temporal adaptability and spatial flexibility. Our design incorporates expert-specific timestep conditioning, allowing each expert to process different spatial tokens while adapting to the generative stage, so as to dynamically allocate resources based on both the temporal and spatial characteristics of the generative task. Additionally, we propose a globally-aware feature recalibration mechanism that amplifies the representational capacity of expert modules by dynamically adjusting feature contributions based on input relevance. Extensive experiments on image generation benchmarks demonstrate that Diff-MoE significantly outperforms state-of-the-art methods. Our work demonstrates the potential of integrating diffusion models with expert-based designs, offering a scalable and effective framework for advanced generative modeling.
Paperid:2359
Authors:Yassine Abbahaddou · Fragkiskos Malliaros · Johannes Lutzeyer · Amine Aboussalah · Michalis Vazirgiannis
Abstract:
Graph Neural Networks (GNNs) have shown great promise in tasks like node and graph classification, but they often struggle to generalize, particularly to unseen or out-of-distribution (OOD) data. These challenges are exacerbated when training data is limited in size or diversity. To address these issues, we introduce a theoretical framework using Rademacher complexity to compute a regret bound on the generalization error and then characterize the effect of data augmentation. This framework informs the design of GMM-GDA, an efficient graph data augmentation (GDA) algorithm leveraging the capability of Gaussian Mixture Models (GMMs) to approximate any distribution. Our approach not only outperforms existing augmentation techniques in terms of generalization but also offers improved time complexity, making it highly suitable for real-world applications.
Paperid:2360
Authors:Ramya Ramalingam · Shayan Kiyani · Aaron Roth
Abstract:
Existing algorithms for online conformal prediction---guaranteeing marginal coverage in adversarial settings---are variants of online gradient descent (OGD), but their analyses do not follow from the regret guarantee of OGD. What is the relationship between no-regret learning and online conformal prediction? We show that although standard regret guarantees imply marginal coverage in i.i.d. settings, this connection fails as soon as we either move to adversarial environments or ask for group-conditional coverage. On the other hand, we show a tight connection between threshold-calibrated coverage and swap-regret in adversarial settings, which extends to group-conditional (multi-valid) coverage. We also show that algorithms in the follow-the-perturbed-leader family of no-regret learning algorithms (which includes online gradient descent) can be used to give marginal coverage guarantees in adversarial settings, and show how to extend this to group-conditional coverage guarantees. Via this connection we analyze and conduct experiments using a multi-group generalization of the ACI algorithm of Gibbs & Candes (2021).
Paperid:2361
Authors:Samuel Holt · Dhruva Tirumala · Todor Davchev · Ben Moran · Atil Iscen · Antoine Laurens · Yixin Lin · Erik Frey · Francesco Romano · Markus Wulfmeier · Nicolas Heess
Abstract:
High-frequency control in continuous action and state spaces is essential for practical applications in the physical world. Directly applying end-to-end reinforcement learning to high-frequency control tasks struggles with assigning credit to actions across long temporal horizons, compounded by the difficulty of efficient exploration. The alternative, learning low-frequency policies that guide higher-frequency controllers (e.g., proportional-derivative (PD) controllers), can result in a limited total expressiveness of the combined control system, hindering overall performance. We introduce EvoControl, a novel bi-level policy learning framework for learning both a slow high-level policy (using PPO) and a fast low-level policy (using Evolution Strategies) for solving continuous control tasks. Learning with Evolution Strategies for the lower policy allows robust learning over the long horizons that crucially arise when operating at higher frequencies. This enables EvoControl to learn to control interactions at a high frequency, benefiting from more efficient exploration and credit assignment than direct high-frequency torque control, without the need to hand-tune PD parameters. We empirically demonstrate that EvoControl can achieve a higher evaluation reward for continuous-control tasks compared to existing approaches, specifically excelling in tasks where high-frequency control is needed, such as those requiring safety-critical fast reactions.
Paperid:2362
Authors:Osman Berke Guney · Ketan Saichandran · Karim Elzokm · Ziming Zhang · Vijaya Kolachalama
Abstract:
In many practical applications, including medicine, acquiring all relevant data for machine learning models is often infeasible due to constraints on time, cost, and resources. This makes it important to selectively acquire only the most informative features, yet traditional static feature selection methods fall short in scenarios where feature importance varies across instances. Here, we propose an active feature acquisition (AFA) framework, which dynamically selects features based on their importance to each individual case. Our method leverages local explanation techniques to generate instance-specific feature importance rankings. We then reframe the AFA problem as a feature prediction task, introducing a policy network grounded in a decision transformer architecture. This policy network is trained to select the next most informative feature by learning from the feature importance rankings. As a result, features are acquired sequentially, ordered by their predictive significance, leading to more efficient feature selection and acquisition. Extensive experiments on multiple datasets demonstrate that our approach outperforms current state-of-the-art AFA methods in both predictive accuracy and feature acquisition efficiency. These findings highlight the promise of an explainability-driven AFA strategy in scenarios where the cost of feature acquisition is a key concern.
Paperid:2363
Authors:Anh Duc Nguyen · Ilia Markov · Zhengqing Wu · Ali Ramezani-Kebrya · Kimon Antonakopoulos · Dan Alistarh · Volkan Cevher
Abstract:
Modern deep neural networks exhibit heterogeneity across numerous layers of various types, such as residuals, multi-head attention, etc., due to varying structures (dimensions, activation functions, etc.) and distinct representation characteristics, which impact predictions. We develop a general layer-wise quantization framework with tight variance and code-length bounds, adapting to these heterogeneities over the course of training. We then apply a new layer-wise quantization technique within distributed variational inequalities (VIs), proposing a novel Quantized Optimistic Dual Averaging (QODA) algorithm with adaptive learning rates, which achieves competitive convergence rates for monotone VIs. We empirically show that QODA achieves up to a $150\%$ speedup over the baselines in end-to-end training time for training Wasserstein GAN on $12+$ GPUs.
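As background for what "layer-wise quantization" means here, a generic unbiased stochastic quantizer applied per layer looks roughly as follows. This is a textbook sketch under the usual max-norm scaling, not the paper's QODA scheme or its adaptive variant:

```python
import math
import random

def quantize_layer(vec, levels=16, seed=0):
    """Unbiased stochastic uniform quantization of one layer's update vector.

    Each coordinate is scaled by the layer's max magnitude, then rounded up or
    down at random with probability equal to its fractional position on the
    grid, so the quantized vector equals the input in expectation.
    """
    rng = random.Random(seed)
    scale = max(abs(x) for x in vec) or 1.0
    out = []
    for x in vec:
        y = x / scale * levels                   # position on the level grid
        lo = math.floor(y)
        q = lo + (1 if rng.random() < y - lo else 0)
        out.append(q * scale / levels)           # back to original units
    return out
```

Quantizing per layer lets the grid adapt to each layer's scale, which is precisely the kind of heterogeneity the framework above is designed to exploit.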
Paperid:2364
Authors:Heng Dong · Kefei Duan · Chongjie Zhang
Abstract:
Large Language Models (LLMs) have achieved remarkable advancements in natural language processing tasks, yet they encounter challenges in complex decision-making scenarios that require long-term reasoning and alignment with high-level objectives. Existing methods either rely on short-term auto-regressive action generation or face limitations in accurately simulating rollouts and assessing outcomes, leading to sub-optimal decisions. This paper introduces a novel LLM-based Actor-Critic framework, termed LAC, that effectively improves LLM policies with long-term action evaluations in a principled and scalable way. Our approach addresses two key challenges: (1) extracting robust action evaluations by computing Q-values via token logits associated with positive/negative outcomes, enhanced by future trajectory rollouts and reasoning; and (2) enabling efficient policy improvement through a gradient-free mechanism. Experiments across diverse environments -- including high-level decision-making (ALFWorld), low-level action spaces (BabyAI-Text), and large action spaces (WebShop) -- demonstrate the framework’s generality and superiority over state-of-the-art methods. Notably, our approach achieves competitive performance using 7B/8B parameter LLMs, even outperforming baseline methods employing GPT-4 in complex tasks. These results underscore the potential of integrating structured policy optimization with LLMs’ intrinsic knowledge to advance decision-making capabilities in multi-step environments.
Paperid:2365
Authors:Na Lee · Konstantinos Barmpas · Yannis Panagakis · Dimitrios Adamos · Nikolaos Laskaris · Stefanos Zafeiriou
Abstract:
Foundation Models have demonstrated significant success across various domains in Artificial Intelligence (AI), yet their capabilities for brainwave modeling remain unclear. In this paper, we comprehensively evaluate current Large Brainwave Foundation Models (LBMs) through systematic fine-tuning experiments across multiple Brain-Computer Interface (BCI) benchmark tasks, including memory tasks and sleep stage classification. Our extensive analysis shows that state-of-the-art LBMs achieve only marginal improvements (0.5\%) over traditional deep architectures while requiring significantly more parameters (millions vs thousands), raising important questions about their efficiency and applicability in BCI contexts. Moreover, through detailed ablation studies and Low-Rank Adaptation (LoRA), we significantly reduce trainable parameters without performance degradation, while demonstrating that architectural and training inefficiencies limit LBMs' current capabilities. Our experiments span both full model fine-tuning and parameter-efficient adaptation techniques, providing insights into optimal training strategies for BCI applications. We pioneer the application of LoRA to LBMs, revealing that performance benefits generally emerge when adapting multiple neural network components simultaneously. These findings highlight the critical need for domain-specific development strategies to advance LBMs, suggesting that current architectures may require redesign to fully leverage the potential of foundation models in brainwave analysis.
Paperid:2366
Authors:Ryan Liu · Jiayi Geng · Addison J. Wu · Ilia Sucholutsky · Tania Lombrozo · Thomas Griffiths
Abstract:
Chain-of-thought (CoT) prompting has become a widely used strategy for improving large language and multimodal model performance. However, it is still an open question in what settings CoT systematically reduces performance. In this paper, we seek to identify the characteristics of tasks where CoT reduces performance by drawing inspiration from cognitive psychology, focusing on six representative tasks from the psychological literature where deliberation hurts performance in humans. In three of these tasks, state-of-the-art models exhibit significant performance drop-offs with CoT (up to a 36.3\% drop in absolute accuracy for OpenAI o1-preview compared to GPT-4o), while in others, CoT effects are mixed, with positive, neutral, and negative changes. While models and humans do not exhibit perfectly parallel cognitive processes, considering cases where thinking has negative consequences for humans helps identify settings where it negatively impacts models. By connecting the literature on human verbal thinking and deliberation with evaluations of CoT, we offer a perspective for understanding the impact of inference-time reasoning.
Paperid:2367
Authors:Thomas Schmied · Thomas Adler · Vihang Patil · Maximilian Beck · Korbinian Pöppel · Johannes Brandstetter · Günter Klambauer · Razvan Pascanu · Sepp Hochreiter
Abstract:
In recent years, there has been a trend in the field of Reinforcement Learning (RL) towards large action models trained offline on large-scale datasets via sequence modeling. Existing models are primarily based on the Transformer architecture, which results in powerful agents. However, due to slow inference times, Transformer-based approaches are impractical for real-time applications, such as robotics. Recently, modern recurrent architectures, such as xLSTM and Mamba, have been proposed that exhibit parallelization benefits during training similar to the Transformer architecture while offering fast inference. In this work, we study the aptitude of these modern recurrent architectures for large action models. Consequently, we propose a Large Recurrent Action Model (LRAM) with an xLSTM at its core that comes with linear-time inference complexity and natural sequence length extrapolation abilities. Experiments on 432 tasks from 6 domains show that LRAM compares favorably to Transformers in terms of performance and speed.
Paperid:2368
Authors:Ilias Diakonikolas · Daniel Kane · Jasper C.H. Lee · Thanasis Pittas · David Woodruff · Samson Zhou
Abstract:
We study the problem of distributed distinct element estimation, where $\alpha$ servers each receive a subset of a universe $[n]$ and aim to compute a $(1+\varepsilon)$-approximation to the number of distinct elements using minimal communication. While prior work establishes a worst-case bound of $\Theta\left(\alpha\log n+\frac{\alpha}{\varepsilon^2}\right)$ bits, these results rely on assumptions that may not hold in practice. We introduce a new parameterization based on the number $C = \frac{\beta}{\varepsilon^2}$ of pairwise collisions, i.e., instances where the same element appears on multiple servers, and design a protocol that uses only $O\left(\alpha\log n\log\log n+\frac{\sqrt{\beta}}{\varepsilon^2} \log n\right)$ bits, breaking previous lower bounds when $C$ is small. We further improve our algorithm under assumptions on the number of distinct elements or collisions and provide matching lower bounds in all regimes, establishing $C$ as a tight complexity measure for the problem. Finally, we consider streaming algorithms for distinct element estimation parameterized by the number of items with frequency larger than $1$. Overall, our results offer insight into why statistical problems with known hardness results can be efficiently solved in practice.
Paperid:2369
Authors:Gabriele DAcunto · Fabio Massimo Zennaro · Yorgos Felekis · Paolo Di Lorenzo
Abstract:
Structural causal models (SCMs) allow us to investigate complex systems at multiple levels of resolution. The causal abstraction (CA) framework formalizes the mapping between high- and low-level SCMs. We address CA learning in a challenging and realistic setting, where SCMs are inaccessible, interventional data is unavailable, and sample data is misaligned. A key principle of our framework is semantic embedding, formalized as the high-level distribution lying on a subspace of the low-level one. This principle naturally links linear CA to the geometry of the Stiefel manifold. We present a category-theoretic approach to SCMs that enables the learning of a CA by finding a morphism between the low- and high-level probability measures, adhering to the semantic embedding principle. Consequently, we formulate a general CA learning problem. As an application, we solve the latter problem for linear CA, considering Gaussian measures and the Kullback-Leibler divergence as the objective. Given the nonconvexity of the learning task, we develop three algorithms building upon existing paradigms for Riemannian optimization. We demonstrate that the proposed methods succeed on both synthetic and real-world brain data with different degrees of prior information about the structure of CA.
Paperid:2370
Authors:Yuxin Zuo · Shang Qu · Yifei Li · Zhang-Ren Chen · Xuekai Zhu · Ermo Hua · Kaiyan Zhang · Ning Ding · Bowen Zhou
Abstract:
We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes $4,460$ questions spanning $17$ specialties and $11$ body systems. It includes two subsets, Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks with simple QA pairs generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability. We evaluate $16$ leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models.
Paperid:2371
Authors:Yaopei Zeng · Yuanpu Cao · Bochuan Cao · Yurui Chang · Jinghui Chen · Lu Lin
Abstract:
Recent advances in diffusion models have significantly enhanced the quality of image synthesis, yet they have also introduced serious safety concerns, particularly the generation of Not Safe for Work (NSFW) content. Previous research has demonstrated that adversarial prompts can be used to generate NSFW content. However, such adversarial text prompts are often easily detectable by text-based filters, limiting their efficacy. In this paper, we expose a previously overlooked vulnerability: adversarial image attacks targeting Image-to-Image (I2I) diffusion models. We propose AdvI2I, a novel framework that manipulates input images to induce diffusion models to generate NSFW content. By optimizing a generator to craft adversarial images, AdvI2I circumvents existing defense mechanisms, such as Safe Latent Diffusion (SLD), without altering the text prompts. Furthermore, we introduce AdvI2I-Adaptive, an enhanced version that adapts to potential countermeasures and minimizes the resemblance between adversarial images and NSFW concept embeddings, making the attack more resilient against defenses. Through extensive experiments, we demonstrate that both AdvI2I and AdvI2I-Adaptive can effectively bypass current safeguards, highlighting the urgent need for stronger security measures to address the misuse of I2I diffusion models.
Paperid:2372
Authors:Xi Wang · Laurence Aitchison
Abstract:
The scaling of the optimal AdamW weight decay hyperparameter with model and dataset size is critical as we seek to build larger models, but is poorly understood. We show that weights learned by AdamW can be understood as an exponential moving average (EMA) of recent updates. This gives critical insights for how to set the weight decay in AdamW, and how the weight decay should scale with model and dataset size. In particular, the key hyperparameter for an exponential moving average is the EMA timescale. Intuitively, the EMA timescale can be understood as the number of recent iterations the EMA averages over. We find that the optimal timescale, measured in epochs, is roughly constant as we change model and dataset size. Moreover, given a learning rate, there is a one-to-one mapping from the EMA timescale to the weight decay hyperparameter. Thus, if the optimal EMA timescale is constant, that implies that as the dataset size increases, the optimal weight decay should fall and as the model size increases, the optimal weight decay should increase (if we follow the muP recommendation for scaling the learning rate). We validate these scaling rules on ResNet-18 and Vision Transformers trained on CIFAR-10 and ImageNet, and on NanoGPT pre-training on OpenWebText. Finally, we found that as training progresses, muP's learning rate scaling breaks down for AdamW unless weight decay is scaled appropriately.
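The claimed one-to-one mapping can be made concrete. With decoupled weight decay, each AdamW step multiplies the weights by $(1 - \eta\lambda)$ for learning rate $\eta$ and weight decay $\lambda$, so the EMA forgets on a timescale of roughly $1/(\eta\lambda)$ iterations. Inverting this gives a weight decay for a target timescale; the exact constant here is a back-of-the-envelope reading of the abstract, not the paper's formula:

```python
def weight_decay_for_timescale(lr, timescale_epochs, iters_per_epoch):
    """Weight decay whose implied EMA timescale matches a target in epochs.

    Under the EMA view of AdamW, the averaging timescale is about
    1 / (lr * wd) iterations, so wd = 1 / (lr * timescale_in_iterations).
    """
    return 1.0 / (lr * timescale_epochs * iters_per_epoch)
```

Note how this reproduces the scaling rule in the abstract: doubling the dataset doubles iters_per_epoch at fixed batch size, which halves the implied optimal weight decay.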
Paperid:2373
Authors:Gleb Rodionov · Liudmila Prokhorenkova
Abstract:
Neural algorithmic reasoning aims to capture computations with neural networks via learning the models to imitate the execution of classic algorithms. While common architectures are expressive enough to contain the correct model in the weights space, current neural reasoners are struggling to generalize well on out-of-distribution data. On the other hand, classic computations are not affected by distributional shifts as they can be described as transitions between discrete computational states. In this work, we propose to force neural reasoners to maintain the execution trajectory as a combination of finite predefined states. To achieve that, we separate discrete and continuous data flows and describe the interaction between them. Trained with supervision on the algorithm's state transitions, such models are able to perfectly align with the original algorithm. To show this, we evaluate our approach on multiple algorithmic problems and get perfect test scores both in single-task and multitask setups. Moreover, the proposed architectural choice allows us to prove the correctness of the learned algorithms for any test data.
Paperid:2374
Authors:FENGYUN WANG · Sicheng Yu · Jiawei Wu · Jinhui Tang · Hanwang Zhang · Qianru Sun
Abstract:
Large vision-language models (LVLMs) have significantly advanced numerous fields. In this work, we explore how to harness their potential to address 3D scene understanding tasks, using 3D question answering (3D-QA) as a representative example. Due to the limited training data in 3D, we do not train LVLMs but infer in a zero-shot manner. Specifically, we sample 2D views from a 3D point cloud and feed them into 2D models to answer a given question. When the 2D model is chosen, e.g., LLAVA-OV, the quality of sampled views matters the most. We propose cdViews, a novel approach to automatically selecting critical and diverse Views for 3D-QA. cdViews consists of two key components: viewSelector, which prioritizes critical views based on their potential to provide answer-specific information, and viewNMS, which enhances diversity by removing redundant views based on spatial overlap. We evaluate cdViews on the widely-used ScanQA and SQA benchmarks, demonstrating that it achieves state-of-the-art performance in 3D-QA while relying solely on 2D models without fine-tuning. These findings support our belief that 2D LVLMs are currently the most effective alternative to resource-intensive 3D LVLMs for addressing 3D tasks. Our code is in the Appendix.
Paperid:2375
Authors:Jean-Francois Ton · Muhammad Faaiz Taufiq · Yang Liu
Abstract:
Large Language Models (LLMs) have shown impressive performance in complex reasoning tasks through the use of Chain-of-Thought (CoT) reasoning, allowing models to break down problems into manageable sub-tasks. However, existing CoT evaluation techniques either require annotated CoT data or fall short of accurately assessing intermediate reasoning steps, leading to high rates of false positives. In this paper, we formalize CoT reasoning in LLMs through an information-theoretic lens. Specifically, our framework quantifies the `information gain' at each reasoning step, enabling the identification of failure modes in LLMs without the need for expensive annotated datasets. We demonstrate the efficacy of our approach through extensive experiments on toy arithmetic, GSM8K and PRM800k datasets, where it significantly outperforms existing outcome-based methods by providing more accurate insights into model performance on individual tasks.
Paperid:2376
Authors:Henrik von Kleist · Joshua Wendland · Ilya Shpitser · Carsten Marr
Abstract:
Feature importance metrics are essential for understanding the relevance of individual features and ensuring the explainability of machine learning models. Since missing data frequently arises in real-world settings, it is important to evaluate feature importance unambiguously in such cases. We introduce the distinction between two evaluation frameworks for feature importance under missing data: (1) importance under the full data, assuming no missingness, and (2) importance under the observed data, reflecting the measurement policy that governs data collection. While the full data perspective offers insights into the data-generating process, it often relies on unrealistic assumptions and cannot guide decisions when missingness persists at deployment. Since neither framework directly informs improvements in data collection, we additionally introduce the feature measurement importance gradient (FMIG), a novel, model-agnostic metric that identifies which features should be measured more frequently to enhance predictive performance. Using synthetic data, we illustrate key differences between these metrics and the risks of conflating them.
Paperid:2377
Authors:Kaicheng Fu · Changde Du · Jie Peng · Kunpeng Wang · Shuangchen Zhao · Xiaoyu Chen · Huiguang He
Abstract:
Emotion recognition systems face significant challenges in real-world applications, where novel emotion categories continually emerge and multiple emotions often co-occur. This paper introduces multi-label fine-grained class incremental emotion decoding, which aims to develop models capable of incrementally learning new emotion categories while maintaining the ability to recognize multiple concurrent emotions. We propose an Augmented Emotional Semantics Learning (AESL) framework to address two critical challenges: past- and future-missing partial label problems. AESL incorporates an augmented Emotional Relation Graph (ERG) for reliable soft label generation and affective dimension-based knowledge distillation for future-aware feature learning. We evaluate our approach on three datasets spanning brain activity and multimedia domains, demonstrating its effectiveness in decoding up to 28 fine-grained emotion categories. Results show that AESL significantly outperforms existing methods while effectively mitigating catastrophic forgetting.
Paperid:2378
Authors:Han Zhong · Zikang Shan · Guhao Feng · Wei Xiong · Xinle Cheng · Li Zhao · Di He · Jiang Bian · Liwei Wang
Abstract:
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards---a challenging scenario in traditional deep reinforcement learning. Despite the great successes of PPO in the alignment of state-of-the-art closed-source large language models (LLMs), its open-source implementation is still largely sub-optimal, as widely reported by numerous research studies. To address these issues, we introduce a framework that models RLHF problems as a Markov decision process (MDP), enabling the capture of fine-grained token-wise information. Furthermore, we provide theoretical insights that demonstrate the superiority of our MDP framework over the previous sentence-level bandit formulation. Under this framework, we introduce an algorithm, dubbed Reinforced Token Optimization (\texttt{RTO}), which learns the token-wise reward function from preference data and performs policy optimization based on this learned token-wise reward signal. Theoretically, \texttt{RTO} is proven to have the capability of finding the near-optimal policy sample-efficiently. For its practical implementation, \texttt{RTO} innovatively integrates Direct Preference Optimization (DPO) and PPO. DPO, originally derived from sparse sentence rewards, surprisingly provides us with a token-wise characterization of response quality, which is seamlessly incorporated into our subsequent PPO training stage. We conduct extensive experiments to evaluate \texttt{RTO} against PPO and other direct preference learning algorithms. The results highlight the effectiveness of RTO, with the algorithm outperforming PPO by 7.5 points on the AlpacaEval 2 benchmark and by 4.1 points on Arena-Hard.
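The standard DPO objective that RTO builds on can be sketched for a single preference pair; this is a minimal illustration with made-up log-probabilities, and the token-wise reward decomposition described in the abstract is not reproduced here.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid of the reward margin.

    The implicit reward of a response is beta * (policy log-prob minus
    reference log-prob); DPO pushes the chosen (w) response's implicit
    reward above the rejected (l) one's.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log(1.0 + math.exp(-margin))  # == -log(sigmoid(margin))

# Illustrative sequence log-probabilities (not from any real model).
loss = dpo_loss(logp_w=-10.0, logp_l=-12.0,
                ref_logp_w=-11.0, ref_logp_l=-11.0, beta=0.1)
```

Because a sequence log-probability is a sum of token log-probabilities, the implicit reward decomposes over tokens, which is the kind of token-wise signal the abstract refers to.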
Paperid:2379
Authors:Haosen Ge · Hamsa Bastani · Osbert Bastani
Abstract:
Conformal prediction has emerged as an effective strategy for uncertainty quantification by modifying a model to output sets of labels instead of a single label. These prediction sets come with the guarantee that they contain the true label with high probability. However, conformal prediction typically requires a large calibration dataset of i.i.d. examples. We consider the online learning setting, where examples arrive over time, and the goal is to construct prediction sets dynamically. Departing from existing work, we assume semi-bandit feedback, where we only observe the true label if it is contained in the prediction set. For instance, consider calibrating a document retrieval model to a new domain; in this setting, a user would only be able to provide the true label if the target document is in the prediction set of retrieved documents. We propose a novel conformal prediction algorithm targeted at this setting, and prove that it obtains sublinear regret compared to the optimal conformal predictor. We evaluate our algorithm on a retrieval task, an image classification task, and an auction price-setting task, and demonstrate that it empirically achieves good performance compared to several baselines.
Paperid:2380
Authors:Antoine Dedieu · Joseph Ortiz · Xinghua Lou · Carter Wendelken · Wolfgang Lehrach · J Swaroop Guntupalli · Miguel Lazaro-Gredilla · Kevin Murphy
Abstract:
We present an approach to model-based RL that achieves new state-of-the-art performance on the challenging Craftax-classic benchmark, an open-world 2D survival game that requires agents to exhibit a wide range of general abilities---such as strong generalization, deep exploration, and long-term reasoning. With a series of careful design choices aimed at improving sample efficiency, our MBRL algorithm achieves a reward of 67.42% after only 1M environment steps, significantly outperforming DreamerV3, which achieves 53.2%, and, for the first time, exceeds human performance of 65.0%. Our method starts by constructing a SOTA model-free baseline, using a novel policy architecture that combines CNNs and RNNs. We then add three improvements to the standard MBRL setup: (a) "Dyna with warmup", which trains the policy on real and imaginary data, (b) a "nearest neighbor tokenizer" on image patches, which improves the scheme used to create the transformer world model (TWM) inputs, and (c) "block teacher forcing", which allows the TWM to reason jointly about the future tokens of the next timestep.
Paperid:2381
Authors:Luyang Liu · Jonas Pfeiffer · Jiaxing Wu · Jun Xie · Arthur Szlam
Abstract:
Techniques enabling large language models (LLMs) to "think more" by generating and attending to intermediate reasoning steps have shown promise in solving complex problems. However, the standard approaches generate sequences of discrete tokens immediately before responding, and so they can incur significant latency costs and be challenging to optimize. In this work, we demonstrate that a frozen LLM can be augmented with an offline coprocessor that operates on the model's key-value (kv) cache. This coprocessor augments the cache with a set of latent embeddings designed to improve the fidelity of subsequent decoding. We train this coprocessor using the language modeling loss from the decoder on standard pretraining data, while keeping the decoder itself frozen. This approach enables the model to learn, in an end-to-end differentiable fashion, how to distill additional computation into its kv-cache. Because the decoder remains unchanged, the coprocessor can operate offline and asynchronously, and the language model can function normally if the coprocessor is unavailable or if a given cache is deemed not to require extra computation. We show experimentally that when a cache is augmented, the decoder achieves lower perplexity on numerous subsequent tokens. Furthermore, even without any task-specific training, our experiments demonstrate that cache augmentation consistently improves performance across a range of reasoning-intensive tasks.
Paperid:2382
Authors:Yiyang Fang · Jian Liang · Wenke Huang · He Li · Kehua Su · Mang Ye
Abstract:
Multimodal large language models (MLLMs) have achieved impressive progress in tasks such as visual question answering and visual understanding, but they still face significant challenges in emotional reasoning. Current methods to enhance emotional understanding typically rely on fine-tuning or manual annotations, which are resource-intensive and limit scalability. In this work, we focus on improving the ability of MLLMs to capture emotions during the inference phase. Specifically, MLLMs encounter two main issues: they struggle to distinguish between semantically similar emotions, leading to misclassification, and they are overwhelmed by redundant or irrelevant visual information, which distracts from key emotional cues. To address these, we propose Sharpening Emotion Perception in MLLMs (SEPM), which incorporates a Confidence-Guided Coarse-to-Fine Inference framework to refine emotion classification by guiding the model through simpler tasks. Additionally, SEPM employs Focus-on-Emotion Visual Augmentation to reduce visual redundancy by directing the attention of models to relevant emotional cues in images. Experimental results demonstrate that SEPM significantly improves MLLM performance on emotion-related tasks, providing a resource-efficient and scalable solution for emotion recognition.
Paperid:2383
Authors:Kangjie Zheng · Junwei Yang · Siyue Liang · Bin Feng · Zequn Liu · Wei Ju · Zhiping Xiao · Ming Zhang
Abstract:
Masked Language Models (MLMs) have achieved remarkable success in many self-supervised representation learning tasks. MLMs are trained by randomly masking portions of the input sequences with $\texttt{[MASK]}$ tokens and learning to reconstruct the original content based on the remaining context. This paper explores the impact of $\texttt{[MASK]}$ tokens on MLMs. Analytical studies show that masking tokens can introduce the \textit{corrupted semantics} problem, wherein the corrupted context may convey multiple, ambiguous meanings. This problem is also a key factor affecting the performance of MLMs on downstream tasks. Based on these findings, we propose a novel enhanced-context MLM, ExLM. Our approach expands $\texttt{[MASK]}$ tokens in the input context and models the dependencies between these expanded states. This enhancement increases context capacity and enables the model to capture richer semantic information, effectively mitigating the corrupted semantics problem during pre-training. Experimental results demonstrate that ExLM achieves significant performance improvements in both text modeling and SMILES modeling tasks. Further analysis confirms that ExLM enriches semantic representations through context enhancement and effectively reduces the semantic multimodality commonly observed in MLMs. Source codes of ExLM will be released upon publication.
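The standard MLM masking step the abstract describes can be sketched as follows; this is a minimal illustration (real recipes like BERT's also sometimes keep or randomly replace selected tokens), and the sentence and seed below are arbitrary.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=1):
    """Minimal MLM masking: replace randomly selected positions with [MASK].

    labels[i] holds the original token at masked positions (the
    reconstruction target) and None elsewhere (no loss is computed there).
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

masked, labels = mask_tokens("the cat sat on the mat".split())
```

The "corrupted semantics" issue arises because the remaining context around a [MASK] position may be consistent with several different original tokens.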
Paperid:2384
Authors:Aditya Gorla · Ryan Wang · Zhengtong Liu · Ulzee An · Sriram Sankararaman
Abstract:
We present CACTI, a new masked autoencoding approach for imputing tabular data that leverages structure in missingness patterns and contextual information. Our approach employs a novel median truncated copy masking training strategy that encourages the model to learn from patterns of missingness that are observed in the data. It also incorporates semantic relationships between features — captured by column names and text descriptions — to better represent feature dependence, reducing the reliance on learning feature relationships solely from limited observational data. These dual sources of inductive bias enable CACTI to outperform state-of-the-art methods — a 7.8% average gain across 10 datasets over the next best method (13.4%, 6.1%, and 5.3% under missing not at random, missing at random, and missing completely at random, respectively) — across a diverse range of open datasets and missingness conditions. Our results highlight the value of dataset-specific context and missingness patterns towards significantly enhancing imputation performance.
Paperid:2385
Authors:Yuhua Zhou · Ruifeng Li · Changhai Zhou · Fei Yang · Aimin PAN
Abstract:
Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning method for large language models (LLMs) to adapt to downstream tasks. However, in scenarios where multiple LoRA models are deployed simultaneously, standard LoRA introduces substantial trainable parameters, resulting in significant memory overhead and inference latency, particularly when supporting thousands of downstream tasks on a single server. While existing methods reduce stored parameters via parameter sharing, they fail to capture both local and global information simultaneously. To address this issue, we propose Bi-Share LoRA (BSLoRA), which extends local LoRA with intra-LoRA and inter-LoRA parameter sharing to better capture local and global information. This approach reduces trainable parameters while maintaining or even enhancing model performance. Additionally, we design three transformation methods to improve the compatibility and collaborative efficiency of shared parameters with varying shapes, enhancing overall adaptability. Experiments on the 7B, 8B, and 13B versions of Llama show that BSLoRA, with only 44.59% of the parameters of standard LoRA, outperforms LoRA by approximately 0.33% on commonsense reasoning and 2.08% on MMLU benchmarks.
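The standard LoRA update that BSLoRA extends can be sketched as y = Wx + (alpha/r)·BAx, where only the low-rank factors A and B are trainable; the shapes and values below are toy assumptions, not BSLoRA's sharing scheme.

```python
def lora_forward(x, W, A, B, alpha=16, r=2):
    """y = W x + (alpha / r) * B (A x).

    W (d_out x d_in) is frozen; only A (r x d_in) and B (d_out x r) are
    trained, cutting trainable parameters from d_out*d_in to r*(d_in+d_out).
    """
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

# Toy 2x2 example; B starts at zero, so the adapter is initially a no-op.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.1, 0.0], [0.0, 0.1]]
B = [[0.0, 0.0], [0.0, 0.0]]
y = lora_forward([1.0, 2.0], W, A, B)
```

Parameter-sharing schemes like BSLoRA's reduce memory by reusing parts of A and B across adapters rather than storing independent factors per task.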
Paperid:2386
Authors:Prasanna Mayilvahanan · Thaddäus Wiedemer · Sayak Mallick · Matthias Bethge · Wieland Brendel
Abstract:
Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data and tokenizer determine the scaling trend. In contrast, model size, optimization hyperparameters, and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, have limited impact. Consequently, practitioners should carefully curate suitable pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.
Paperid:2387
Authors:Peng Wang · Yifu Lu · Yaodong Yu · Druv Pai · Qing Qu · Yi Ma
Abstract:
Despite the popularity of transformers in practice, their architectures are empirically designed and neither mathematically justified nor interpretable. Moreover, as indicated by many empirical studies, some components of transformer architectures may be redundant. To derive a fully interpretable transformer architecture with only necessary components, we contend that the goal of representation learning is to compress a set of noisy initial token representations towards a mixture of low-dimensional subspaces. To compress these noisy token representations, an associated denoising operation naturally takes the form of a multi-head (subspace) self-attention. By unrolling such iterative denoising operations into a deep network, we arrive at a highly compact architecture that consists of only self-attention operators with skip connections at each layer. Moreover, we show that each layer performs highly efficient denoising: it improves the signal-to-noise ratio of token representations at a linear rate with respect to the number of layers. Despite its simplicity, extensive experiments on vision and language tasks demonstrate that such a transformer can already achieve performance close to that of standard transformer architectures such as GPT-2 and CRATE.
Paperid:2388
Authors:Jiaze Xu · Shiyu Xia · Xu Yang · JIAQI LYU · Xin Geng
Abstract:
Appropriate parameter initialization strategies are essential for reducing the high computational costs of training large pretrained models in various task scenarios. Graph HyperNetwork (GHN), a parameter initialization method, has recently demonstrated strong performance in initializing models. However, GHN still faces several challenges, including limited effectiveness in initializing larger models, poor performance on smaller datasets, and the requirement of task-specific GHN training, where each new task necessitates retraining the GHN model, leading to increased computational and storage overhead. To overcome these challenges, motivated by the recently proposed Learngene framework, we propose a novel method called Task-Aware Learngene (TAL). Briefly, our approach pretrains a TAL model under the guidance of a well-trained model and then performs multi-task tuning to obtain a shared TAL model that enables parameter prediction based on both model architectures and task-specific characteristics. Extensive experiments show the superiority of TAL. Models initialized with TAL outperform those initialized using the GHN method by an average of 24.39\% in terms of accuracy across the Decathlon datasets.
Paperid:2389
Authors:Yudong W Xu · Wenhao Li · Scott Sanner · Elias Khalil
Abstract:
Constraint Satisfaction Problems (CSPs) find use in many applications, and thus accelerating their solution with machine learning is of wide interest. Most existing approaches rely on supervised learning from feasible solutions or reinforcement learning, paradigms that require either feasible solutions to these NP-Complete CSPs or large training budgets and a complex expert-designed reward signal. To address these challenges, we propose ConsFormer, a self-supervised framework that leverages a Transformer as a solution refiner. ConsFormer constructs a solution to a CSP iteratively in a process that mimics local search. Instead of using feasible solutions as labeled data, we devise differentiable approximations to the discrete constraints of a CSP to guide model training. Our model is trained to improve random assignments for a single step but is deployed iteratively at test time, circumventing the bottlenecks of supervised and reinforcement learning. Our method can tackle out-of-distribution CSPs simply through additional iterations.
Paperid:2390
Authors:Yue Liu · Xiaoxin He · Miao Xiong · Jinlan Fu · Shumin Deng · YINGWEI MA · Jiaheng Zhang · Bryan Hooi
Abstract:
This paper proposes a simple yet effective jailbreak attack named FlipAttack against black-box LLMs. First, based on the autoregressive nature of LLMs, we reveal that they tend to understand text from left to right and struggle to comprehend it when a perturbation is added to the left side. Motivated by these insights, we propose to disguise the harmful prompt by constructing a left-side perturbation merely based on the prompt itself, then generalize this idea to 4 flipping modes. Second, we verify the strong ability of LLMs to perform the text-flipping task and then develop 4 variants to guide LLMs to understand and execute harmful behaviors accurately. These designs keep FlipAttack universal, stealthy, and simple, allowing it to jailbreak black-box LLMs within only 1 query. Experiments on 8 LLMs demonstrate the superiority of FlipAttack. Remarkably, it achieves $\sim$78.97\% attack success rate across 8 LLMs on average and $\sim$98\% bypass rate against 5 guard models on average.
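Two flipping transformations of the kind alluded to can be illustrated generically as simple string operations on a benign example; the paper's exact four modes are not reproduced here.

```python
def flip_chars(prompt):
    """Reverse the prompt character by character."""
    return prompt[::-1]

def flip_words(prompt):
    """Reverse the word order, keeping each word intact."""
    return " ".join(reversed(prompt.split()))

# Benign illustrative prompt; flipping is trivially invertible,
# which is why a model can be asked to recover the original text.
p = "pack my box with five dozen jugs"
flipped = flip_chars(p)
unflipped = flip_chars(flipped)  # flipping twice recovers the original
```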
Paperid:2391
Authors:Michael Sun · Gang Liu · Weize Yuan · Wojciech Matusik · Jie Chen
Abstract:
Recent data-efficient molecular generation approaches exploit graph grammars to introduce interpretability into the generative models. However, grammar learning therein relies on expert annotation or unreliable heuristics for algorithmic inference. We propose Foundation Molecular Grammar (FMG), which leverages multi-modal foundation models (MMFMs) to induce an interpretable molecular language. By exploiting the chemical knowledge of an MMFM, FMG renders molecules as images, describes them as text, and aligns information across modalities using prompt learning. FMG can be used as a drop-in replacement for the prior grammar learning approaches in molecular generation and property prediction. We show that FMG not only excels in synthesizability, diversity, and data efficiency but also offers built-in chemical interpretability for automated molecular discovery workflows.
Paperid:2392
Authors:Ningyi Liao · Zihao Yu · Ruixiao Zeng · Siqiang Luo
Abstract:
Graph Neural Networks (GNNs) have shown promising performance, but at the cost of resource-intensive operations on graph-scale matrices. To reduce computational overhead, previous studies attempt to sparsify the graph or network parameters, but with limited flexibility and precision boundaries. In this work, we propose Unifews, a joint sparsification technique to unify graph and weight matrix operations and enhance GNN learning efficiency. The Unifews design enables adaptive compression across GNN layers with progressively increased sparsity, and is applicable to a variety of architectures with on-the-fly simplification. Theoretically, we establish a novel framework to characterize sparsified GNN learning in view of the graph optimization process, showing that Unifews effectively approximates the learning objective with bounded error and reduced computational overhead. Extensive experiments demonstrate that Unifews achieves efficiency improvements with comparable or better accuracy, including 10-20x matrix operation reduction and up to 100x acceleration on graphs up to billion-edge scale.
Paperid:2393
Authors:Jianqing Zhang · Yang Liu · Jie Fu · Yang Hua · Tianyuan Zou · Jian Cao · Qiang Yang
Abstract:
The rise of generative APIs has fueled interest in privacy-preserving synthetic data generation. While the Private Evolution (PE) algorithm generates Differential Privacy (DP) synthetic images using diffusion model APIs, it struggles with few-shot private data due to the limitations of its DP-protected similarity voting approach. In practice, the few-shot private data challenge is particularly prevalent in specialized domains like healthcare and industry. To address this challenge, we propose a novel API-assisted algorithm, Private Contrastive Evolution (PCE), which iteratively mines inherent inter-class contrastive relationships in few-shot private data beyond individual data points and seamlessly integrates them into an adapted Exponential Mechanism (EM) to optimize DP's utility in an evolution loop. We conduct extensive experiments on four specialized datasets, demonstrating that PCE outperforms PE and other API-assisted baselines. These results highlight the potential of leveraging API access with private data for quality evaluation, enabling the generation of high-quality DP synthetic images and paving the way for more accessible and effective privacy-preserving generative API applications. Our code is available at https://anonymous.4open.science/r/5F1A.
Paperid:2394
Authors:Konstantina Bairaktari · Jiayun Wu · Steven Wu
Abstract:
Conformal prediction is a powerful distribution-free framework for constructing prediction sets with coverage guarantees. Classical methods, such as split conformal prediction, provide marginal coverage, ensuring that the prediction set contains the label of a random test point with a target probability. However, these guarantees may not hold uniformly across different subpopulations, leading to disparities in coverage. Prior work has explored coverage guarantees conditioned on events related to the covariates and label of the test point. We present Kandinsky conformal prediction, a framework that significantly expands the scope of conditional coverage guarantees. In contrast to Mondrian conformal prediction, which restricts its coverage guarantees to disjoint groups—reminiscent of the rigid, structured grids of Piet Mondrian's art—our framework flexibly handles overlapping and fractional group memberships defined jointly on covariates and labels, reflecting the layered, intersecting forms in Wassily Kandinsky's compositions. Our algorithm unifies and extends existing methods, encompassing covariate-based group conditional, class conditional, and Mondrian conformal prediction as special cases, while achieving a minimax-optimal high-probability conditional coverage bound. Finally, we demonstrate the practicality of our approach through empirical evaluation on real-world datasets.
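For reference, the marginal-coverage baseline mentioned here, split conformal prediction, can be sketched in a few lines; the nonconformity score 1 − p(y|x) and the calibration values below are illustrative assumptions, not the paper's method.

```python
import math

def conformal_quantile(cal_scores, alpha=0.1):
    """The ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(cal_scores)[min(k, n) - 1]

def prediction_set(probs, qhat):
    """Keep every label whose nonconformity score 1 - p(y|x) is <= qhat."""
    return [y for y, p in enumerate(probs) if 1.0 - p <= qhat]

# Illustrative calibration scores: 1 - p(true label | x) on held-out data.
cal_scores = [0.1, 0.3, 0.2, 0.4, 0.25, 0.15, 0.35, 0.05, 0.45, 0.3]
qhat = conformal_quantile(cal_scores, alpha=0.2)
pred_set = prediction_set([0.7, 0.2, 0.1], qhat)
```

This construction guarantees only marginal coverage; the group-conditional variants the abstract discusses recalibrate the threshold per (possibly overlapping) group.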
Paperid:2395
Authors:Soobin Um · Beomsu Kim · Jong Chul YE
Abstract:
Minority samples are underrepresented instances located in low-density regions of a data manifold, and are valuable in many generative AI applications, such as data augmentation and creative content generation. Unfortunately, existing diffusion-based minority generators often rely on computationally expensive guidance dedicated to minority generation. To address this, we present a simple yet powerful guidance-free approach called Boost-and-Skip for generating minority samples using diffusion models. The key advantage of our framework is that it requires only two minimal changes to standard generative processes: (i) variance-boosted initialization and (ii) timestep skipping. We highlight that these seemingly trivial modifications are supported by solid theoretical and empirical evidence, thereby effectively promoting the emergence of underrepresented minority features. Our comprehensive experiments demonstrate that Boost-and-Skip greatly enhances the capability of generating minority samples, even rivaling guidance-based state-of-the-art approaches while requiring significantly fewer computations.
Paperid:2396
Authors:Rui Sun · Huayu Mai · Wangkai Li · Yujia Chen · Naisong Luo · Yuan Wang · Tianzhu Zhang
Abstract:
The critical challenge of semi-supervised semantic segmentation lies in how to fully exploit a large volume of unlabeled data to improve the model's generalization performance for robust segmentation. Existing methods mainly rely on confidence-based scoring functions in the prediction space to filter pseudo labels, which suffer from the inherent trade-off between true and false positive rates. In this paper, we carefully design an agent construction strategy to build clean sets of correct (positive) and incorrect (negative) pseudo labels, and propose the Agent Score function (AgScore) to measure the consensus between candidate pixels and these sets. In this way, AgScore takes a step further to capture homogeneous patterns in the embedding space, conditioned on clean positive/negative agents stemming from the prediction space, without sacrificing the merits of the confidence score, yielding a better trade-off. We provide theoretical analysis to understand the mechanism of AgScore, and demonstrate its effectiveness by integrating it into three semi-supervised segmentation frameworks on the Pascal VOC, Cityscapes, and COCO datasets, showing consistent improvements across all data partitions. Code and models will be made available to facilitate future research.
Paperid:2397
Authors:Can Chen · Karla-Luise Herpoldt · Chenchao Zhao · Zichen Wang · Marcus Collins · Shang Shang · Ron Benson
Abstract:
Antibodies are widely used as therapeutics, but their development requires costly affinity maturation, involving iterative mutations to enhance binding affinity. This paper explores a sequence-only scenario for affinity maturation, using solely antibody and antigen sequences. Recently, AlphaFlow wrapped AlphaFold within flow matching to generate diverse protein structures, enabling a sequence-conditioned generative model of structure. Building on this, we propose an \textit{alternating optimization} framework that (1) fixes the sequence to guide structure generation toward high binding affinity using a structure-based predictor, then (2) applies inverse folding to create sequence mutations, refined by a sequence-based predictor. A key challenge is the lack of labeled data for training both predictors. To address this, we develop a \textit{co-teaching} module that incorporates valuable information from noisy biophysical energies into predictor refinement. The sequence-based predictor selects consensus samples to teach the structure-based predictor, and vice versa. Our method, \textit{AffinityFlow}, achieves state-of-the-art performance in affinity maturation experiments.
Paperid:2398
Authors:Boyu Zhang · Aocheng Shen · Bing Liu · Qiankun Zhang · Bin Yuan · Wang · Shenghao Liu · Xianjun Deng
Abstract:
We explore the potential of \emph{AI-enhanced combinatorial optimization theory}, taking online bipartite matching (OBM) as a case study. In the theoretical study of OBM, the \emph{hardness} corresponds to a performance \emph{upper bound} of a specific online algorithm or any possible online algorithms. Typically, these upper bounds derive from challenging instances meticulously designed by theoretical computer scientists. Zhang et al. (ICML 2024) recently provided an example demonstrating how reinforcement learning techniques can enhance the hardness result of a specific OBM model. Their attempt is inspiring but preliminary. It is unclear whether their methods can be applied to other OBM problems with similar breakthroughs. This paper takes a further step by introducing DiMa, a unified and novel framework that aims at understanding the hardness of OBM problems based on denoising diffusion probabilistic models (DDPMs). DiMa models the process of generating hard instances as denoising steps, and optimizes them with a novel reinforcement learning algorithm, named \emph{shortcut policy gradient} (SPG). We first examine DiMa on the classic OBM problem by reproducing its known hardest input instance in the literature. Further, we apply DiMa to two well-known variants of OBM, for which the exact hardness remains an open problem, and we successfully improve their theoretical state-of-the-art upper bounds.
Paperid:2399
Authors:Ahmed Boughdiri · Erwan Scornet · julie Josse
Abstract:
The Risk Difference (RD), an absolute measure of effect, is widely used and well-studied in both randomized controlled trials (RCTs) and observational studies. Complementary to the RD, the Risk Ratio (RR), as a relative measure, is critical for a comprehensive understanding of intervention effects: RD can downplay small absolute changes, while RR can highlight them. Despite its significance, the theoretical study of RR has received less attention, particularly in observational settings. This paper addresses this gap by tackling the estimation of RR in observational data. We propose several RR estimators and establish their theoretical properties, including asymptotic normality and confidence intervals. Through analyses on simulated and real-world datasets, we evaluate the performance of these estimators in terms of bias, efficiency, and robustness to generative data models. We also examine the coverage and length of the associated confidence intervals. Due to the non-linear nature of RR, influence function theory yields two distinct efficient estimators with different convergence assumptions. Based on theoretical and empirical insights, we recommend, among all estimators, one of the two doubly-robust estimators, which, intriguingly, challenges conventional expectations.
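The complementarity of the two measures is easy to see numerically: with outcome risks p1 and p0 in the treated and control groups, RD = p1 − p0 while RR = p1/p0, and a tiny absolute difference can correspond to a doubling of risk. The risks below are illustrative.

```python
def risk_difference(p_treated, p_control):
    """Absolute effect measure: difference in outcome risk."""
    return p_treated - p_control

def risk_ratio(p_treated, p_control):
    """Relative effect measure: ratio of outcome risks."""
    return p_treated / p_control

# A small absolute change can still be a large relative one:
rd = risk_difference(0.002, 0.001)  # looks negligible in absolute terms
rr = risk_ratio(0.002, 0.001)       # yet the risk doubles
```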
Paperid:2400
Authors:Jake Robertson · Noor Awad · Noah Hollmann · Frank Hutter · Samuel Gabriel Müller
Abstract:
Machine learning (ML) systems are utilized in critical sectors such as healthcare, law enforcement, and finance, but often rely on historical data that contains demographic biases, leading to decisions that perpetuate or intensify existing inequalities. Causal and counterfactual fairness provide a transparent, human-in-the-loop framework to mitigate algorithmic discrimination, aligning closely with legal doctrines of direct and indirect discrimination. However, current causal fairness frameworks hold a key limitation in that they assume prior knowledge of the correct causal model, restricting their applicability in complex fairness scenarios where causal models are unknown or difficult to identify. To bridge this gap, we propose FairPFN, a tabular foundation model pre-trained on synthetic causal fairness data to identify and mitigate the causal effects of protected attributes in its predictions. FairPFN's key contribution is that it requires no knowledge of the causal model and demonstrates strong performance across a diverse set of hand-crafted and real-world causal scenarios relative to robust baseline methods. FairPFN paves the way for a promising direction for future research, making causal fairness more accessible to a wider variety of complex fairness problems.
Paperid:2401
Authors:Jimmy Wang · Tom Zollo · Richard Zemel · Hongseok Namkoong
Abstract:
Eliciting information to reduce uncertainty on a latent entity is a critical skill in many application domains, e.g., assessing individual student learning outcomes, diagnosing underlying diseases, or learning user preferences. Though natural language is a powerful medium for this purpose, large language models (LLMs) and existing fine-tuning algorithms lack mechanisms for strategically gathering information to refine their own understanding of the latent entity. We propose an adaptive elicitation framework that actively reduces uncertainty on the latent entity by simulating counterfactual responses. Since probabilistic modeling of an abstract latent entity is difficult, we validate and fine-tune LLM-based uncertainty quantification methods using perplexity over masked future observations produced by the latent entity. Our framework enables the development of sophisticated information-gathering strategies, and we demonstrate its versatility through experiments on dynamic opinion polling and adaptive student assessment.
Paperid:2402
Authors:Liang Huang · Benedict Lee · Daniel Ng · Kelin Xia
Abstract:
Text classification is a fundamental task in Natural Language Processing (NLP). Short text classification has recently attracted much attention due to the growing volume of short texts from various sources with limited labels, and due to the inherent challenges posed by their sparsity in words and semantics. Recent studies have adopted self-supervised contrastive learning across different representations to improve performance. However, most current models face several challenges. First, the augmentation step might not generate positive and negative samples that are, respectively, semantically similar and dissimilar to the anchor. Second, enriching the text data with external auxiliary information might introduce noise into the sparse text data. In addition, such models are limited in capturing higher-order information such as group-wise interactions. In this work, we propose a novel document simplicial complex construction based on text data for a higher-order message-passing mechanism. We enhance short text classification performance by contrasting the structural representation with the sequential representation generated by the transformer mechanism, yielding improved outcomes and mitigating the above issues. The proposed framework, Contrastive Learning with Simplicial Convolutional Networks (C-SCN), leverages the expressive power of graph neural networks, models higher-order information beyond pair-wise relations, and enriches features through contrastive learning. Experimental results on four benchmark datasets demonstrate the capability of C-SCN to outperform existing models in analysing sequential and complex short-text data.
Paperid:2403
Authors:Pedro Santos · Alberto Sardinha · Francisco S. Melo
Abstract:
The general-utility Markov decision processes (GUMDPs) framework generalizes the MDPs framework by considering objective functions that depend on the frequency of visitation of state-action pairs induced by a given policy. In this work, we contribute the first analysis of the impact of the number of trials, i.e., the number of randomly sampled trajectories, in infinite-horizon GUMDPs. We show that, as opposed to standard MDPs, the number of trials plays a key role in infinite-horizon GUMDPs, and the expected performance of a given policy depends, in general, on the number of trials. We consider both discounted and average GUMDPs, where the objective function depends, respectively, on discounted and average frequencies of visitation of state-action pairs. First, we study policy evaluation under discounted GUMDPs, proving lower and upper bounds on the mismatch between the finite and infinite trials formulations. Second, we address average GUMDPs, studying how different classes of GUMDPs impact the mismatch between the finite and infinite trials formulations. Third, we provide a set of empirical results to support our claims, highlighting how the number of trajectories and the structure of the underlying GUMDP influence policy evaluation.
Paperid:2404
Authors:Clément Bonet · Christophe Vauthier · Anna Korba
Abstract:
Many applications in machine learning involve data represented as probability distributions. The emergence of such data requires radically novel techniques to design tractable gradient flows on probability distributions over this type of (infinite-dimensional) object. For instance, being able to flow labeled datasets is a core task for applications ranging from domain adaptation to transfer learning or dataset distillation. In this setting, each class can be represented by the associated conditional distribution of features, and the dataset is a mixture distribution supported on these classes (which are themselves probability distributions), meaning that labeled datasets can be seen as probability distributions over probability distributions. We endow this space with a metric structure from optimal transport, namely the Wasserstein over Wasserstein (WoW) distance, derive a differential structure on this space, and define WoW gradient flows. The latter enable the design of dynamics over this space that decrease a given objective functional. We apply our framework to transfer learning and dataset distillation tasks, leveraging our gradient flow construction as well as novel tractable functionals that take the form of Maximum Mean Discrepancies with Sliced-Wasserstein based kernels between probability distributions.
Paperid:2405
Authors:Rylan Schaeffer · Hailey Schoelkopf · Brando Miranda · Gabriel Mukobi · Varun Madan · Adam Ibrahim · Herbie Bradley · Stella Biderman · Sanmi Koyejo
Abstract:
Predictable behavior from scaling advanced AI systems is an extremely desirable property for engineers, companies, economists and governments alike, and while a well-established literature exists on how pretraining performance scales, predictable scaling behavior on downstream capabilities remains elusive. While many factors are certainly responsible, this paper shines a light on a significant factor that makes predicting scaling behavior on widely used multiple-choice question answering benchmarks challenging and illuminates a path towards making such downstream evaluations predictable with scale. Using five model families and twelve well-established multiple-choice benchmarks, we show that downstream performance is computed from negative log likelihoods via a sequence of transformations that progressively degrades the statistical relationship between performance and scale. We then reveal the mechanism causing this degradation: downstream metrics require comparing the correct choice against a small number of specific incorrect choices, meaning accurately predicting downstream capabilities requires predicting not just how probability mass concentrates on the correct choice with scale, but also how probability mass fluctuates on specific incorrect choices with scale. We empirically study how probability mass on the correct choice co-varies with probability mass on incorrect choices with increasing compute, suggesting that scaling laws for \textit{incorrect} choices might be achievable. Our work also explains why pretraining scaling laws are commonly regarded as more predictable than downstream capabilities and contributes towards establishing scaling-predictable evaluations of frontier AI models.
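The mechanism the abstract describes can be illustrated with a minimal sketch (our own illustration, not the paper's code): multiple-choice accuracy requires the correct choice's likelihood to beat every specific incorrect choice, which is why fluctuations of probability mass on incorrect choices matter for predictability.

```python
import numpy as np

def choice_accuracy(nll_correct, nll_incorrect):
    """Benchmark accuracy from per-choice negative log likelihoods.

    nll_correct: shape (n_questions,), NLL of the correct choice.
    nll_incorrect: shape (n_questions, k), NLLs of k incorrect choices.
    A question counts as correct only if the correct choice has lower
    NLL than *every* specific incorrect choice, so accuracy depends on
    the incorrect-choice NLLs, not just the correct one.
    """
    return np.mean(np.all(nll_correct[:, None] < nll_incorrect, axis=1))

# Example: three questions, two distractors each.
nll_c = np.array([1.0, 2.0, 0.5])
nll_i = np.array([[2.0, 3.0], [1.0, 3.0], [0.6, 0.7]])
print(choice_accuracy(nll_c, nll_i))  # question 2 fails: a distractor is more likely
```

This makes concrete why per-question NLL can improve smoothly with scale while accuracy, a hard comparison against distractors, changes less predictably.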
Paperid:2406
Authors:Andrei Panferov · Jiale Chen · Soroush Tabesh · Mahdi Nikdan · Dan Alistarh
Abstract:
One main approach to reducing the massive costs of large language models (LLMs) is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, the question of obtaining even more accurate compressed models by directly training over such representations, i.e. Quantization-Aware Training (QAT), is still open: a recent study put the "optimal" bit-width at which models can be trained using QAT, while staying accuracy-competitive with standard FP16/BF16 precision, at 8-bit weights and activations. We significantly advance this state of the art via a new method called QuEST, which is Pareto-competitive with FP16, that is, it provides better accuracy at lower model size, while training models with weights and activations in 4 bits or fewer. Moreover, QuEST allows stable training with 1-bit weights and activations. QuEST achieves this by improving two key aspects of QAT methods: (1) accurate and fast quantization of the (continuous) distributions of weights and activations, via Hadamard normalization and MSE-optimal fitting; (2) a new trust gradient estimator based on the idea of explicitly minimizing the error between the noisy gradient computed over quantized states and the "true" (but unknown) full-precision gradient. Experiments on Llama-type architectures show that QuEST induces new, stable scaling laws across the entire range of hardware-supported precisions, and can be extended to sparse representations as well. Moreover, we provide GPU kernel support showing that the models produced by QuEST can be efficiently executed on current hardware.
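To give a feel for the MSE-optimal fitting step, here is a toy sketch of a symmetric uniform quantizer that searches for the scale minimizing reconstruction MSE. This is our own illustration of the general idea, not QuEST's implementation: the paper additionally applies Hadamard normalization before quantizing, which is omitted here.

```python
import numpy as np

def quantize_mse_optimal(x, bits=4, n_grid=100):
    """Symmetric uniform quantizer with a brute-force MSE-optimal scale.

    Instead of always scaling by max|x| (which wastes levels on outliers),
    we search a grid of candidate scales and keep the one with the lowest
    reconstruction mean-squared error.
    """
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit signed
    best_scale, best_err = None, np.inf
    for frac in np.linspace(0.1, 1.0, n_grid):
        scale = frac * np.max(np.abs(x)) / qmax
        q = np.clip(np.round(x / scale), -qmax - 1, qmax)
        err = np.mean((q * scale - x) ** 2)
        if err < best_err:
            best_err, best_scale = err, scale
    q = np.clip(np.round(x / best_scale), -qmax - 1, qmax)
    return q * best_scale, best_err

rng = np.random.default_rng(0)
x = rng.normal(size=256)
x_hat, err = quantize_mse_optimal(x, bits=4)
```

Because the grid includes the naive max-scale choice, the returned error is never worse than plain max-based quantization; for bell-shaped distributions it is typically strictly better.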
Paperid:2407
Authors:Armin Behnamnia · Gholamali Aminian · Alireza Aghaei · Chengchun Shi · Vincent Tan · Hamid R Rabiee
Abstract:
Off-policy learning and evaluation scenarios leverage logged bandit feedback datasets, which contain context, action, propensity score, and feedback for each data point. These scenarios face significant challenges due to high variance and poor performance with low-quality propensity scores and heavy-tailed reward distributions. We address these issues by introducing a novel estimator based on the log-sum-exponential (LSE) operator, which outperforms traditional inverse propensity score estimators. Our LSE estimator demonstrates variance reduction and robustness under heavy-tailed conditions. For off-policy evaluation, we derive upper bounds on the estimator's bias and variance. In the off-policy learning scenario, we establish bounds on the regret—the performance gap between our LSE estimator and the optimal policy—assuming a bounded $(1+\epsilon)$-th moment of the weighted reward. Notably, we achieve a convergence rate of $O(n^{-\epsilon/(1+\epsilon)})$ for the regret bounds, where $n$ is the number of training samples and $\epsilon\in[0,1]$. The theoretical analysis is complemented by comprehensive empirical evaluations in both off-policy learning and evaluation scenarios, confirming the practical advantages of our approach.
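As a rough sketch of the idea (the paper's exact parameterization may differ), an LSE-style estimator replaces the plain mean of importance-weighted rewards with a log-sum-exponential that, for a negative temperature, acts as a soft minimum and tempers heavy-tailed terms:

```python
import numpy as np

def ips_estimate(rewards, propensities, target_probs):
    # Standard inverse propensity score (IPS) estimator: mean of
    # importance-weighted rewards; unbiased but high-variance.
    w = target_probs / propensities
    return np.mean(w * rewards)

def lse_estimate(rewards, propensities, target_probs, lam=-1.0):
    # Hypothetical LSE-operator estimator: (1/lam) * log mean exp(lam * w * r).
    # With lam < 0 this is a soft-min, so occasional huge weighted rewards
    # (heavy tails) are damped rather than dominating the estimate.
    w = target_probs / propensities
    return np.log(np.mean(np.exp(lam * w * rewards))) / lam
```

By Jensen's inequality, for `lam < 0` and nonnegative rewards the LSE value never exceeds the IPS mean, which is the sense in which it is a conservative, variance-reducing surrogate.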
Paperid:2408
Authors:Shi Liu · Weijie Su · Xizhou Zhu · Wenhai Wang · Jifeng Dai
Abstract:
Recent advancements in Large Vision-Language Models (LVLMs) built upon Large Language Models have established aligning visual features with LLM representations as the dominant paradigm. However, inherited LLM architectural designs introduce suboptimal characteristics for multimodal processing. First, LVLMs exhibit a bimodal distribution in attention allocation, leading to the progressive neglect of central visual content as context expands. Second, conventional positional encoding schemes fail to preserve vital 2D structural relationships when processing dynamic high-resolution images. To address these limitations, we present CoMemo - a dual-path architecture featuring a decoupled memory path for context-independent visual processing, effectively alleviating visual information neglect. Additionally, we introduce RoPE-2D, a novel positional encoding mechanism that employs thumbnail-based positional aggregation to maintain 2D spatial awareness while mitigating remote decay in extended sequences. Evaluations across seven benchmarks, including long-context comprehension, multi-image reasoning, and visual question answering, demonstrate CoMemo's superior performance compared to conventional LVLM architectures.
Paperid:2409
Authors:Mingxuan Li · Junzhe Zhang · Elias Bareinboim
Abstract:
Reward shaping has been demonstrated to be an effective technique for accelerating the learning process of reinforcement learning (RL) agents. While successful in empirical applications, the design of a good shaping function is less well understood in principle and thus often relies on domain expertise and manual design. To overcome this limitation, we propose a novel automated approach for designing reward functions from offline data, possibly contaminated with unobserved confounding bias. We propose to use causal state value upper bounds calculated from offline datasets as a conservative optimistic estimation of the optimal state value, which is then used as state potentials in Potential-Based Reward Shaping (PBRS). When applying our shaping function to a model-free learner based on UCB principles, we show that it enjoys a better gap-dependent regret bound than the learner without shaping. To the best of our knowledge, this is the first gap-dependent regret bound for PBRS in model-free learning with online exploration. Simulations support the theoretical findings.
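PBRS itself has a standard closed form (Ng et al., 1999), which the abstract instantiates with causal state-value upper bounds as the potential. A generic sketch, where `phi` stands in for any potential function (how the paper computes it from offline data is not reproduced here):

```python
def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """Potential-based reward shaping:
        r'(s, a, s') = r + gamma * Phi(s') - Phi(s).
    This form is known to preserve the optimal policy; the choice of the
    potential Phi (here an arbitrary callable) only affects learning speed.
    """
    return r + gamma * phi(s_next) - phi(s)

# Toy usage: a two-state chain where state 1 has higher potential.
phi = lambda s: {0: 0.0, 1: 10.0}[s]
print(shaped_reward(1.0, 0, 1, phi, gamma=0.9))  # transition toward high potential is boosted
```

With a constant potential the shaping term telescopes away, which is why only relative differences in `phi` matter.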
Paperid:2410
Authors:Zijie Qiu · Jiaqi Wei · Xiang Zhang · Sheng Xu · Kai Zou · Zhi Jin · ZhiQiang Gao · Nanqing Dong · Siqi Sun
Abstract:
De novo peptide sequencing is a critical task in proteomics research. However, the performance of current deep learning-based methods is limited by the inherent complexity of mass spectrometry data and the heterogeneous distribution of noise signals, leading to data-specific biases. We present RankNovo, the first deep reranking framework that enhances de novo peptide sequencing by leveraging the complementary strengths of multiple sequencing models. RankNovo employs a list-wise reranking approach, modeling candidate peptides as multiple sequence alignments and utilizing axial attention to extract informative features across candidates. Additionally, we introduce two new metrics, PMD (Peptide Mass Deviation) and RMD (Residual Mass Deviation), which offer delicate supervision by quantifying mass differences between peptides at both the sequence and residue levels. Extensive experiments demonstrate that RankNovo not only surpasses its individual base models, which are used to generate training candidates for reranking pre-training, but also sets new state-of-the-art results on de novo sequencing benchmarks. Moreover, RankNovo exhibits strong zero-shot generalization to unseen models—those whose generations were not exposed during training—highlighting its robustness and potential as a universal reranking framework for peptide sequencing. Our work presents a novel reranking strategy that fundamentally challenges existing single-model paradigms and advances the frontier of accurate de novo peptide sequencing. Our source code is provided at an anonymous link.
Paperid:2411
Authors:Matthew Niedoba · Berend Zwartsenberg · Kevin Murphy · Frank Wood
Abstract:
We propose a simple, training-free mechanism which explains the generalization behaviour of diffusion models. By comparing pre-trained diffusion models to their theoretically optimal empirical counterparts, we identify a shared local inductive bias across a variety of network architectures. From this observation, we hypothesize that network denoisers generalize through localized denoising operations, as these operations approximate the training objective well over much of the training distribution. To validate our hypothesis, we introduce novel denoising algorithms which aggregate local empirical denoisers to replicate network behaviour. Comparing these algorithms to network denoisers across forward and reverse diffusion processes, our approach exhibits consistent visual similarity to neural network outputs, with lower mean squared error than previously proposed methods.
Paperid:2412
Authors:Junkang Liu · Yuanyuan Liu · Fanhua Shang · Hongying Liu · Jin Liu · Wei Feng
Abstract:
The generalization capability of federated learning (FL) algorithms such as FedSAM is crucial for real-world applications. In this paper, we revisit the generalization problem in FL and investigate the impact of data heterogeneity on FL generalization. We find that FedSAM usually performs worse than FedAvg in the case of highly heterogeneous data, and we thus propose a novel and effective federated learning algorithm with Stochastic Weight Averaging (called \texttt{FedSWA}), which aims to find flatter minima in the setting of highly heterogeneous data. Moreover, we introduce a new momentum-based stochastic controlled weight averaging FL algorithm (\texttt{FedMoSWA}), which is designed to better align local and global models. Theoretically, we provide both convergence analysis and generalization bounds for \texttt{FedSWA} and \texttt{FedMoSWA}. We also prove that the optimization and generalization errors of \texttt{FedMoSWA} are smaller than those of its counterparts, including FedSAM and its variants. Empirically, experimental results on CIFAR10/100 and Tiny ImageNet demonstrate the superiority of the proposed algorithms compared to their counterparts.
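Stochastic Weight Averaging maintains a running average of model weights visited along training; a minimal sketch of the classic SWA update rule follows (how FedSWA integrates this into the federated rounds is the paper's contribution and is not reproduced here):

```python
def swa_update(swa_weights, new_weights, n_averaged):
    """Classic SWA running-average update over flat lists of parameters:
        w_swa <- (w_swa * n + w_new) / (n + 1)
    Averaging weights from different points of training tends to land in
    flatter regions of the loss surface than any single iterate.
    """
    return [(sw * n_averaged + w) / (n_averaged + 1)
            for sw, w in zip(swa_weights, new_weights)]

# Toy usage with scalar "models": average of 0.0 and 2.0 is 1.0.
w = swa_update([0.0], [2.0], n_averaged=1)
```

The same update works unchanged on tensors; the flat-minima intuition is why averaging is attractive under high data heterogeneity, where local minima of clients disagree.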
Paperid:2413
Authors:Peter Holderrieth · Michael Albergo · Tommi Jaakkola
Abstract:
We propose an algorithm to sample from discrete distributions known up to normalization by learning the rate matrix of a continuous-time Markov chain (CTMC). The method can be seen as a continuous-time formulation of annealed importance sampling and sequential Monte Carlo methods, extended so that the variance of the importance weights is offset by the inclusion of the CTMC. To derive these importance weights, we introduce a set of Radon-Nikodym derivatives of CTMCs over their path measures. Because the computation of these weights is intractable with standard neural network parameterizations of rate matrices, we devise a new compact representation for rate matrices via what we call locally equivariant functions. To parameterize them, we introduce a family of locally equivariant multilayer perceptrons, attention, and convolutional networks, and provide an approach to make deep networks that preserve the local equivariance. This property allows us to propose a scalable training algorithm for the rate matrix such that the variance of the importance weights associated with the CTMC is minimal. We demonstrate the efficacy of our method on problems in statistical physics.
Paperid:2414
Authors:Linjian Meng · Youzhi Zhang · Wubing Chen · Wenbin Li · Tianpei Yang · Yang Gao
Abstract:
Nash equilibrium (NE) plays an important role in game theory. Efficiently computing an NE in normal-form games (NFGs) is challenging due to the problem's complexity and non-convexity. Machine Learning (ML), the cornerstone of modern artificial intelligence, has demonstrated remarkable empirical performance across various applications, including non-convex optimization. To leverage non-convex stochastic optimization techniques from ML for approximating an NE, various loss functions have been proposed. Among these, only one loss function is unbiased, allowing for unbiased estimation under sampled play. Unfortunately, this loss function suffers from high variance, which degrades the convergence rate. To improve the convergence rate by mitigating the high variance associated with the existing unbiased loss function, we propose a novel surrogate loss function named Nash Advantage Loss (NAL). NAL is theoretically proven to be unbiased and exhibits significantly lower variance than the existing unbiased loss function. Experimental results demonstrate that the algorithm minimizing NAL achieves a significantly faster empirical convergence rate than other algorithms, while also reducing the variance of the estimated loss value by several orders of magnitude.
Paperid:2415
Authors:Srijith Nair · Michael Lin · Peizhong Ju · Amirreza Talebi · Elizabeth Bentley · Jia (Kevin) Liu
Abstract:
Collaborative training methods like Federated Learning (FL) and Split Learning (SL) enable distributed machine learning without sharing raw data. However, FL assumes clients can train entire models, which is infeasible for large-scale models. In contrast, while SL alleviates the client memory constraint in FL by offloading most training to the server, it increases network latency due to its sequential nature. Other methods address the conundrum by using local loss functions for parallel client-side training to improve efficiency, but they lack server feedback and potentially suffer poor accuracy. We propose FSL-SAGE (Federated Split Learning via Smashed Activation Gradient Estimation), a new federated split learning algorithm that estimates server-side gradient feedback via auxiliary models. These auxiliary models periodically adapt to emulate server behavior on local datasets. We show that FSL-SAGE achieves a convergence rate of $\mathcal{O}(1/\sqrt{T})$, where $T$ is the number of communication rounds. This result matches FedAvg, while significantly reducing communication costs and client memory requirements. Our empirical results also verify that it outperforms existing state-of-the-art FSL methods, offering both communication efficiency and accuracy.
Paperid:2416
Authors:Xianhang Li · Haoqin Tu · Mude Hui · Zeyu Wang · Bingchen Zhao · Junfei Xiao · Sucheng Ren · Jieru Mei · Qing Liu · Huangjie Zheng · Yuyin Zhou · Cihang Xie
Abstract:
Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching the textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this gap with a community effort, leveraging the powerful and $\textit{open-sourced}$ LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption ~1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe an average of 3.1% enhanced zero-shot performance across four cross-modal retrieval tasks using a mixed set of the original and our captions. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users' text instructions, especially in following complex queries.
Paperid:2417
Authors:Zebin Wang · Menghan Lin · Bolin Shen · Ken Anderson · Molei Liu · Tianxi Cai · Yushun Dong
Abstract:
Graph Neural Networks (GNNs) have demonstrated remarkable utility across diverse applications, and their growing complexity has made Machine Learning as a Service (MLaaS) an essential tool for scalable GNN deployment. However, MLaaS exposes GNNs to severe security threats, particularly model extraction attacks (MEAs), where adversaries strategically query a target model to construct a high-fidelity replica. To better understand the severity of GNN-based MEAs and explore potential countermeasures, we design a node querying strategy for a highly practical yet underexplored scenario, where bulk query batches are prohibited, and the limited availability of initialization nodes prevents direct model extraction. Our approach incrementally refines the selection of nodes over multiple learning cycles, leveraging historical information to improve extraction efficiency. Extensive experiments on benchmark graph datasets demonstrate that our approach achieves superior accuracy, fidelity, and F1 score under strict query-size constraints, outperforming baselines in comparable settings. These results highlight the vulnerability of real-world GNNs to MEAs despite stringent resource constraints, emphasizing the urgent need for robust defense mechanisms against such threats. Our implementation is available at \url{https://anonymous.4open.science/r/MEA-CEGA/README.md}.
Paperid:2418
Authors:Joseph Paillard · Angel REYERO LOBO · Vitaliy Kolodyazhniy · Thirion Bertrand · Denis-Alexander Engemann
Abstract:
Causal machine learning (ML) promises to provide powerful tools for estimating individual treatment effects. While causal methods have placed some emphasis on heterogeneity in treatment response, it is of paramount importance to clarify the nature of this heterogeneity by highlighting which variables drive it. We propose PermuCATE, an algorithm based on the Conditional Permutation Importance (CPI) method, for statistically rigorous global variable importance assessment in the estimation of the Conditional Average Treatment Effect (CATE). Theoretical analysis of the finite sample regime and empirical studies show that PermuCATE has lower variance than the Leave-One-Covariate-Out (LOCO) method and provides a reliable measure of variable importance. This property increases statistical power, which is crucial for causal inference applications with finite sample sizes. We empirically demonstrate the benefits of PermuCATE in simulated and real datasets, including complex settings with high-dimensional, correlated variables.
Paperid:2419
Authors:Tan Songbai · Yao Shu · Xuerui Qiu · Gang Xu · Linrui Xu · Xiangyu Xu · HUIPING ZHUANG · Ming Li · Fei Yu
Abstract:
Invisible watermarking is widely used to protect digital images from unauthorized use. Accurate assessment of watermarking efficacy is crucial for advancing algorithmic development. However, existing statistical metrics, such as PSNR, rely on access to original images, which are often unavailable in text-driven generative watermarking, and fail to capture critical aspects of watermarking, particularly visibility. More importantly, these metrics fail to account for potential corruption of image content. To address these limitations, we propose WMarkGPT, the first multimodal large language model (MLLM) specifically designed for comprehensive watermarked image understanding without accessing original images. WMarkGPT not only predicts watermark visibility but also generates detailed textual descriptions of the watermark's location, content, and impact on image semantics, enabling a more nuanced interpretation of watermarked images. To tackle the challenges of precise location description and of understanding images with vastly different content, we construct three visual question-answering (VQA) datasets: an object location-aware dataset, a synthetic watermarking dataset, and a real watermarking dataset. We introduce a meticulously designed three-stage learning pipeline to progressively equip WMarkGPT with the necessary abilities. Extensive experiments on synthetic and real watermarking QA datasets demonstrate that WMarkGPT outperforms existing MLLMs, achieving significant improvements in visibility prediction and content description. The datasets and code will be made publicly available. More details are included in the appendix.
Paperid:2420
Authors:Wei Fan · Kejiang Chen · Chang Liu · Weiming Zhang · Nenghai Yu
Abstract:
The rapid advancement of speech generation models has heightened privacy and security concerns related to voice cloning (VC). Recent studies have investigated disrupting unauthorized voice cloning by introducing adversarial perturbations. However, determined attackers can mitigate these protective perturbations and successfully execute VC. In this study, we conduct the first systematic evaluation of these protective perturbations against VC under realistic threat models that include perturbation purification. Our findings reveal that while existing purification methods can neutralize a considerable portion of the protective perturbations, they still lead to distortions in the feature space of VC models, which degrades the performance of VC. From this perspective, we propose a novel two-stage purification method: (1) Purify the perturbed speech; (2) Refine it using phoneme guidance to align it with the clean speech distribution. Experimental results demonstrate that our method outperforms state-of-the-art purification methods in disrupting VC defenses. Our study reveals the limitations of adversarial perturbation-based VC defenses and underscores the urgent need for more robust solutions to mitigate the security and privacy risks posed by VC.
Paperid:2421
Authors:Vaishnavh Nagarajan · Chen Wu · Charles Ding · Aditi Raghunathan
Abstract:
In open-ended tasks --- such as designing word problems or discovering novel proofs --- the goal is not only correctness but also diversity and originality of thought. Often, this requires a far-sighted leap of thought that we argue is misaligned with the objective of next-token prediction (NTP). To formulate this intuition, we design a suite of minimal algorithmic tasks loosely based on real-world creative endeavors. Concretely, our tasks require an open-ended stochastic planning step that (a) discovers new connections in a knowledge graph (inspired by word-play, humor or drawing analogies) or (b) constructs new patterns (inspired by constructing word problems, puzzles or mysteries). We then conceptually and empirically argue how NTP leads to myopic learning and excessive memorization, limiting its ability to generate novel solutions. In contrast, we find that multi-token approaches, namely teacherless training and diffusion models, can overcome these limitations and comparatively excel on our algorithmic test-bed. Orthogonally, we find that creativity in our tasks is greatly improved by training with a random hash prefix (which we dub as "hash-conditioning"). Thus our work offers a principled, minimal test-bed for studying open-ended forms of intelligence and also a new angle to take a more serious interest in the paradigm of multi-token prediction.
Paperid:2422
Authors:Zhengrui Ma · Yang Feng · Min zhang
Abstract:
Streaming generation models are increasingly utilized across fields, with the Transducer architecture being very popular in industrial applications. However, its input-synchronous decoding mechanism presents challenges in tasks requiring non-monotonic alignments, such as simultaneous translation. In this research, we address this issue by tightly integrating the Transducer's decoding with the history of the input stream via a learnable monotonic attention mechanism. Our approach leverages the forward-backward algorithm to infer the posterior probability of alignments between the predictor states and input timestamps, which is then used to estimate the context representations of monotonic attention during training. This allows Transducers to adaptively adjust the scope of attention based on their predictions, avoiding the need to enumerate the exponentially large alignment space. Extensive experiments demonstrate that our MonoAttn-Transducer significantly enhances the handling of non-monotonic alignments in streaming generation, offering a robust solution to tackle complex streaming generation tasks.
Paperid:2423
Authors:Fanfei Li · Thomas Klein · Wieland Brendel · Robert Geirhos · Roland S. Zimmermann
Abstract:
Out-of-distribution (OOD) robustness is a desired property of computer vision models. Improving model robustness requires high-quality signals from robustness benchmarks to quantify progress. While various benchmark datasets such as ImageNet-C were proposed in the ImageNet era, most ImageNet-C corruption types are no longer OOD relative to today's large datasets scraped from the web, which already contain common corruptions such as blur or JPEG compression artifacts. Consequently, these standard benchmarks are no longer well-suited for evaluating OOD robustness in the era of web-scale datasets. Indeed, recent models show saturating scores on ImageNet-era OOD benchmarks, indicating that it is unclear whether models trained on web-scale datasets truly become better at OOD generalization or whether they have simply been exposed to the test distortions during training. To address this, we here introduce LAION-C as a benchmark alternative for ImageNet-C. LAION-C consists of six novel distortion types specifically designed to be OOD, even for web-scale datasets such as LAION. In a comprehensive evaluation of state-of-the-art models, we find that the LAION-C dataset poses significant challenges to contemporary models, including MLLMs such as Gemini and GPT-4o. We additionally conducted a psychophysical experiment to evaluate the difficulty of our proposed corruptions for human observers, enabling a comparison of models to lab-quality human robustness data. We observe a paradigm shift in OOD generalization: from humans outperforming models, to the best models now matching or outperforming the best human observers.
Paperid:2424
Authors:Qin Guo · Ailing Zeng · Dongxu Yue · Ceyuan Yang · Yang Cao · Hanzhong Guo · SHEN FEI · Wei Liu · Xihui Liu · Dan Xu
Abstract:
Although significant advancements have been achieved in keypoint-guided Text-to-Image diffusion models, existing mainstream keypoint-guided models encounter challenges in controlling the generation of more general non-rigid objects beyond humans (e.g., animals). Moreover, it is difficult to generate multiple overlapping humans and animals based solely on keypoint controls. These challenges arise from two main aspects: the inherent limitations of existing controllable methods and the lack of suitable datasets. First, we design a DiT-based framework, named UniMC, to explore unifying controllable multi-class image generation. UniMC integrates instance- and keypoint-level conditions into compact tokens, incorporating attributes such as class, bounding box, and keypoint coordinates. This approach overcomes the limitations of previous methods that struggled to distinguish instances and classes due to their reliance on skeleton images as conditions. Second, we propose HAIG-2.9M, a large-scale, high-quality, and diverse dataset designed for keypoint-guided human and animal image generation. HAIG-2.9M includes 786K images with 2.9M instances. This dataset features extensive annotations such as keypoints, bounding boxes, and fine-grained captions for both humans and animals, along with rigorous manual inspection to ensure annotation accuracy. Extensive experiments demonstrate the high quality of HAIG-2.9M and the effectiveness of UniMC, particularly in heavy occlusions and multi-class scenarios.
Paperid:2425
Authors:Yinpeng Chen · Dongdong Chen · Xiyang Dai · Mengchen Liu · Yinan Feng · Youzuo Lin · Lu Yuan · Zicheng Liu
Abstract:
In this paper, we empirically reveal an invariance over images: images share a set of one-way wave equations with latent speeds. Each image is uniquely associated with a solution to these wave equations, allowing for its reconstruction with high fidelity from an initial condition. We demonstrate this using an intuitive encoder-decoder framework where each image is encoded into its corresponding initial condition (a single vector). Subsequently, the initial condition undergoes a specialized decoder, transforming the one-way wave equations into a first-order norm+linear autoregressive process. This process propagates the initial condition along the x and y directions, generating a high-resolution feature map (up to the image resolution), followed by a few convolutional layers to reconstruct image pixels. The revealed invariance, rooted in the shared wave equations, offers a fresh perspective for comprehending images, establishing a promising avenue for further exploration.
Paperid:2426
Authors:Lang Pu · Jingjing Gu · Chao Lin · Xinyi Huang
Abstract:
Secure Aggregation (SA) is a cornerstone of Federated Learning (FL), ensuring that user updates remain hidden from servers. Although advancements such as the Flamingo protocol (S\&P'23) improve multi-round aggregation and communication efficiency, several key challenges remain: scalability issues with dynamic user participation, vulnerability to Model Inconsistency Attacks (MIA), and a lack of verifiability for server-side aggregation results. To address these issues, we propose $\textit{Janus}$, a generic SA scheme based on our dual-server architecture. Janus ensures security against up to $n-2$ colluding clients (where $n$ is the total client count), which prevents privacy breaches for non-colluders. Additionally, Janus is model-independent, ensuring applicability across any FL model without specific adaptations. It simplifies user participation with minimal setup, enhances privacy by mitigating MIA through task separation between two servers, and improves verifiability with a new cryptographic primitive, Separable Homomorphic Commitment, which allows clients to efficiently validate aggregation correctness. Extensive experiments show that Janus not only significantly enhances security but also reduces per-client communication and computation overhead from logarithmic to constant scale, with negligible impact on model performance.
Paperid:2427
Authors:Huafeng Liu · Yiran Fu · Hui Li · Shuyang Lin · Jingyue Shi · Deqiang Ouyang · Liping Jing · Jian Yu
Abstract:
Neural processes (NPs) are a promising paradigm for enabling skill transfer across tasks with the aid of the distribution of functions. Previous NPs employ the empirical risk minimization principle in optimization. However, the fast adaptation ability can vary widely across tasks, and the worst-case fast adaptation can be catastrophic in risk-sensitive tasks. To achieve robust neural process modeling, we consider the problem of training models in a risk-averse manner, which can control the worst fast adaptation cases at a certain probabilistic level. By transforming the risk minimization problem into a two-level finite-sum minimax optimization problem, we can solve it via a double-looped stochastic mirror prox algorithm with a task-aware variance reduction mechanism that samples across all tasks. The mirror prox technique ensures better handling of complex constraint sets and non-Euclidean geometries, making the optimization adaptable to various tasks. The final solution, obtained by aggregating prox points with adaptive learning rates, enables a stable and high-quality output. The proposed learning strategy can work flexibly with various NPs and achieves a less biased approximation with a theoretical guarantee. To illustrate the superiority of the proposed model, we perform experiments on both synthetic and real-world data, and the results demonstrate that our approach not only helps to achieve more accurate performance but also improves model robustness.
Paperid:2428
Authors:Wanjin Feng · Xingyu Gao · Wenqian Du · Hailong Shi · Peilin Zhao · Pengcheng Wu · Chunyan Miao
Abstract:
Spiking Neural Networks (SNNs) often suffer from high time complexity $O(T)$ due to the sequential processing of $T$ spikes, making training computationally expensive. In this paper, we propose a novel Fixed-point Parallel Training (FPT) method to accelerate SNN training without modifying the network architecture or introducing additional assumptions. FPT reduces the time complexity to $O(K)$, where $K$ is a small constant (usually $K=3$), by using a fixed-point iteration form of Leaky Integrate-and-Fire (LIF) neurons for all $T$ timesteps. We provide a theoretical convergence analysis of FPT and demonstrate that existing parallel spiking neurons can be viewed as special cases of our approach. Experimental results show that FPT effectively simulates the dynamics of original LIF neurons, significantly reducing computational time without sacrificing accuracy. This makes FPT a scalable and efficient solution for real-world applications, particularly for long-duration simulations.
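The contrast between sequential and fixed-point simulation can be sketched in a few lines of numpy. This is an illustrative toy, not the authors' FPT implementation: the hard-reset LIF form, the constants, and the iteration count are assumptions.

```python
import numpy as np

def lif_sequential(I, lam=0.9, theta=1.0):
    """Standard sequential LIF simulation: T dependent steps, O(T)."""
    T = len(I)
    s = np.zeros(T)
    v_prev = 0.0
    for t in range(T):
        v = lam * v_prev + I[t]          # leak + input current
        s[t] = float(v >= theta)         # spike if threshold is crossed
        v_prev = v * (1.0 - s[t])        # hard reset after a spike
    return s

def lif_fixed_point(I, lam=0.9, theta=1.0, K=5):
    """Fixed-point form: update the whole T-step trajectory in parallel,
    iterating K times until it is self-consistent."""
    T = len(I)
    v = np.zeros(T)
    for _ in range(K):
        s = (v >= theta).astype(float)
        prev = np.concatenate(([0.0], (v * (1.0 - s))[:-1]))  # shifted, reset potentials
        v = lam * prev + I               # one parallel update over all timesteps
    return (v >= theta).astype(float)

I = np.array([0.5, 0.7, 0.9, 0.2, 1.2, 0.1])
s_seq = lif_sequential(I)
s_par = lif_fixed_point(I, K=6)
```

In this naive toy the iteration needs more than $K=3$ passes to become self-consistent; the paper's specific formulation and convergence analysis determine the appropriate $K$ in practice.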
Paperid:2429
Authors:Lufei Liu · Tor Aamodt
Abstract:
Graphics rendering applications increasingly leverage neural networks to improve frame rates and the quality of rendered images within real-time latency constraints to support interactivity. Per-frame tasks such as denoising, supersampling, and frame extrapolation introduce a computational burden, creating resource contention in these latency-sensitive workloads. However, the temporal coherence inherent in these workloads presents an opportunity to reduce the overall cost of rendering operations by reusing computation results from previous frames. Recent work has shown that caching intermediate features for reuse in subsequent inferences is an effective method to reduce latency in diffusion models. We extend this idea and explore different caching policies to optimize trade-offs between quality and performance for rendering workloads. By reducing the latency of the neural network inferences, we can allocate more resources to high-quality rendering, such as ray tracing, improving both the frame rate and final image quality. Our method can be applied to a variety of encoder-decoder style networks commonly found in rendering pipelines. Experimental results show that we can achieve a 1.4$\times$ speedup on average with negligible quality loss in three real-time rendering tasks. We outperform other methods exploiting frame similarity in these tasks and can further enhance their performance when combined.
Paperid:2430
Authors:Narmeen Oozeer · Dhruv Nathawani · Nirmalendu Prakash · Michael Lan · Abir HARRASSE · Amir Abdullah
Abstract:
The study of representation universality in AI models reveals growing convergence across domains, modalities, and architectures. However, this representation universality remains largely unexploited. We bridge this gap by demonstrating that safety interventions can be transferred between models through learned mappings of their shared activation spaces. We demonstrate this approach on two well-established AI safety tasks: backdoor removal and refusal of harmful prompts, showing successful transfer of steering vectors that alter the models' outputs in a predictable way. Additionally, we propose a new task, corrupted capabilities, where models are fine-tuned to embed knowledge tied to a backdoor. This tests their ability to separate useful skills from backdoors, reflecting real-world challenges. Extensive experiments across the LLaMA, Qwen, and Gemma model families show that our method enables using smaller models to efficiently align larger ones. Furthermore, we demonstrate that autoencoder mappings between base and fine-tuned models can serve as reliable ``lightweight safety switches'', allowing dynamic toggling between model behaviors.
Paperid:2431
Authors:Enming Liang · Minghua Chen
Abstract:
Neural networks (NNs) have emerged as promising tools for solving constrained optimization problems in real time. However, ensuring constraint satisfaction for NN-generated solutions remains challenging due to prediction errors. Existing methods to ensure NN feasibility either suffer from high computational complexity or are limited to specific constraint types. We present Bisection Projection, an efficient approach to ensure NN solution feasibility for optimization over general compact sets with non-empty interiors. Our method comprises two key components: (i) a dedicated NN (called IPNN) that predicts interior points (IPs) with low eccentricity, which naturally accounts for approximation errors; (ii) a bisection algorithm that leverages these IPs to recover solution feasibility when initial NN solutions violate constraints. We establish theoretical guarantees by providing sufficient conditions for IPNN feasibility and proving bounded optimality loss of the bisection operation under IP predictions. Extensive evaluations on real-world non-convex problems demonstrate that Bisection Projection achieves superior feasibility and computational efficiency compared to existing methods, while maintaining comparable optimality gaps.
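The bisection component admits a simple sketch: given an interior point and an infeasible NN prediction, bisect along the segment between them to recover a feasible point close to the prediction. The following toy (a unit-ball constraint set, an oracle feasibility test, and all names are illustrative assumptions, not the paper's IPNN pipeline) shows the idea:

```python
import numpy as np

def bisection_projection(x_pred, z_interior, feasible, tol=1e-8, max_iter=60):
    """Recover feasibility by bisecting on alpha along the segment
    (1 - alpha) * z_interior + alpha * x_pred."""
    if feasible(x_pred):
        return x_pred                     # prediction already satisfies the constraints
    lo, hi = 0.0, 1.0                     # alpha=0: interior (feasible); alpha=1: infeasible
    while hi - lo > tol and max_iter > 0:
        mid = 0.5 * (lo + hi)
        point = (1.0 - mid) * z_interior + mid * x_pred
        if feasible(point):
            lo = mid                      # move toward the prediction
        else:
            hi = mid                      # pull back toward the interior point
        max_iter -= 1
    return (1.0 - lo) * z_interior + lo * x_pred

# Toy constraint set: the unit ball, with the origin as interior point.
feasible = lambda p: np.linalg.norm(p) <= 1.0
x = bisection_projection(np.array([2.0, 0.0]), np.zeros(2), feasible)
```

Here the infeasible prediction `[2, 0]` is pulled back to (approximately) the boundary point `[1, 0]`, the feasible point on the segment closest to the prediction.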
Paperid:2432
Authors:Alex Velez-Arce · Marinka Zitnik
Abstract:
Existing biomedical benchmarks do not provide end-to-end infrastructure for training, evaluation, and inference of models that integrate multimodal biological data and a broad range of machine learning tasks in therapeutics. We present PyTDC, an open-source machine-learning platform providing streamlined training, evaluation, and inference software for multimodal biological AI models. PyTDC unifies distributed, heterogeneous, continuously updated data sources and model weights, and standardizes benchmarking and inference endpoints. This paper discusses the components of PyTDC's architecture and presents, to our knowledge, the first-of-its-kind case study on the introduced single-cell drug-target nomination ML task. We find that state-of-the-art methods in graph representation learning and domain-specific methods from graph theory perform poorly on this task. Although we identify a context-aware geometric deep learning method that outperforms the evaluated SoTA and domain-specific baseline methods, the model is unable to generalize to unseen cell types or incorporate additional modalities, highlighting PyTDC's capacity to facilitate an exciting avenue of research: developing multimodal, context-aware foundation models for open problems in biomedical AI.
Paperid:2433
Authors:Qinsi Wang · Hancheng Ye · Ming-Yu Chung · Yudong Liu · Yueqian Lin · Martin Kuo · Mingyuan Ma · Jianyi Zhang · Yiran Chen
Abstract:
Vision-Language Models (VLMs) excel across diverse tasks but suffer from high inference costs in time and memory. Token sparsity mitigates inefficiencies in token usage, while neuron sparsity reduces high-dimensional computations, both offering promising solutions to enhance efficiency. Recently, these two sparsity paradigms have evolved largely in parallel, fostering the prevailing assumption that they function independently. However, a fundamental yet underexplored question remains: do they truly operate in isolation, or is there a deeper underlying interplay that has yet to be uncovered? In this paper, we conduct the first comprehensive investigation into this question. By introducing and analyzing the matching mechanism between Core Neurons and Core Tokens, we find that the key neurons and tokens for inference mutually influence and reinforce each other. Building on this insight, we propose CoreMatching, a co-adaptive sparse inference framework that leverages the synergy between token and neuron sparsity to enhance inference efficiency. Through theoretical analysis and efficiency evaluations, we demonstrate that the proposed method surpasses state-of-the-art baselines on ten image understanding tasks and three hardware devices. Notably, on the NVIDIA Titan Xp, it achieves a 5$\times$ FLOPs reduction and a 10$\times$ overall speedup.
Paperid:2434
Authors:Atsutoshi Kumagai · Tomoharu Iwata · Hiroshi Takahashi · Taishi Nishiyama · Kazuki Adachi · Yasuhiro Fujiwara
Abstract:
Maximizing the area under the receiver operating characteristic curve (AUC) is a standard approach to imbalanced binary classification tasks. Existing AUC maximization methods typically assume that training and test distributions are identical. However, this assumption is often violated due to {\it a covariate shift}, where the input distribution can vary but the conditional distribution of the class label given the input remains unchanged. Importance weighting is a common approach to the covariate shift, which minimizes the test risk with importance-weighted training data; however, it cannot maximize the AUC. In this paper, to achieve this, we theoretically derive two estimators of the test AUC risk under the covariate shift by using positive and unlabeled (PU) data from the training distribution and unlabeled data from the test distribution. Our first estimator is calculated from importance-weighted PU data in the training distribution, and the second from importance-weighted positive data in the training distribution and unlabeled data in the test distribution. We train classifiers by minimizing a weighted sum of the two AUC risk estimators that approximates the test AUC risk. Unlike existing importance weighting, our method requires neither negative labels nor class priors. We show the effectiveness of our method on six real-world datasets.
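For intuition on the importance-weighting side, a generic importance-weighted pairwise AUC risk over labeled positive and negative samples can be sketched as follows. This is a hedged illustration of standard importance weighting applied to the AUC, not the paper's PU-based estimators (which notably avoid negative labels):

```python
import numpy as np

def iw_auc_risk(scores_pos, scores_neg, w_pos, w_neg):
    """Importance-weighted pairwise AUC risk: the weighted fraction of
    positive-negative pairs that the scorer ranks incorrectly."""
    diff = scores_pos[:, None] - scores_neg[None, :]        # pairwise margins
    loss = (diff < 0).astype(float) + 0.5 * (diff == 0)     # 0-1 ranking loss, ties count 1/2
    weight = w_pos[:, None] * w_neg[None, :]                # product of per-sample importance weights
    return float((weight * loss).sum() / weight.sum())

# With uniform weights this reduces to 1 - AUC: one of the four pairs is misranked.
risk = iw_auc_risk(np.array([0.9, 0.4]), np.array([0.5, 0.1]),
                   np.ones(2), np.ones(2))
```

In the covariate-shift setting the weights would be density ratios $w(x) = p_{\text{test}}(x)/p_{\text{train}}(x)$ evaluated at each training sample.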
Paperid:2435
Authors:Joseph Lazzaro · Ciara Pike-Burke
Abstract:
Piecewise constant functions describe a variety of real-world phenomena in domains ranging from chemistry to manufacturing. In practice, it is often necessary to confidently identify the locations of the abrupt changes in these functions as quickly as possible. For this, we introduce a fixed-confidence piecewise constant bandit problem, in which we sequentially query points in the domain and receive noisy evaluations of the function under bandit feedback. We provide instance-dependent lower bounds for the complexity of change point identification in this problem. These lower bounds illustrate that an optimal method should focus its sampling efforts adjacent to each of the change points, and that the number of samples around each change point should be inversely proportional to the magnitude of the change. Building on this, we devise a simple and computationally efficient variant of Track-and-Stop and prove that it is asymptotically optimal in many regimes. We support our theoretical findings with experimental results in synthetic environments demonstrating the efficiency of our method.
Paperid:2436
Authors:Vladimir Kostic · Karim Lounici · Hélène Halconruy · Timothée Devergne · Pietro Novelli · Massimiliano Pontil
Abstract:
Markov processes serve as a universal model for many real-world random processes. This paper presents a data-driven approach for learning these models through the spectral decomposition of the infinitesimal generator (IG) of the Markov semigroup. The unbounded nature of IGs complicates traditional methods such as vector-valued regression and Hilbert-Schmidt operator analysis. Existing techniques, including physics-informed kernel regression, are computationally expensive and limited in scope, with no recovery guarantees for transfer operator methods when the time-lag is small. We propose a novel method that leverages the IG's resolvent, characterized by the Laplace transform of transfer operators. This approach is robust to time-lag variations, ensuring accurate eigenvalue learning even for small time-lags. Our statistical analysis applies to a broader class of Markov processes than current methods while reducing computational complexity from quadratic to linear in the state dimension. Finally, we illustrate the behaviour of our method in two experiments.
Paperid:2437
Authors:Li-Wei Chen · Takuya Higuchi · Zakaria Aldeneh · Ahmed Hussen Abdelaziz · Alexander Rudnicky
Abstract:
The success of large language models in text processing has inspired their adaptation to speech modeling. However, due to the continuous and complex nature of speech, it is often discretized into tokens. Among these, speech tokens derived from self-supervised speech models (e.g., HuBERT) typically focus on the linguistic aspects of speech and neglect prosodic information. As a result, autoregressive models trained on these tokens may generate speech with reduced naturalness. Existing approaches have attempted to address this limitation by adding pitch features to these semantic tokens prior to autoregressive modeling. However, pitch alone cannot fully represent the range of paralinguistic attributes, and selecting the right features requires careful hand-engineering. To overcome this challenge, we propose an end-to-end variational approach that automatically learns to encode these continuous speech attributes to enhance the semantic tokens. Our approach eliminates the need for manual extraction and selection of paralinguistic features. Moreover, we demonstrate that our proposed approach produces speech continuations that are much preferred by human raters.
Paperid:2438
Authors:Yazid Janati el idrissi · Badr MOUFAD · Mehdi Qassime · Alain Oliviero Durmus · Eric Moulines · Jimmy Olsson
Abstract:
Denoising diffusion models have driven significant progress in the field of Bayesian inverse problems. Recent approaches use pretrained diffusion models as priors to solve a wide range of such problems, only leveraging inference-time compute and thereby eliminating the need to retrain task-specific models on the same dataset. To approximate the posterior of a Bayesian inverse problem, a diffusion model samples from a sequence of intermediate posterior distributions, each with an intractable likelihood function. This work proposes a novel mixture approximation of these intermediate distributions. Since direct gradient-based sampling of these mixtures is infeasible due to intractable terms, we propose a practical method based on Gibbs sampling. We validate our approach through extensive experiments on image inverse problems, utilizing both pixel- and latent-space diffusion priors, as well as on source separation with an audio diffusion model.
Paperid:2439
Authors:Tim Pearce · Tabish Rashid · David Bignell · Raluca Georgescu · Sam Devlin · Katja Hofmann
Abstract:
The performance of embodied agents has been shown to improve by increasing model parameters, dataset size, and compute. This has been demonstrated in domains from robotics to video games, when generative learning objectives on offline datasets (pretraining) are used to model an agent's behavior (imitation learning) or their environment (world modeling). This paper characterizes the role of scale in these tasks more precisely. Going beyond the simple intuition that `bigger is better', we show that the same types of power laws found in language modeling also arise in world modeling and imitation learning (e.g., between loss and optimal model size). However, the coefficients of these laws are heavily influenced by the tokenizer, task \& architecture -- this has important implications for the optimal sizing of models and data.
Paperid:2440
Authors:Xiaoyu Wu · Jiaru Zhang · Steven Wu
Abstract:
Diffusion Models (DMs) have evolved into advanced image generation tools, especially for few-shot fine-tuning where a pretrained DM is fine-tuned on a small set of images to capture specific styles or objects. Many people upload these personalized checkpoints online, fostering communities such as Civitai and HuggingFace. However, model owners may overlook the potential risks of data leakage by releasing their fine-tuned checkpoints. Moreover, concerns regarding copyright violations arise when unauthorized data is used during fine-tuning. In this paper, we ask: "Can training data be extracted from these fine-tuned DMs shared online?" A successful extraction would present not only data leakage threats but also offer tangible evidence of copyright infringement. To answer this, we propose FineXtract, a framework for extracting fine-tuning data. Our method approximates fine-tuning as a gradual shift in the model's learned distribution---from the original pretrained DM toward the fine-tuning data. By extrapolating the models before and after fine-tuning, we guide the generation toward high-probability regions within the fine-tuned data distribution. We then apply a clustering algorithm to extract the most probable images from those generated using this extrapolated guidance. Experiments on DMs fine-tuned with datasets such as WikiArt, DreamBooth, and real-world checkpoints posted online validate the effectiveness of our method, extracting approximately 20\% of fine-tuning data in most cases, significantly surpassing baseline performance. The code is available anonymously.
Paperid:2441
Authors:Xiaoyuan Zhang · Peijie Li · Ying Ying YU · Yichi Zhang · Han Zhao · Qingfu Zhang
Abstract:
Distribution matching is a key technique in machine learning, with applications in generative models, domain adaptation, and algorithmic fairness. A related but less explored challenge is generating a distribution that aligns with multiple underlying distributions, often with conflicting objectives, known as a Pareto optimal distribution. In this paper, we develop a general theory based on information geometry to construct the Pareto set and front for the entire exponential family under KL and inverse KL divergences. This formulation allows explicit derivation of the Pareto set and front for multivariate normal distributions, enabling applications like multiobjective variational autoencoders (MOVAEs) to generate interpolated image distributions. Additionally, we propose a nonparametric approach using adversarial distribution matching, leading to the multiobjective generative adversarial network (MOGAN) algorithm. Experimental results on real-world images demonstrate that both algorithms can generate high-quality interpolated images across multiple distributions.
Paperid:2442
Authors:Sang Truong · Yuheng Tu · Percy Liang · Bo Li · Sanmi Koyejo
Abstract:
Current generative model evaluation procedures are costly and sensitive to test set selection, making iterative evaluation impractical. In this paper, we employ a model-based evaluation framework using Item Response Theory (IRT), which decouples model performance from the test subset selection, improving the reliability and efficiency of the evaluation. Further, we propose two innovations: amortized calibration to reduce the cost of estimating item parameters of the IRT model and a conditional item generator based on a large language model to automate diverse item generation. Our experiments on 25 common natural language benchmarking datasets and 184 language models show that this approach is more reliable and efficient compared to traditional evaluation methods, offering a scalable solution to evaluating generative models. Our implementation is available at \url{https://anonymous.4open.science/r/reeval-1187}.
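As background on the IRT component, a minimal two-parameter logistic (2PL) sketch models the probability that a model of ability theta answers an item correctly from the item's discrimination and difficulty. This is illustrative only; the exact parameterization used in the paper may differ:

```python
import math

def p_correct(theta, a, b):
    """2PL IRT: probability of a correct response for a model with
    ability theta on an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# The same model (theta = 0.5) on an easy item vs. a hard item:
easy = p_correct(0.5, a=1.0, b=-1.0)
hard = p_correct(0.5, a=1.0, b=2.0)
```

Because ability and difficulty live on a shared scale, fitted abilities are comparable across any subset of calibrated items, which is what decouples model performance from the choice of test subset.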
Paperid:2443
Authors:Changze Lv · Jingwen Xu · Yiyang Lu · Xiaohua Wang · Zhenghua Wang · Zhibo Xu · Di Yu · Xin Du · Xiaoqing Zheng · Xuanjing Huang
Abstract:
Backpropagation is the foundational algorithm for training neural networks and a key driver of deep learning's success. However, its biological plausibility has been challenged due to three primary limitations: weight symmetry, reliance on global error signals, and the dual-phase nature of training, as highlighted by the existing literature. Although various alternative learning approaches have been proposed to address these issues, most either fail to satisfy all three criteria simultaneously or yield suboptimal results. Inspired by the dynamics and plasticity of pyramidal neurons, we propose Dendritic Localized Learning (DLL), a novel learning algorithm designed to overcome these challenges. Extensive empirical experiments demonstrate that DLL satisfies all three criteria of biological plausibility while achieving state-of-the-art performance among algorithms that meet these requirements. Furthermore, DLL exhibits strong generalization across a range of architectures, including MLPs, CNNs, and RNNs. These results, benchmarked against existing biologically plausible learning algorithms, offer valuable empirical insights for future research. We hope this study inspires the development of new biologically plausible algorithms for training multilayer networks and advances progress in both neuroscience and machine learning.
Paperid:2444
Authors:Alexandre Blain · Thirion Bertrand · Pierre Neuvial
Abstract:
Split Conformal Prediction (SCP) provides a computationally efficient way to construct confidence intervals in prediction problems. Notably, most of the theory built around SCP is focused on the single test point setting. In real life, inference sets consist of multiple points, which raises the question of coverage guarantees for many points simultaneously. While *on average* the False Coverage Proportion (FCP) remains controlled, it can fluctuate strongly around its mean, the False Coverage Rate (FCR). We observe that when a dataset is split multiple times, classical SCP may not control the FCP in a majority of the splits. We propose CoJER, a novel method that achieves sharp FCP control in probability for conformal prediction, based on a recent characterization of the distribution of conformal $p$-values in a transductive setting. This procedure incorporates an aggregation scheme which provides robustness with respect to modeling choices. We show through extensive real-data experiments that CoJER provides FCP control while standard SCP does not. Furthermore, CoJER yields shorter intervals than the *state-of-the-art method* for FCP control and only slightly larger intervals than standard SCP.
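For reference, the standard single-point SCP baseline can be sketched in a few lines (an illustrative sketch with the usual absolute-residual score; variable names are assumptions):

```python
import numpy as np

def split_conformal_interval(y_cal, pred_cal, pred_test, alpha=0.1):
    """Split conformal: prediction +/- the (1 - alpha) empirical quantile
    (with finite-sample correction) of calibration residuals."""
    n = len(y_cal)
    scores = np.abs(y_cal - pred_cal)                 # conformity scores
    k = int(np.ceil((n + 1) * (1.0 - alpha)))         # finite-sample-corrected rank
    q = np.sort(scores)[min(k, n) - 1]
    return pred_test - q, pred_test + q

rng = np.random.default_rng(0)
y_cal = rng.normal(size=200)
pred_cal = np.zeros(200)                              # a trivial "model" predicting 0
lo, hi = split_conformal_interval(y_cal, pred_cal, pred_test=0.0)
```

The marginal guarantee attaches to a single test point; applied to a whole inference set, the realized FCP of such intervals fluctuates around alpha, which is precisely the failure mode the abstract describes.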
Paperid:2445
Authors:Yilong Chen · Junyuan Shang · Zhenyu Zhang · Jiawei Sheng · Tingwen Liu · Shuohuan Wang · Yu Sun · Hua Wu · Haifeng Wang
Abstract:
Transformer models encounter inefficiency when scaling hidden dimensions due to the uniform expansion of parameters. Delving into the sparsity of hidden dimensions, we observe that only a small subset of dimensions is highly activated: some dimensions are commonly activated across tokens, while others are uniquely activated for individual tokens. To leverage this, we propose MoHD (Mixture of Hidden Dimensions), a sparse architecture that combines shared sub-dimensions for common features and dynamically routes specialized sub-dimensions per token. To address the potential information loss from sparsity, we introduce activation scaling and group fusion mechanisms. MoHD efficiently expands hidden dimensions with minimal computational increase, outperforming vanilla Transformers in both parameter efficiency and task performance across 10 NLP tasks. MoHD achieves 1.7% higher performance with 50% fewer activated parameters, and 3.7% higher performance with a 3× expansion of total parameters at constant activated-parameter cost. MoHD offers a new perspective on model scaling, showcasing the potential of hidden dimension sparsity.
Paperid:2446
Authors:Alexander Lobashev · Dmitry Guskov · Maria Larchenko · Mikhail Tamm
Abstract:
This paper presents a novel method for analyzing the latent space geometry of generative models, including statistical physics models and diffusion models, by reconstructing the Fisher information metric. The method approximates the posterior distribution of latent variables given generated samples and uses this to learn the log-partition function, which defines the Fisher metric for exponential families. Theoretical convergence guarantees are provided, and the method is validated on the Ising and TASEP models, outperforming existing baselines in reconstructing thermodynamic quantities. Applied to diffusion models, the method reveals a fractal structure of phase transitions in the latent space, characterized by abrupt changes in the Fisher metric. We demonstrate that while geodesic interpolations are approximately linear within individual phases, this linearity breaks down at phase boundaries, where the diffusion model exhibits a divergent Lipschitz constant with respect to the latent space. These findings provide new insights into the complex structure of diffusion model latent spaces and their connection to phenomena like phase transitions.
Paperid:2447
Authors:Catherine Chen · Jingyan Shen · Xinyu Yang · Lihua Lei
Abstract:
Recent developments in large language models (LLMs) have led to their widespread usage for various tasks. The prevalence of LLMs in society demands assurance of the reliability of their performance. In particular, risk-sensitive applications demand meticulous attention to unexpectedly poor outcomes, i.e., tail events such as toxic answers, humiliating language, and offensive outputs. Due to the costly nature of acquiring human annotations, general-purpose scoring models have been created to automate the process of quantifying these tail events. This introduces potential human-machine misalignment between the respective scoring mechanisms. In this work, we present a lightweight calibration framework for black-box models that ensures the alignment of humans and machines with provable guarantees. Our framework provides a rigorous approach to controlling, with high confidence, any distortion risk measure that is characterized by a weighted average of quantiles of the loss incurred by the LLM. The theoretical foundation of our method rests on the connection between conformal risk control and a traditional family of statistics, L-statistics. To demonstrate the utility of our framework, we conduct comprehensive experiments that address the issue of human-machine misalignment.
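A weighted average of loss quantiles is estimated empirically by an L-statistic, i.e., a weighted average of order statistics; a minimal sketch (illustrative weights and names, not the paper's calibration procedure):

```python
import numpy as np

def l_statistic_risk(losses, weights):
    """Estimate a distortion risk measure as an L-statistic:
    a (normalized) weighted average of the sorted losses."""
    order_stats = np.sort(losses)
    w = np.asarray(weights, dtype=float)
    return float((w / w.sum()) @ order_stats)

losses = np.array([0.2, 0.9, 0.1, 0.5])
mean_risk = l_statistic_risk(losses, np.ones(4))              # uniform weights: the plain mean
tail_risk = l_statistic_risk(losses, np.array([0, 0, 1, 1]))  # upper half only: CVaR at level 1/2
```

Putting all the weight on the upper order statistics yields tail-sensitive measures such as CVaR, which is why this family is natural for controlling the tail events the abstract targets.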
Paperid:2448
Authors:Arsalan Sharifnassab · Saber Salehkaleybar · Rich Sutton
Abstract:
We address the challenge of optimizing metaparameters (hyperparameters) in machine learning, a key factor for efficient training and high model performance. Rather than relying on expensive meta-parameter search methods, we introduce MetaOptimize: a dynamic approach that adjusts meta-parameters, particularly step sizes (also known as learning rates), during training. More specifically, MetaOptimize can wrap around any first-order optimization algorithm, tuning step sizes on the fly to minimize a specific form of regret that considers the long-term impact of step sizes on training, through a discounted sum of future losses. We also introduce lower-complexity variants of MetaOptimize that, in conjunction with its adaptability to various optimization algorithms, achieve performance comparable to those of the best hand-crafted learning rate schedules across diverse machine learning tasks.
Paperid:2449
Authors:Lang Feng · Weihao Tan · Zhiyi Lyu · Longtao Zheng · Haiyang Xu · Ming Yan · Fei Huang · Bo An
Abstract:
Online fine-tuning of vision-language model (VLM) agents with reinforcement learning (RL) has shown promise for equipping agents with multi-step, goal-oriented capabilities in dynamic environments. However, their open-ended textual action space and the non-end-to-end nature of action generation present significant challenges to effective online exploration in RL, e.g., an explosion of the exploration space. We propose a novel online fine-tuning method, Counterfactual Soft Reinforcement Learning (CoSo), better suited to the textual output space of VLM agents. Compared to prior methods that assign uniform uncertainty to all tokens, CoSo leverages counterfactual reasoning to dynamically assess the causal influence of individual tokens on post-processed actions. By prioritizing the exploration of action-critical tokens while reducing the impact of semantically redundant or low-impact tokens, CoSo enables a more targeted and efficient online rollout process. We provide theoretical analysis proving CoSo's convergence and policy improvement guarantees, and extensive empirical evaluations supporting CoSo's effectiveness. Our results across a diverse set of agent tasks, including Android device control, card gaming, and embodied AI, highlight its remarkable ability to enhance exploration efficiency and deliver consistent performance gains.
Paperid:2450
Authors:Fengqiang Wan · Yang Yang
Abstract:
Incremental learning (IL) aims to sequentially learn new tasks while mitigating catastrophic forgetting. Among various IL strategies, parameter-isolation methods stand out by using mask techniques to allocate distinct parameters to each task, explicitly addressing forgetting. However, existing approaches often disregard parameter dependencies, resulting in an over-reliance on newly allocated parameters. To address this issue, we propose Probabilistic Group Mask selection (PGM), a group-wise approach that captures parameter dependencies by exploring candidate masks within each group. Specifically, PGM partitions parameters into groups with multiple candidate masks, assigning probabilities to these masks and leveraging Gumbel-Softmax for differentiable sampling, enabling efficient optimization of the discrete mask selection process. Our theoretical analysis demonstrates that incorporating parameter dependencies enhances sub-network selection. Experiments conducted on standard benchmarks confirm PGM's superior effectiveness compared to existing IL approaches. The source code is available at: https://anonymous.4open.science/r/PGM-A139.
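The Gumbel-Softmax ingredient mentioned above can be sketched in a few lines of numpy; the group size, number of candidate masks, and temperature below are illustrative assumptions, not PGM's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0):
    """Draw a relaxed one-hot sample over candidate masks.

    Adds Gumbel(0, 1) noise to the logits and applies a temperature-
    scaled softmax; as tau -> 0 the sample approaches a hard one-hot,
    while staying differentiable with respect to the logits.
    """
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    z = (logits + g) / tau
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

# One parameter group with 4 candidate binary masks over 6 parameters.
candidate_masks = rng.integers(0, 2, size=(4, 6)).astype(float)
logits = np.zeros(4)                     # learnable mask scores

probs = gumbel_softmax(logits, tau=0.5)
soft_mask = probs @ candidate_masks      # relaxed combination of masks
```

Because the sample is a smooth function of the logits, the discrete choice among candidate masks can be optimized by ordinary gradient descent.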
Paperid:2451
Authors:Keitaro Sakamoto · Issei Sato
Abstract:
The attention mechanism is a fundamental component of the transformer model and plays a significant role in its success. However, the theoretical understanding of how attention learns to select tokens is still an emerging area of research. In this work, we study the training dynamics and generalization ability of the attention mechanism, under classification problems with label noise. We show that, with the characterization of the signal-to-noise ratio (SNR), the token selection of the attention mechanism achieves ``benign overfitting'', i.e., maintaining high generalization performance despite fitting label noise. Our work also demonstrates an interesting delayed acquisition of generalization after an initial phase of overfitting. Finally, we provide experiments to support our theoretical analysis using both synthetic and real-world datasets.
Paperid:2452
Authors:Kevin Xu · Issei Sato
Abstract:
Looped Transformers provide advantages in parameter efficiency, computational capabilities, and generalization for reasoning tasks. However, their expressive power regarding function approximation remains underexplored. In this paper, we establish the approximation rate of Looped Transformers by defining the modulus of continuity for sequence-to-sequence functions. This reveals a limitation specific to the looped architecture, and the analysis motivates the incorporation of scaling parameters for each loop, conditioned on timestep encoding. Experiments validate the theoretical results, showing that increasing the number of loops enhances performance, with further gains achieved through the timestep encoding.
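A toy numpy sketch of the architectural idea: a single weight-tied block applied for a fixed number of loops, with one learnable scalar per loop standing in for the timestep-conditioned scaling (the block form and shapes are illustrative assumptions, not the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(0)

def looped_block(x, W, b, scales):
    """Apply one shared (weight-tied) block for len(scales) loops.

    Every loop reuses the same W and b; the only loop-specific
    parameters are the scalars scales[t], indexed by timestep t.
    """
    h = x
    for t in range(len(scales)):
        h = h + scales[t] * np.tanh(h @ W + b)  # residual, shared weights
    return h

d = 8
W = rng.standard_normal((d, d)) / np.sqrt(d)
b = np.zeros(d)
scales = np.full(12, 0.5)   # one scaling parameter per loop (T = 12)
out = looped_block(rng.standard_normal((4, d)), W, b, scales)
```

Parameter count is independent of the loop count T except for the T scalars, which is what makes per-loop scaling a cheap way to increase expressivity.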
Paperid:2453
Authors:Lucas Rosenblatt · Bin Han · Robert Wolfe · Bill Howe
Abstract:
Large language models (LLMs) can leak sensitive training data through memorization and membership inference attacks. Prior work has primarily focused on strong adversarial assumptions, including attacker access to entire samples or long, ordered prefixes, leaving open the question of how vulnerable LLMs are when adversaries have only partial, unordered sample information. For example, if an attacker knows a patient has "hypertension," under what conditions can they query a model fine-tuned on patient data to learn the patient also has "osteoarthritis"? In this paper, we introduce a more general threat model under this weaker assumption and show that fine-tuned LLMs are susceptible to these fragment-specific extraction attacks. To systematically investigate these attacks, we propose two data-blind methods: (1) a likelihood ratio attack inspired by methods from membership inference, and (2) a novel approach, PRISM, which regularizes the ratio by leveraging an external prior. Using examples from both medical and legal settings, we show that both methods are competitive with a data-aware baseline classifier that assumes access to labeled in-distribution data, underscoring their robustness.
Paperid:2454
Authors:Ekaterina Filimoshina · Dmitry Shirokov
Abstract:
We propose, implement, and compare with competitors a new architecture of equivariant neural networks based on geometric (Clifford) algebras: Generalized Lipschitz Group Equivariant Neural Networks (GLGENN). These networks are equivariant to all pseudo-orthogonal transformations, including rotations and reflections, of a vector space with any non-degenerate or degenerate symmetric bilinear form. We propose a weight-sharing parametrization technique that takes into account the fundamental structures and operations of geometric algebras. Due to this technique, the GLGENN architecture is parameter-light and less prone to overfitting than baseline equivariant models. GLGENN outperforms or matches competitors on several benchmark equivariant tasks, including estimation of an equivariant function and a convex hull experiment, while using significantly fewer optimizable parameters.
Paperid:2455
Authors:Shuo Xie · Tianhao Wang · Sashank J. Reddi · Sanjiv Kumar · Zhiyuan Li
Abstract:
In this paper, we present a novel unified analysis for a broad class of adaptive optimization algorithms with structured (e.g., layer-wise, diagonal, and Kronecker-factored) preconditioners for both online regret minimization and offline convex optimization. Our analysis not only provides matching rates for several important structured preconditioned algorithms, including diagonal AdaGrad, full-matrix AdaGrad, and AdaGrad-Norm, but also gives an improved convergence rate for a one-sided variant of Shampoo over that of the original Shampoo. Interestingly, more structured preconditioners (e.g., diagonal AdaGrad and AdaGrad-Norm, which use less space and compute) are often presented merely as computationally efficient approximations to full-matrix AdaGrad, with better approximations expected to yield better optimization performance. Our unified analysis challenges this prevailing view and reveals, perhaps surprisingly, that more structured preconditioners, despite using less space and computation per step, can outperform their less structured counterparts. To demonstrate this, we show that one-sided Shampoo, which is much cheaper than full-matrix AdaGrad, can outperform it both theoretically and experimentally.
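For concreteness, the two most structured preconditioners named above differ only in how second-moment statistics are accumulated; a minimal numpy sketch of the standard textbook updates (not the paper's analysis):

```python
import numpy as np

def adagrad_diagonal_step(w, g, state, lr=0.1, eps=1e-8):
    """Diagonal AdaGrad: one squared-gradient accumulator per coordinate."""
    state = state + g * g
    return w - lr * g / (np.sqrt(state) + eps), state

def adagrad_norm_step(w, g, state, lr=0.1, eps=1e-8):
    """AdaGrad-Norm: a single scalar accumulator for the whole vector."""
    state = state + float(g @ g)
    return w - lr * g / (np.sqrt(state) + eps), state

# one step from w = 0 with gradient (3, 4)
w0 = np.zeros(2)
g = np.array([3.0, 4.0])
w_diag, _ = adagrad_diagonal_step(w0, g, np.zeros(2))
w_norm, _ = adagrad_norm_step(w0, g, 0.0)
```

Diagonal AdaGrad stores a vector the size of the parameters; AdaGrad-Norm stores one scalar, i.e., even less space per step, which is the regime the abstract's surprising comparison concerns.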
Paperid:2456
Authors:Hanru Bai · Weiyang Ding
Abstract:
Neural ordinary differential equations (NODEs) have demonstrated strong capabilities in modeling time series. However, existing NODE-based methods often focus solely on the surface-level dynamics derived from observed states, which limits their ability to capture more complex underlying behaviors. To overcome this challenge, we propose KoNODE, a Koopman-driven NODE framework that explicitly models the evolution of ODE parameters over time to encode deep-level information. KoNODE captures the essential yet simple intrinsic linear dynamics that govern the surface dynamics by employing Koopman operators. Our framework operates at three hierarchical levels: the observed state dynamics, the parameter dynamics, and the Koopman linear dynamics, the last representing the fundamental driving rules of the state dynamics. The proposed approach offers significant improvements in two critical time series tasks: long-term prediction (enabled by the simple linear dynamics) and generalization to new data (driven by the evolving ODE parameters). We validate KoNODE through experiments on synthetic data from complex dynamical systems and on real-world datasets, demonstrating its effectiveness in practical scenarios.
Paperid:2457
Authors:Fengchun Qiao · Yanlin Chen · Xi Peng
Abstract:
Ensemble learning is a powerful approach for improving generalization under distribution shifts, but its effectiveness heavily depends on how individual models are combined. Existing methods often optimize ensemble weights based on validation data, which may not represent unseen test distributions, leading to suboptimal performance in out-of-distribution (OoD) settings. Inspired by Distributionally Robust Optimization (DRO), we propose Structure-informed Risk Minimization (SRM), a principled framework that learns robust ensemble weights without access to test data. Unlike standard DRO, which defines uncertainty sets based on divergence metrics alone, SRM incorporates structural information of training distributions, ensuring that the uncertainty set aligns with plausible real-world shifts. This approach mitigates the over-pessimism of traditional worst-case optimization while maintaining robustness. We introduce a computationally efficient optimization algorithm with theoretical guarantees and demonstrate that SRM achieves superior OoD generalization compared to existing ensemble combination strategies across diverse benchmarks.
Paperid:2458
Authors:Kevin Han Huang · Ni Zhan · Elif Ertekin · Peter Orbanz · Ryan P. Adams
Abstract:
Incorporating group symmetries into neural networks has been a cornerstone of success in many AI-for-science applications. Diagonal groups of isometries, which describe invariance under a simultaneous movement of multiple objects, arise naturally in many-body quantum problems. Despite their importance, diagonal groups have received relatively little attention, as they lack a natural choice of invariant maps except in special cases. We study different ways of incorporating diagonal invariance in neural network ansätze trained via variational Monte Carlo methods, considering specifically data augmentation, group averaging, and canonicalization. We show that, contrary to standard ML setups, in-training symmetrization destabilizes training and can lead to worse performance. Our theoretical and numerical results indicate that this unexpected behavior may arise from a unique computational-statistical tradeoff not found in standard ML analyses of symmetrization. Meanwhile, we demonstrate that post hoc averaging is less sensitive to such tradeoffs and emerges as a simple, flexible and effective method for improving neural network solvers.
Paperid:2459
Authors:Aditya Gangrade · Aldo Pacchiano · Clay Scott · Venkatesh Saligrama
Abstract:
We study the 'feasible action search' (FAS) problem for linear bandits, wherein a learner attempts to discover a feasible point for a set of linear constraints $\Phi_* a \ge 0,$ without knowledge of the matrix $\Phi_* \in \mathbb{R}^{m \times d}$. A FAS learner selects a sequence of actions $a_t,$ and uses observations of the form $\Phi_* a_t + \mathrm{noise}$ to either find a point with nearly optimal 'safety margin', or detect that the constraints are infeasible, where the safety margin of an action measures its (signed) distance from the constraint boundary. While of interest in its own right, the FAS problem also directly addresses a key deficiency in the extant theory of 'safe linear bandits' (SLBs), by discovering a safe initialisation for low-regret SLB methods. We propose and analyse a novel efficient FAS-learner. Our method, FAST, is based on Thompson Sampling. It applies a _coupled_ random perturbation to an estimate of $\Phi_*,$ and plays a maximin point of a game induced by this perturbed matrix. We prove that FAST stops in $\tilde{O}(d^3/\varepsilon^2 M_*^2)$ steps, and incurs $\tilde{O}(d^3/|M_*|)$ safety costs, to either correctly detect infeasibility, or output a point that is at least $(1-\varepsilon) M_*$-safe, where $M_*$ is the _optimal safety margin_ of $\Phi_*$. Further, instantiating prior SLB methods with the output of FAST yields the first SLB methods that incur $\tilde{O}(\sqrt{d^3 T/M_*^2})$ regret and $O(1)$ risk without a priori knowledge of a safe action. The main technical novelty lies in the extension of Thompson Sampling to this multi-objective setting, for which we both propose a coupled noise design and provide an analysis that avoids convexity considerations.
Paperid:2460
Authors:Xingyu Wu · Jibin Wu · Yu Zhou · Liang Feng · KC Tan
Abstract:
Algorithm selection aims to identify the best-performing algorithm before execution. Existing techniques typically focus on the observed correlations between algorithm performance and meta-features. However, little research has explored the underlying mechanisms of algorithm selection, specifically what characteristics an algorithm must possess to effectively tackle problems with certain feature values. This gap not only limits explainability but also makes existing models vulnerable to data bias and distribution shift. This paper introduces a directed acyclic graph (DAG) to describe this mechanism, proposing a novel modeling paradigm that aligns more closely with the fundamental logic of algorithm selection. By leveraging the DAG to characterize the algorithm feature distribution conditioned on problem features, our approach enhances robustness against marginal distribution changes and allows for finer-grained predictions through the reconstruction of optimal algorithm features, with the final decision relying on differences between reconstructed and rejected algorithm features. Furthermore, we demonstrate that the learned DAG and the proposed counterfactual calculations offer our approach both model-level and instance-level explainability. Extensive experiments on the ASlib benchmark validate the advantages of the proposed model in terms of robustness and explainability. The code will be made publicly available after the review process.
Paperid:2461
Authors:Xiang Fang · Arvind Easwaran · Blaise Genest
Abstract:
Out-of-distribution (OOD) detection attempts to distinguish outlier samples to prevent models trained on the in-distribution (ID) dataset from producing unreliable outputs. Most OOD detection methods require many ID samples for training, which seriously limits their real-world applications. To this end, we target a challenging setting: few-shot OOD detection, where only a few labeled ID samples are available. Few-shot OOD detection is therefore much more challenging than the traditional OOD detection setting. Previous few-shot OOD detection works ignore the distinct diversity between different classes. In this paper, we propose a novel network, the Adaptive Multi-prompt Contrastive Network (AMCN), which adapts the ID-OOD separation boundary by learning inter- and intra-class distributions. To compensate for the absence of OOD image samples and the scarcity of ID image samples, we leverage CLIP, connecting text with images, and engineer learnable ID and OOD textual prompts. Specifically, we first generate adaptive prompts (learnable ID prompts, label-fixed OOD prompts and label-adaptive OOD prompts). Then, we generate an adaptive class boundary for each class by introducing a class-wise threshold. Finally, we propose a prompt-guided ID-OOD separation module to control the margin between ID and OOD prompts. Experimental results show that our proposed method outperforms other state-of-the-art works.
Paperid:2462
Authors:Zhenghai Xue · Lang Feng · Jiacheng Xu · Kang Kang · xiang wen · Bo An · Shuicheng YAN
Abstract:
To learn from data collected under diverse dynamics, Imitation from Observation (IfO) methods leverage expert state trajectories based on the premise that recovering expert state distributions in other dynamics facilitates policy learning in the current one. However, Imitation Learning inherently imposes a performance upper bound on learned policies. Additionally, as the environment dynamics change, certain expert states may become inaccessible, rendering their distributions less valuable for imitation. To address this, we propose a novel framework that integrates reward maximization with IfO, employing F-distance regularized policy optimization. This framework enforces constraints on globally accessible states—those with nonzero visitation frequency across all considered dynamics—mitigating the challenge posed by inaccessible states. By instantiating the F-distance in different ways, we derive two theoretical analyses and develop a practical algorithm called Accessible State Oriented Policy Regularization (ASOR). ASOR serves as a general-purpose module that can be incorporated into various RL approaches, including offline RL and off-policy RL. Extensive experiments across multiple benchmarks demonstrate ASOR's effectiveness in enhancing state-of-the-art cross-domain policy transfer algorithms, significantly improving their performance.
Paperid:2463
Authors:Xingyu Zhu · Abhishek Panigrahi · Sanjeev Arora
Abstract:
We formalize a new concept for LLMs, context-enhanced learning. It involves standard gradient-based learning on text, except that the context is enhanced with additional data on which no auto-regressive gradients are computed. This setting is a gradient-based analog of usual in-context learning (ICL) and appears in some recent works. Using a multi-step reasoning task, we prove in a simplified setting that context-enhanced learning can be exponentially more sample-efficient than standard learning when the model is capable of ICL. At a mechanistic level, we find that the benefit of context-enhancement arises from a more accurate gradient learning signal. We also experimentally demonstrate that it appears hard to detect or recover learning materials that were used in the context during training. This may have implications for data security as well as copyright.
Paperid:2464
Authors:Hanlin Yu · Arto Klami · Aapo Hyvarinen · Anna Korba · L'Emir Omar Chehab
Abstract:
Density ratio estimation in high dimensions can be reframed as integrating a certain quantity, the time score, over probability paths which interpolate between the two densities. In practice, the time score has to be estimated based on samples from the two densities. However, existing methods for this problem remain computationally expensive and can yield inaccurate estimates. Inspired by recent advances in generative modeling, we introduce a novel framework for time score estimation, based on a conditioning variable. Choosing the conditioning variable judiciously enables a closed-form objective function. We demonstrate that, compared to previous approaches, our approach results in faster learning of the time score and competitive or better estimation accuracy of the density ratio on challenging tasks. Furthermore, we establish theoretical guarantees on the error of the estimated density ratio.
Paperid:2465
Authors:Aditi Ashutosh Mavalankar · Hassan Mansoor · Zita Marinho · Mariia Samsikova · Tom Schaul
Abstract:
Scaling up inference-time compute has proven to be a valuable strategy in improving the performance of Large Language Models (LLMs) without fine-tuning. An important task that can benefit from additional inference-time compute is self-repair; given an initial flawed response or guess, the LLM corrects its own mistake and produces an improved response or fix. We leverage the in-context learning ability of LLMs to perform self-repair in the coding domain. The key contribution of our paper is an approach that synthesises and selects an ordered set of golden example pairs, or AuPairs, of these initial guesses and subsequent fixes for the corresponding problems. Each such AuPair is provided as a single in-context example at inference time to generate a repaired solution. For an inference-time compute budget of $N$ LLM calls per problem, $N$ AuPairs are used to generate $N$ repaired solutions, out of which the highest-scoring solution is the final answer. The underlying intuition is that if the LLM is given a different example of fixing an incorrect guess each time, it can subsequently generate a diverse set of repaired solutions. Our algorithm selects these AuPairs in a manner that maximises complementarity and usefulness. We demonstrate the results of our algorithm on 5 LLMs across 7 competitive programming datasets for the code repair task. Our algorithm yields a significant boost in performance compared to best-of-$N$ and self-repair, and also exhibits strong generalisation across datasets and models. Moreover, our approach shows stronger scaling with inference-time compute budget compared to baselines.
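Complementarity-driven selection of example pairs can be illustrated with a generic greedy-coverage rule: repeatedly add the candidate pair with the largest marginal gain over the best score already achieved on each problem. The score matrix and criterion below are assumptions for illustration, not the paper's exact algorithm:

```python
import numpy as np

def greedy_select_pairs(scores, k):
    """Greedily pick k example pairs maximising marginal coverage.

    scores[i, j] = repair quality of candidate pair i on problem j.
    Each step adds the pair that most improves the best-so-far score
    per problem (rewarding complementarity, not raw average quality).
    """
    n_pairs, n_problems = scores.shape
    best = np.zeros(n_problems)
    chosen = []
    for _ in range(k):
        gains = np.maximum(scores, best).sum(axis=1) - best.sum()
        gains[chosen] = -np.inf           # do not reselect a pair
        i = int(np.argmax(gains))
        chosen.append(i)
        best = np.maximum(best, scores[i])
    return chosen

# pair 1 is broadly useful; pair 2 covers a problem the others miss
scores = np.array([[1.0, 0.0, 0.0],
                   [0.9, 0.9, 0.0],
                   [0.0, 0.0, 0.8]])
picked = greedy_select_pairs(scores, 2)
```

Note the greedy rule prefers pair 2 over pair 0 on the second pick even though pair 0 has a higher single-problem score, because pair 2 covers a problem not yet handled.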
Paperid:2466
Authors:Honglin Yuan · Xingfeng Li · Jian Dai · Xiaojian You · Yuan Sun · Zhenwen Ren
Abstract:
Existing deep multi-view clustering methods have demonstrated excellent performance, addressing issues such as missing views and view noise. However, almost all existing methods operate within a static framework, which assumes that all views have already been collected. In practical scenarios, new views are continuously collected over time, forming a stream of views. Additionally, there exists an imbalance in quality and distribution between different view streams, i.e., the concept drift problem. To this end, we propose a novel Deep Streaming View Clustering (DSVC) method, which mitigates the impact of concept drift on streaming view clustering. Specifically, DSVC consists of a knowledge base and three core modules. Through the knowledge aggregation learning module, DSVC extracts representative features and prototype knowledge from the new view. Subsequently, the distribution consistency learning module aligns the prototype knowledge from the current view with the historical knowledge distribution to mitigate the impact of concept drift. Then, the knowledge guidance learning module leverages the prototype knowledge to guide the data distribution and enhance the clustering structure. Finally, the prototype knowledge from the current view is updated in the knowledge base to guide the learning of subsequent views. Extensive experiments demonstrate that, even in dynamic environments, DSVC outperforms 12 state-of-the-art DMVC methods built on static frameworks.
Paperid:2467
Authors:Qihe Huang · Zhengyang Zhou · Kuo Yang · Zhongchao Yi · Xu Wang · Yang Wang
Abstract:
Long-term time series forecasting (LTSF) has traditionally relied on large parameter counts to capture extended temporal dependencies, resulting in substantial computational costs and inefficiencies in both memory usage and processing time. However, time series data, unlike high-dimensional images or text, often exhibit temporal pattern similarity and low-rank structure, especially over long horizons. By leveraging this structure, models can be guided to focus on more essential, concise temporal features, improving both accuracy and computational efficiency. In this paper, we introduce TimeBase, an ultra-lightweight network with fewer than 0.39k parameters, designed to harness the power of minimalism in LTSF. TimeBase 1) extracts core basis temporal components and 2) transforms traditional point-level forecasting into efficient segment-level forecasting, achieving optimal utilization of both data and parameters. Extensive experiments conducted on 21 real-world datasets of varying scales and domains demonstrate that TimeBase not only achieves minimalism in both model size and computational cost, reducing MACs by 15.2x, parameter counts by 342.5x, and inference time by 3.3x compared to the lightest prediction models, but also delivers state-of-the-art forecasting performance, ranking in the top 2 across 158 out of 168 prediction settings. Additionally, TimeBase can serve as a very effective plug-and-play complexity reducer for any patch-based forecasting model, enabling extreme complexity reduction (i.e., 77.74%~93.00% MACs reduction for PatchTST) while improving prediction accuracy. Code is available at https://anonymous.4open.science/r/TimeBase_ICML.
Paperid:2468
Authors:Zequn Yang · Hongfa Wang · Di Hu
Abstract:
Interactions between modalities—redundancy, uniqueness, and synergy—collectively define the structure of multimodal information. These interactions are essential for understanding the underlying dynamics of multimodal data and determining the performance of multimodal systems. While significant progress has been made in quantifying interactions at the distribution level, estimating sample-level interactions in real-world datasets remains a significant challenge. To address this, we propose a Lightweight Sample-wise Multimodal Interaction (LSMI) estimation approach, rooted in pointwise information theory. Starting with redundancy—the most intuitive interaction to measure—we develop a redundancy estimation framework using a suitable pointwise information measure. Building on this framework, we design an interaction estimation paradigm based on an efficient entropy estimation method, well-suited for sample-wise analysis in continuous distributions. We validate our approach through extensive experiments on meticulously designed synthetic datasets and various real-world datasets, demonstrating precision and efficiency. Crucially, our sample-wise estimation reveals fine-grained sample/category dynamics in multimodal data, enhancing interpretability by distilling knowledge from interactions and capturing sub-model specificity in ensemble learning.
Paperid:2469
Authors:Ruchika Chavhan · Abhinav Mehrotra · Malcolm Chadwick · Alberto Gil Couto Pimentel Ramos · Luca Morreale · Mehdi Noroozi · Sourav Bhattacharya
Abstract:
Text-to-image synthesis has witnessed remarkable advancements in recent years. Many attempts have been made to adapt text-to-image models to support multiple tasks. However, existing approaches typically require resource-intensive re-training or additional parameters to accommodate the new tasks, which makes the model inefficient for on-device deployment. We propose Multi-Task Upcycling (MTU), a simple yet effective recipe that extends the capabilities of a pre-trained text-to-image diffusion model to support a variety of image-to-image generation tasks. MTU replaces Feed-Forward Network (FFN) layers in the diffusion model with smaller FFNs, referred to as experts, and combines them with a dynamic routing mechanism. To the best of our knowledge, MTU is the first multi-task diffusion modeling approach that seamlessly blends multi-tasking with on-device compatibility, by mitigating the issue of parameter inflation. We show that the performance of MTU is on par with single-task fine-tuned diffusion models across several tasks, including image editing, super-resolution, and inpainting, while maintaining similar latency and computational load (GFLOPs) as the single-task fine-tuned models.
Paperid:2470
Authors:Alexander Capstick · Rahul G. Krishnan · Payam Barnaghi
Abstract:
Large language models (LLMs) acquire a breadth of information across various domains. However, their computational complexity, cost, and lack of transparency often hinder their direct application for predictive tasks where privacy and interpretability are paramount. In fields such as healthcare, biology, and finance, specialised and interpretable linear models still hold considerable value. In such domains, labelled data may be scarce or expensive to obtain. Well-specified prior distributions over model parameters can reduce the sample complexity of learning through Bayesian inference; however, eliciting expert priors can be time-consuming. We therefore introduce AutoElicit to extract knowledge from LLMs and construct priors for predictive models. We show these priors are informative and can be refined using natural language. We perform a careful study contrasting AutoElicit with in-context learning and demonstrate how to perform model selection between the two methods. We find that AutoElicit yields priors that can substantially reduce error over uninformative priors, using fewer labels, and consistently outperform in-context learning. We show that AutoElicit saves over 6 months of labelling effort when building a new predictive model for urinary tract infections from sensor recordings of people living with dementia.
Paperid:2471
Authors:Alireza Kazemipour · Matthew Taylor · Michael Bowling
Abstract:
A tenet of reinforcement learning is that rewards are always observed by the agent. However, this is not true in many realistic settings, e.g., a human observer may not always be able to provide rewards, a sensor to observe rewards may be limited or broken, or rewards may be unavailable during deployment. Monitored Markov decision processes (Mon-MDPs) have recently been proposed as a model of such settings. Yet, Mon-MDP algorithms developed thus far do not fully exploit the problem structure, cannot take advantage of a known monitor, have no worst-case guarantees for "unsolvable" Mon-MDPs without specific initialization, and only have asymptotic proofs of convergence. This paper makes three contributions. First, we introduce a model-based algorithm for Mon-MDPs that addresses all of these shortcomings. The algorithm uses two instances of model-based interval estimation, one to guarantee that observable rewards are indeed observed, and another to learn the optimal policy. Second, empirical results demonstrate these advantages, showing faster convergence than prior algorithms in over two dozen benchmark settings, and even more dramatic improvements when the monitor process is known. Third, we present the first finite-sample bound on performance and show convergence to an optimal worst-case policy when some rewards are never observable.
Paperid:2472
Authors:Chao Li · Jiawei Fan · Anbang Yao
Abstract:
In this paper, we present Morse, a simple and universal framework for accelerating diffusion models. The key insight of Morse is to reformulate the iterative generation (from noise to data) process via taking advantage of fast jump sampling and adaptive residual feedback strategies. Specifically, Morse involves two models called Dash and Dot that interact with each other. The Dash model is just the pretrained diffusion model of any type, but operates in a jump sampling regime, creating sufficient space for sampling efficiency improvement. The Dot model is significantly faster than the Dash model, which is learnt to generate residual feedback conditioned on the observations at the current jump sampling point on the trajectory of the Dash model, lifting the noise estimate to easily match the next-step estimate of the Dash model without jump sampling. By chaining the outputs of the Dash and Dot models run in a time-interleaved fashion, Morse exhibits the merit of flexibly attaining desired image generation performance while improving overall runtime efficiency. With our proposed weight sharing strategy between the Dash and Dot models, Morse is efficient for training and inference. We validate the efficacy of our method under a variety of experimental setups. Our method shows an average speedup of 1.78× to 3.31× over a wide range of sampling step budgets relative to baseline diffusion models. Furthermore, we show that our method can be also generalized to improve the Latent Consistency Model (LCM-SDXL, which is already accelerated with consistency distillation technique) tailored for few-step text-to-image synthesis. The code will be made publicly available.
Paperid:2473
Authors:Elizabeth M Fons Etcheverry · Alejandro Sztrajman · Yousef El-Laham · Luciana Ferrer · Svitlana Vyetrenko · Manuela Veloso
Abstract:
Time series imputation with missing or irregularly sampled data is a persistent challenge in machine learning. Most frequency-domain methods rely on the Fast Fourier Transform (FFT), which assumes uniform sampling, therefore requiring interpolation or imputation prior to frequency estimation. We propose a novel diffusion-based imputation approach (LSCD) that leverages Lomb--Scargle periodograms to robustly handle missing and irregular samples without requiring interpolation or imputation in the frequency domain. Our method trains a score-based diffusion model conditioned on the entire signal spectrum, enabling direct usage of irregularly spaced observations. Experiments on synthetic and real-world benchmarks demonstrate that our method recovers missing data more accurately than purely time-domain baselines, while simultaneously producing consistent frequency estimates. Crucially, our framework paves the way for broader adoption of Lomb--Scargle methods in machine learning tasks involving incomplete or irregular data.
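The Lomb--Scargle periodogram at the heart of this approach is computed directly from irregular sample times, with no interpolation onto a regular grid; a self-contained numpy sketch of the classical estimator (the demo signal and frequency grid are illustrative):

```python
import numpy as np

def lomb_scargle(t, y, freqs):
    """Classical Lomb-Scargle periodogram for irregularly sampled y(t).

    For each trial frequency, a phase offset tau decouples the sine and
    cosine least-squares fits, so the power is a sum of two independent
    projections of the mean-centred signal.
    """
    y = y - y.mean()
    power = np.empty_like(freqs)
    for i, f in enumerate(freqs):
        w = 2 * np.pi * f
        tau = np.arctan2(np.sum(np.sin(2 * w * t)),
                         np.sum(np.cos(2 * w * t))) / (2 * w)
        c = np.cos(w * (t - tau))
        s = np.sin(w * (t - tau))
        power[i] = 0.5 * ((y @ c) ** 2 / (c @ c) + (y @ s) ** 2 / (s @ s))
    return power

rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0, 20, 300))       # irregular sample times
y = np.sin(2 * np.pi * 0.7 * t) + 0.1 * rng.standard_normal(300)
freqs = np.linspace(0.05, 2.0, 400)
peak = freqs[np.argmax(lomb_scargle(t, y, freqs))]  # should recover ~0.7
```

Because the estimator never needs uniformly spaced samples, the spectrum it produces can condition a model on signals with arbitrary missingness patterns.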
Paperid:2474
Authors:Marc Jourdan · Gizem Yüce · Nicolas Flammarion
Abstract:
Recent advances in language modeling have underscored the role of preference feedback in enhancing model performance. This paper investigates the conditions under which preference feedback improves parameter estimation in classes of continuous parametric distributions. In our framework, the learner observes pairs of samples from an unknown distribution along with their relative preferences depending on the same unknown parameter. We show that preference-based M-estimators achieve a better asymptotic variance than sample-only M-estimators, further improved by deterministic preferences. Leveraging the hard constraints revealed by deterministic preferences, we propose an estimator achieving an estimation error scaling of $\mathcal{O}(1/n)$---a significant improvement over the $\Theta(1/\sqrt{n})$ rate attainable with samples alone. Next, we establish a lower bound that matches this accelerated rate, up to problem-dependent constants. While the assumptions underpinning our analysis are restrictive, they are satisfied by notable cases such as Gaussian or Laplace distributions for preferences based on the log-probability reward.
Paperid:2475
Authors:Shaokun Zhang · Ming Yin · Jieyu Zhang · Jiale Liu · Zhiguang Han · Jingyang Zhang · Beibin Li · Chi Wang · Huazheng Wang · Yiran Chen · Qingyun Wu
Abstract:
Failure attribution in LLM multi-agent systems—identifying the agent and step responsible for task failures—provides crucial clues for system debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems. To support this initiative, we introduce the Who\&When dataset, comprising extensive failure logs from 127 LLM multi-agent systems with fine-grained annotations linking failures to specific agents and decisive error steps. Using Who\&When, we develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5\% accuracy in identifying failure-responsible agents but only 14.2\% in pinpointing failure steps, with some methods performing below random. Even SOTA reasoning models, such as OpenAI o1 and DeepSeek R1, fail to achieve practical usability. These results highlight the task's complexity and the need for further research in this area.
Paperid:2476
Authors:Paris Giampouras · HanQin Cai · Rene Vidal
Abstract:
In this paper, we focus on a matrix factorization-based approach for robust recovery of low-rank asymmetric matrices from corrupted measurements. We propose an Overparameterized Preconditioned Subgradient Algorithm (OPSA) and provide, for the first time in the literature, linear convergence rates independent of the rank of the sought asymmetric matrix in the presence of gross corruptions. Our work goes beyond existing results in preconditioned-type approaches addressing their current limitation, i.e., the lack of convergence guarantees in the case of asymmetric matrices of unknown rank. By applying our approach to (robust) matrix sensing, we highlight its merits when the measurement operator satisfies a mixed-norm restricted isometry property. Lastly, we present extensive numerical experiments that validate our theoretical results and demonstrate the effectiveness of our approach for different levels of overparameterization and corruption from outliers.
Paperid:2477
Authors:Ruben Weitzman · Peter Mørch Groth · Lood van Niekerk · Aoi Otani · Yarin Gal · Debora Marks · Pascal Notin
Abstract:
Retrieving homologous protein sequences is essential for a broad range of protein modeling tasks such as fitness prediction, protein design, structure modeling, and protein-protein interactions. Traditional workflows have relied on a two-step process: first retrieving homologs via Multiple Sequence Alignments (MSA), then training models on one or more of these alignments. However, MSA-based retrieval is computationally expensive, struggles with highly divergent sequences and complex insertions/deletions, and operates independently of downstream modeling. We introduce Protriever, an end-to-end differentiable framework that unifies retrieval and task modeling. Focusing on protein fitness prediction, we show that Protriever achieves performance on par with the most sensitive MSA-based tools while being orders of magnitude faster at retrieval, as it relies on efficient vector search. Protriever is both architecture- and task-agnostic, and can flexibly adapt to different retrieval strategies and protein databases at inference—offering a scalable alternative to alignment-centric approaches.
Paperid:2478
Authors:Sungwon Han · Seungeon Lee · MEEYOUNG CHA · Sercan Arik · Jinsung Yoon
Abstract:
Time series forecasting uses historical data to predict future trends, leveraging the relationships between past observations and available features. In this paper, we propose RAFT, a retrieval-augmented time series forecasting method to provide sufficient inductive biases and complement the model's learning capacity. When forecasting the subsequent time frames, we directly retrieve historical data candidates from the training dataset with patterns most similar to the input, and utilize the future values of these candidates alongside the inputs to obtain predictions. This simple approach augments the model's capacity by externally providing information about past patterns via retrieval modules. Our empirical evaluations on ten benchmark datasets show that RAFT consistently outperforms contemporary baselines with an average win ratio of 86%.
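The retrieval step admits a small illustration. The following toy sketch (the function name and the plain nearest-neighbor averaging are ours; RAFT itself feeds the retrieved futures into a learned forecasting model rather than averaging them directly) shows the idea on a periodic series:

```python
import numpy as np

def retrieve_forecast(history, window, horizon, k=3):
    """Forecast `horizon` steps by averaging the futures of the k training
    windows most similar (Euclidean distance) to the last `window` points."""
    query = history[-window:]
    # candidate windows must leave room for a future of length `horizon`
    starts = np.arange(len(history) - window - horizon)
    cands = np.stack([history[s:s + window] for s in starts])
    dists = np.linalg.norm(cands - query, axis=1)
    top = starts[np.argsort(dists)[:k]]          # k most similar past windows
    futures = np.stack([history[s + window:s + window + horizon] for s in top])
    return futures.mean(axis=0)

t = np.arange(400)
series = np.sin(2 * np.pi * t / 50)              # periodic toy series
pred = retrieve_forecast(series, window=25, horizon=10)
truth = np.sin(2 * np.pi * np.arange(400, 410) / 50)
```

On this periodic toy signal the retrieved futures coincide with the true continuation, which is the inductive bias the abstract describes.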
Paperid:2479
Authors:Yunyi Shen · Hao Sun · Jean-Francois Ton
Abstract:
Building neural reward models from human preferences is a pivotal component in reinforcement learning from human feedback (RLHF) and large language model alignment research. Given the scarcity and high cost of human annotation, how to select the most informative pairs to annotate is an essential yet challenging open problem. In this work, we highlight the insight that an ideal comparison dataset for reward modeling should balance exploration of the representation space and make informative comparisons between pairs with moderate reward differences. Technically, challenges arise in quantifying the two objectives and automatically selecting comparisons to be annotated. To address this, we propose Fisher information-based selection strategies, adapting theories from the classical experimental design literature and applying them to the final linear layer of deep neural network-based reward models. Empirically, our method demonstrates remarkable performance, high computational efficiency, and stability compared to various other selection methods from both deep learning and classical statistical literature across open-source LLMs and datasets. Further ablation studies reveal that incorporating cross-prompt comparisons in active reward modeling significantly enhances labeling efficiency, shedding light on the potential for improved annotation strategies in RLHF.
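For intuition, a D-optimal-style selection rule on a linear last layer with Bradley-Terry preference probabilities can be sketched as below; the function names and the greedy single-pair rule are illustrative assumptions, not the paper's exact procedure. Note how a pair with a moderate predicted reward gap beats both a near-tie with tiny features and a near-certain comparison:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def select_pair(A, diffs, theta):
    """Pick the candidate feature difference z maximizing the D-optimal gain
    log det(A + w z z^T), where w = p(1-p) is the Bradley-Terry Fisher weight
    under current weights theta. A is the accumulated information matrix."""
    best, best_gain = None, -np.inf
    for i, z in enumerate(diffs):
        p = sigmoid(z @ theta)
        w = p * (1 - p)
        # matrix determinant lemma: det(A + w zz^T) = det(A) (1 + w z^T A^-1 z)
        gain = np.log1p(w * (z @ np.linalg.solve(A, z)))
        if gain > best_gain:
            best, best_gain = i, gain
    return best

A = np.eye(2)                       # information collected so far
theta = np.array([1.0, 0.0])        # current reward-model weights
diffs = np.array([[0.01, 0.0],      # near-tie: almost no feature difference
                  [2.0, 0.0],       # moderate reward gap
                  [10.0, 0.0],      # near-certain outcome: w ~ 0
                  [0.0, 2.0]])      # unexplored direction, p = 0.5
chosen = select_pair(A, diffs, theta)
```

The unexplored, moderate-uncertainty pair (index 3) wins, mirroring the exploration-vs-informativeness balance the abstract highlights.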
Paperid:2480
Authors:Aymeric Capitaine · Etienne Boursier · Eric Moulines · Michael Jordan · Alain Oliviero Durmus
Abstract:
The framework of uncoupled online learning in multiplayer games has made significant progress in recent years. In particular, the development of time-varying games has considerably expanded its modeling capabilities. However, current regret bounds quickly become vacuous when the game undergoes significant variations over time, even when these variations are easy to predict. Intuitively, the ability of players to forecast future payoffs should lead to tighter guarantees, yet existing approaches fail to incorporate this aspect. This work aims to fill this gap by introducing a novel prediction-aware framework for time-varying games, where agents can forecast future payoffs and adapt their strategies accordingly. In this framework, payoffs depend on an underlying state of nature that agents predict in an online manner. To leverage these predictions, we propose the POWMU algorithm, a contextual extension of the optimistic Multiplicative Weight Update algorithm, for which we establish theoretical guarantees on social welfare and convergence to equilibrium. Our results demonstrate that, under bounded prediction errors, the proposed framework achieves performance comparable to the static setting. Finally, we empirically demonstrate the effectiveness of POWMU in a traffic routing experiment.
Paperid:2481
Authors:Yuan Feng · Yukun Cao · Hairu Wang · Xike Xie · S Kevin Zhou
Abstract:
Sketches, probabilistic structures for estimating item frequencies in infinite data streams with limited space, are widely used across various domains. Recent studies have shifted the focus from handcrafted sketches to neural sketches, leveraging memory-augmented neural networks (MANNs) to enhance the streaming compression capabilities and achieve better space-accuracy trade-offs. However, existing neural sketches struggle to scale across different data domains and space budgets due to inflexible MANN configurations. In this paper, we introduce a scalable MANN architecture that brings to life the Lego sketch, a novel sketch with superior scalability and accuracy. Much like assembling creations with modular Lego bricks, the Lego sketch dynamically coordinates multiple memory bricks to adapt to various space budgets and diverse data domains. Theoretical analysis and empirical studies demonstrate its scalability and superior space-accuracy trade-offs, outperforming existing handcrafted and neural sketches.
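For readers unfamiliar with the handcrafted baseline being improved on, here is a minimal count-min sketch (our own illustrative implementation, not the Lego sketch): d hash rows of w counters each, whose frequency estimates can overcount due to collisions but never undercount:

```python
import numpy as np

class CountMin:
    """Minimal count-min sketch: add() increments one counter per row,
    query() returns the minimum counter across rows (an upper bound)."""
    def __init__(self, d=4, w=256, seed=0):
        rng = np.random.default_rng(seed)
        self.table = np.zeros((d, w), dtype=np.int64)
        self.salts = [int(s) for s in rng.integers(1, 2**31, size=d)]
        self.w = w

    def _cols(self, item):
        # one hashed column per row, salted so rows are (pseudo-)independent
        return [hash((s, item)) % self.w for s in self.salts]

    def add(self, item):
        for r, c in enumerate(self._cols(item)):
            self.table[r, c] += 1

    def query(self, item):
        return min(self.table[r, c] for r, c in enumerate(self._cols(item)))

cm = CountMin()
for _ in range(100):
    cm.add("a")
for _ in range(3):
    cm.add("b")
```

The space-accuracy trade-off is fixed by (d, w) at construction time; the inflexibility of such fixed budgets is what the modular memory bricks above address.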
Paperid:2482
Authors:Zhuofan Zong · Dongzhi Jiang · Bingqi Ma · Guanglu Song · Hao Shao · Dazhong Shen · Yu Liu · Hongsheng Li
Abstract:
Significant achievements in personalization of diffusion models have been witnessed. Conventional tuningfree methods mostly encode multiple reference images by averaging or concatenating their image embeddings as the injection condition, but such an image-independent operation cannot perform interaction among images to capture consistent visual elements within multiple references. Although tuning-based approaches can effectively extract consistent elements within multiple images through the training process, it necessitates test-time finetuning for each distinct image group. This paper introduces EasyRef, a plug-and-play adaption method that empowers diffusion models to condition consistent visual elements (e.g., style and human facial identity, etc.) across multiple reference images under instruction controls. To effectively exploit consistent visual elements within multiple images, we leverage the multi-image comprehension and instruction-following capabilities of the multimodal large language model (MLLM), prompting it to capture consistent visual elements based on the instruction. Besides, injecting the MLLM's representations into the diffusion process through adapters can easily generalize to unseen domains. To mitigate computational costs and enhance fine-grained detail preservation, we introduce an efficient reference aggregation strategy and a progressive training scheme. Finally, we introduce MRBench, a new multi-reference image generation benchmark. Experimental results demonstrate EasyRef surpasses both tuning-free and tuning-based methods, achieving superior aesthetic quality and robust zero-shot generalization across diverse domains.
Paperid:2483
Authors:Alexander Atanasov · Jacob A Zavatone-Veth · Cengiz Pehlevan
Abstract:
Recent years have seen substantial advances in our understanding of high-dimensional ridge regression, but existing theories assume that training examples are independent. By leveraging techniques from random matrix theory and free probability, we provide sharp asymptotics for the in- and out-of-sample risks of ridge regression when the data points have arbitrary correlations. We demonstrate that in this setting, the generalized cross-validation (GCV) estimator fails to correctly predict the out-of-sample risk. However, in the case where the noise residuals have the same correlations as the data points, one can modify the GCV to yield an efficiently-computable unbiased estimator that concentrates in the high-dimensional limit, which we dub CorrGCV. We further extend our asymptotic analysis to the case where the test point has nontrivial correlations with the training set, a setting often encountered in time series forecasting. Assuming knowledge of the correlation structure of the time series, this again yields an extension of the GCV estimator, and sharply characterizes the degree to which such test points yield an overly optimistic prediction of long-time risk. We validate the predictions of our theory across a variety of high-dimensional datasets.
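The estimator being corrected is easy to write down. A sketch of the standard GCV score for ridge regression under i.i.d. rows (the setting where it is valid; the sizes and seed below are toy choices, and CorrGCV's correction for correlated rows is not shown):

```python
import numpy as np

def gcv(X, y, lam):
    """Standard generalized cross-validation score for ridge regression:
    GCV(lam) = (||(I - S) y||^2 / n) / (1 - tr(S)/n)^2,
    with hat matrix S = X (X^T X + lam I)^-1 X^T."""
    n, d = X.shape
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
    resid = y - S @ y
    return (resid @ resid / n) / (1 - np.trace(S) / n) ** 2

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.normal(size=(n, d))            # independent rows: GCV's home turf
beta = rng.normal(size=d)
y = X @ beta + rng.normal(size=n)      # unit-variance noise
lams = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = [gcv(X, y, lam) for lam in lams]
```

With a strong signal, heavy shrinkage (large lam) inflates the score, so minimizing GCV over lam tracks the out-of-sample risk here; the abstract's point is that this proxy breaks when rows are correlated.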
Paperid:2484
Authors:Drew Prinster · Xing Han · Anqi Liu · Suchi Saria
Abstract:
Responsibly deploying artificial intelligence (AI) / machine learning (ML) systems in high-stakes settings arguably requires not only proof of system reliability, but moreover continual, post-deployment monitoring to quickly detect and address any unsafe behavior. Statistical methods for nonparametric change-point detection---especially the tools of conformal test martingales (CTMs) and anytime-valid inference---offer promising approaches to this monitoring task. However, existing methods are restricted to monitoring limited hypothesis classes or ``alarm criteria,'' such as data shifts that violate certain exchangeability assumptions, or do not allow for online adaptation in response to shifts. In this paper, we expand the scope of these monitoring methods by proposing a weighted generalization of conformal test martingales (WCTMs), which lay a theoretical foundation for online monitoring for any unexpected change-points in the data distribution while controlling false alarms. For practical applications, we propose specific WCTM algorithms that accommodate online adaptation to mild covariate shifts (in the marginal input distribution) while raising alarms in response to more severe shifts, such as concept shifts (in the conditional label distribution) or extreme (out-of-support) covariate shifts that cannot be easily adapted to. On real-world datasets, we demonstrate improved performance relative to state-of-the-art baselines.
Paperid:2485
Authors:Thibaut Issenhuth · Sangchul Lee · Ludovic Dos Santos · Jean-Yves Franceschi · Chansoo Kim · alain rakotomamonjy
Abstract:
Consistency models imitate the multi-step sampling of score-based diffusion in a single forward pass of a neural network. They can be learned in two ways: consistency distillation and consistency training. The former relies on the true velocity field of the corresponding differential equation, approximated by a pre-trained neural network. In contrast, the latter uses a single-sample Monte Carlo estimate of this velocity field. The related estimation error induces a discrepancy between consistency distillation and training that, we show, still holds in the continuous-time limit. To alleviate this issue, we propose a novel flow that transports noisy data towards their corresponding outputs derived from a consistency model. We prove that this flow reduces the previously identified discrepancy and the noise-data transport cost. Consequently, our method not only accelerates consistency training convergence but also enhances its overall performance.
Paperid:2486
Authors:Weilin Cai · Juyong Jiang · Le Qin · junweicui · Sunghun Kim · Jiayi Huang
Abstract:
Expert parallelism has emerged as a key strategy for distributing the computational workload of sparsely-gated mixture-of-experts (MoE) models across multiple devices, enabling the processing of increasingly large-scale models. However, the All-to-All communication inherent to expert parallelism poses a significant bottleneck, limiting the efficiency of MoE models. While existing optimization methods partially mitigate this issue, they remain constrained by the sequential dependency between communication and computation operations. To address this challenge, we propose ScMoE, a novel shortcut-connected MoE architecture integrated with an overlapping parallelization strategy. ScMoE decouples communication from its conventional sequential ordering, enabling up to 100% overlap with computation. Compared to the prevalent top-2 MoE baseline, ScMoE achieves speedups of 1.49$\times$ in training and 1.82$\times$ in inference. Moreover, our experiments and analyses indicate that ScMoE not only achieves model quality comparable to existing approaches but in some instances surpasses them.
Paperid:2487
Authors:Hongrui Peng · Haolang Lu · Kun Wang · Guoshun Nan · Yuanlong Yu · WeiYe Fu
Abstract:
Knowledge graphs (KGs) are ubiquitous in numerous real-world applications, and watermarking facilitates protecting intellectual property and preventing potential harm from AI-generated content. Existing watermarking methods mainly focus on static plain text or image data, while they can hardly be applied to dynamic graphs due to spatial and temporal variations of structured data. This motivates us to propose KGMark, the first graph watermarking framework that aims to generate robust, detectable, and transparent diffusion fingerprints for dynamic KG data. Specifically, we propose a novel clustering-based alignment method to adapt the watermark to spatial variations. Meanwhile, we present a redundant embedding strategy to harden the diffusion watermark against various attacks, facilitating the robustness of the watermark to the temporal variations. Additionally, we introduce a novel learnable mask matrix to improve the transparency of diffusion fingerprints. By doing so, our KGMark properly tackles the variation challenges of structured data. Experiments on various public benchmarks show the effectiveness of our proposed KGMark.
Paperid:2488
Authors:Alex Kokot · Octavian-Vlad Murad · Marina Meila
Abstract:
In this paper, we clarify the effect of noise on common spectrally-motivated algorithms such as Diffusion Maps (DM) for dimension reduction. Empirically, these methods are much more robust to noise than current work suggests. Specifically, existing consistency results require that either the noise amplitude or dimensionality must vary with the sample size $n$. We provide new theoretical results demonstrating that low-frequency eigenpairs reliably capture the geometry of the underlying manifold under a constant noise level, up to a dimension-independent threshold $O(r^{-2})$, where $r$ is the noise amplitude. Our results rely on a decomposition of the manifold Laplacian in the Sasaki metric, a technique not used before in this area, to our knowledge. We experimentally validate our theoretical predictions. Additionally, we observe similar robust behavior for other manifold learning algorithms which are not based on computing the Laplacian, namely LTSA and VAE.
Paperid:2489
Authors:Ali Behrouz · Ali Parviz · Mahdi Karami · Clayton Sanford · Bryan Perozzi · Vahab Mirrokni
Abstract:
Modern sequence models (e.g., Transformers and linear RNNs) emerged as dominant backbones of recent deep learning frameworks, mainly due to their efficiency, representational power, and/or ability to capture long-range dependencies. Recently, adopting these sequence models for graph-structured data has gained popularity as the alternative to Message Passing Neural Networks (MPNNs). There is, however, a lack of a common foundation about what constitutes a good graph sequence model, and a mathematical description of the benefits and deficiencies in adopting different sequence models for learning on graphs. To this end, we introduce the Graph Sequence Model (GSM), a unifying framework for applying sequence models to graph data. The GSM framework allows us to understand, evaluate, and compare the power of different sequence model backbones in graph tasks. Building on this insight, we propose GSM++, a fast hybrid model that hierarchically tokenizes the graph using Hierarchical Affinity Clustering (HAC) and then encodes these sequences via a hybrid architecture. The theoretical and experimental findings confirm that GSM++ outperforms baseline models on most benchmarks.
Paperid:2490
Authors:Wanli Hong · Yuliang Shi · Jonathan Niles-Weed
Abstract:
Motivated by applications in trajectory inference and particle tracking, we introduce Smooth Schrödinger Bridges. Our proposal generalizes prior work by allowing the reference process in the Schrödinger Bridge problem to be a smooth Gaussian process, leading to more regular and interpretable trajectories in applications. Though naïvely smoothing the reference process leads to a computationally intractable problem, we identify a class of processes for which the resulting Smooth Schrödinger Bridge problem can be lifted to a simpler problem on phase space, which can be solved in polynomial time. We develop a practical approximation of this algorithm that outperforms existing methods by 2-10x on numerous simulated and real single-cell RNA-seq datasets.
Paperid:2491
Authors:Sudeepta Mondal · Zhuolin Jiang · Ganesh Sundaramoorthi
Abstract:
We present a theory for the construction of out-of-distribution (OOD) detection features in neural networks. We introduce random features for OOD detection and formulate a novel information-theoretic loss functional consisting of two terms: the first, based on the KL divergence, separates the resulting feature distributions under in- and out-of-distribution data, and the second is the information bottleneck, which favors compressed features that retain the OOD information. We formulate a variational optimization procedure to derive OOD features. We show that, based on underlying assumptions on the OOD distributions and the random features, one can recover properties of shaping functions present in existing state-of-the-art methods for OOD detection. Furthermore, we show that our theory can predict new shaping functions that lead to state-of-the-art results on OOD benchmarks. The theory provides a general framework for constructing a variety of new features with clear interpretability.
Paperid:2492
Authors:Leo Brunswic · Mateo Clémente · Rui Heng Yang · Adam Sigal · Amir Rasouli · Yinchuan Li
Abstract:
Generative Flow Networks (GFNs) were initially introduced on directed acyclic graphs to sample from an unnormalized distribution density. Recent works have extended the theoretical framework for generative methods allowing more flexibility and enhancing application range. However, many challenges remain in training GFNs in continuous settings and for imitation learning (IL), including intractability of flow-matching loss, limited tests of non-acyclic training, and the need for a separate reward model in imitation learning. The present work proposes a family of generative flows called Ergodic Generative Flows (EGFs) which are used to address the aforementioned issues. First, we leverage ergodicity to build simple generative flows with finitely many globally defined transformations (diffeomorphisms) with universality guarantees and tractable flow-matching loss (FM loss). Second, we introduce a new loss involving cross-entropy coupled to weak flow-matching control, coined KL-weakFM loss. It is designed for IL training without a separate reward model. We evaluate IL-EGFs on toy 2D tasks and real-world datasets from NASA on the sphere, using the KL-weakFM loss. Additionally, we conduct toy 2D reinforcement learning experiments with a target reward, using the FM loss.
Paperid:2493
Authors:Kulin Shah · Alkis Kalavasis · Adam Klivans · Giannis Daras
Abstract:
There is strong empirical evidence that the state-of-the-art diffusion modeling paradigm leads to models that memorize the training set, especially when the training set is small. Prior methods to mitigate the memorization problem often lead to a decrease in image quality. Is it possible to obtain strong and creative generative models, i.e., models that achieve high generation quality and low memorization? Despite the current pessimistic landscape of results, we make significant progress in pushing the trade-off between fidelity and memorization. We first provide theoretical evidence that memorization in diffusion models is only necessary for denoising problems at low noise scales (usually used in generating high-frequency details). Using this theoretical insight, we propose a simple, principled method to train the diffusion models using noisy data at large noise scales. We show that our method significantly reduces memorization without decreasing the image quality, for both text-conditional and unconditional models and for a variety of data availability settings.
Paperid:2494
Authors:Zihan Guan · Mengxuan Hu · Ronghang Zhu · Sheng Li · Anil Vullikanti
Abstract:
Recent studies have uncovered a troubling vulnerability in the fine-tuning stage of large language models (LLMs): even fine-tuning on entirely benign datasets can lead to a significant increase in the harmfulness of LLM outputs. Building on this finding, our red teaming study takes this threat one step further by developing a more effective attack. Specifically, we analyze and identify samples within benign datasets that contribute most to safety degradation, then fine-tune LLMs exclusively on these samples. We approach this problem from an outlier detection perspective and propose Self-Inf-N to detect and extract outliers for fine-tuning. Our findings reveal that fine-tuning LLMs on 100 outlier samples selected by Self-Inf-N in the benign datasets severely compromises LLM safety alignment. Extensive experiments across \textbf{seven} mainstream LLMs demonstrate that our attack exhibits high transferability across different architectures and remains effective in practical scenarios. Alarmingly, our results indicate that most existing mitigation strategies fail to defend against this attack, underscoring the urgent need for more robust alignment safeguards.
Paperid:2495
Authors:Zhiyuan Yan · Jiangming Wang · Peng Jin · Ke-Yue Zhang · Chengchun Liu · Shen Chen · Taiping Yao · Shouhong Ding · Baoyuan Wu · Li Yuan
Abstract:
AI-generated images (AIGIs), such as natural or face images, have become increasingly realistic and indistinguishable, making their detection a critical and pressing challenge. In this paper, we start from a new perspective to excavate the reason behind the generalization failure in AIGI detection, named the asymmetry phenomenon, where a naively trained detector tends to overfit to the limited and monotonous fake patterns, causing the feature space to become highly constrained and low-ranked, which we prove seriously limits expressivity and generalization. One potential remedy is incorporating the pre-trained knowledge within vision foundation models (higher-ranked) to expand the feature space, alleviating the model's overfitting to fake patterns. To this end, we employ Singular Value Decomposition (SVD) to decompose the original feature space into two orthogonal subspaces. By freezing the principal components and adapting only the remaining components, we preserve the pre-trained knowledge while learning forgery-related patterns. Compared to existing full-parameter and LoRA-based tuning methods, we explicitly ensure orthogonality, enabling a higher rank of the whole feature space and effectively minimizing overfitting while enhancing generalization. Extensive experiments, together with our deep analysis, on both deepfake and synthetic image detection benchmarks demonstrate superior generalization performance in detection.
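The freeze-principal/adapt-residual idea can be illustrated in a few lines. This numpy sketch (toy sizes and a random stand-in "update", not the paper's training code) checks that an update confined to the minor singular directions stays orthogonal to the frozen principal subspace:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))                   # pre-trained weight (toy)
U, s, Vt = np.linalg.svd(W)
r = 48                                          # principal components to freeze
W_frozen = U[:, :r] @ np.diag(s[:r]) @ Vt[:r]   # preserved pre-trained part

# hypothetical "learned" update, confined to the remaining components
delta = rng.normal(size=(64 - r, 64 - r)) * 0.1
W_adapt = U[:, r:] @ delta @ Vt[r:]
W_new = W_frozen + W_adapt

# the update is exactly orthogonal to the frozen principal subspace
ok = np.abs(U[:, :r].T @ W_adapt).max() < 1e-10
```

Because `U` has orthonormal columns, the adapted part cannot leak into (and overwrite) the principal directions, which is the mechanism behind preserving pre-trained knowledge here.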
Paperid:2496
Authors:Dominik Fuchsgruber · Tom Wollschläger · Johannes Bordne · Stephan Günnemann
Abstract:
While uncertainty estimation for graphs recently gained traction, most methods rely on homophily and deteriorate in heterophilic settings. We address this by analyzing message passing neural networks from an information-theoretic perspective and developing a suitable analog of the data processing inequality to quantify information throughout the model's layers. In contrast to non-graph domains, information about the node-level prediction target can increase with model depth if a node's features are semantically different from its neighbors. Therefore, on heterophilic graphs, the latent embeddings of an MPNN each provide different information about the data distribution, unlike in homophilic settings. This reveals that considering all node representations simultaneously is a key design principle for epistemic uncertainty estimation on graphs beyond homophily. We empirically confirm this with a simple post-hoc density estimator on the joint node embedding space that provides state-of-the-art uncertainty on heterophilic graphs. At the same time, it matches prior work on homophilic graphs without explicitly exploiting homophily through post-processing.
Paperid:2497
Authors:Erik Jones · Anca Dragan · Jacob Steinhardt
Abstract:
Developers try to evaluate whether an AI system can accomplish malicious tasks before releasing it; for example, they might test whether a model enables cyberoffense, user manipulation, or bioterrorism. In this work, we show that individually testing models for such misuse is inadequate; adversaries can misuse combinations of models even when each individual model is safe. The adversary accomplishes this by first decomposing tasks into subtasks, then solving each subtask with the best-suited model. For example, an adversary might solve challenging-but-benign subtasks with an aligned frontier model, and easy-but-malicious subtasks with a weaker misaligned model. We study two decomposition methods: manual decomposition where a human identifies a natural decomposition of a task, and automated decomposition where a weak model generates benign tasks for a frontier model to solve, then uses the solutions in-context to solve the original task. Using these decompositions, we empirically show that adversaries can create vulnerable code, explicit images, python scripts for hacking, and manipulative tweets at much higher rates with combinations of models than either individual model. Our work suggests that even perfectly-aligned frontier systems enable misuse without ever producing malicious outputs, and that red-teaming efforts should extend beyond single models in isolation.
Paperid:2498
Authors:Haoxiong Liu · Jiacheng Sun · Zhenguo Li · Andrew Yao
Abstract:
The synergy between deep learning models and traditional automation tools plays a pivotal role in developing robust neural theorem provers (NTPs). However, for proof synthesis with LLMs, previous work applies automation tools either only when the model explicitly calls the method, or only at a single granularity level, failing to fully exploit the power of built-in tactics and off-the-shelf automated theorem provers. In this work, we propose ProofAug, a novel theorem proving method that enjoys superior sample efficiency through equipping proof-generation LLMs with automation methods at different granularities via fine-grained structure analysis of model-generated proof proposals. Furthermore, ProofAug serves as a versatile plug-and-play module that seamlessly integrates with any tree-search algorithm, enabling our construction of an efficient recursive proving (ERP) module to further enhance performance. The superiority of our method is validated on the miniF2F-test benchmark using the open-source deepseek-math-7b-base model and the Isabelle proof assistant. Notably, by additionally employing a mixed prompting strategy, we achieve a cumulative pass rate of 66.0% after curation of the dataset (61.9% for the original version), setting a new SOTA across all proof languages with a total sample budget of only 2100.
Paperid:2499
Authors:Yilin Wang · Peixuan Lei · chen tao · Jie Song · Haoyuzhe · Yuxuan Zhang · LEI JIA · Yuanxiang Li · Zhongyu Wei
Abstract:
Time-series data are critical in diverse applications, such as industrial monitoring, medical diagnostics, and climate research. However, effectively integrating these high-dimensional temporal signals with natural language for dynamic, interactive tasks remains a significant challenge. To address this, we introduce the Time-Series Question Answering (Time-Series QA) task and release EngineMT-QA, the first large-scale, multi-task, temporal-textual QA dataset designed to capture complex interactions between time-series signals and natural language. Building on this resource, we propose the Instruct Time Transformer (ITFormer), a novel framework that bridges time-series encoders with frozen large language models (LLMs). ITFormer effectively extracts, aligns, and fuses temporal and textual features, achieving a strong improvement in QA accuracy over strong baselines with fewer than 1\% additional trainable parameters. By combining computational efficiency with robust cross-modal modeling, our work establishes an adaptable paradigm for integrating temporal data with natural language, paving the way for new research and applications in multi-modal AI.
Paperid:2500
Authors:Zeen Song · Siyu Zhao · Xingyu Zhang · Jiangmeng Li · Changwen Zheng · Wenwen Qiang
Abstract:
Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success, but its performance can degrade when fine-tuned in out-of-distribution (OOD) scenarios. We model the prediction process using a Structural Causal Model (SCM) and show that the causal mechanism involving both invariant and variant factors in training environments differs from that in test environments. In contrast, the causal mechanism with solely invariant factors remains consistent across environments. We theoretically prove the existence of a linear mapping from CLIP embeddings to invariant factors, which can be estimated using interventional data. Additionally, we provide a condition to guarantee low OOD risk of the invariant predictor. Based on these insights, we propose the Invariant Causal Mechanism of CLIP (CLIP-ICM) framework. CLIP-ICM involves collecting interventional data, estimating a linear projection matrix, and making predictions within the invariant subspace. Experiments on several OOD datasets show that CLIP-ICM significantly improves the performance of CLIP. Our method offers a simple but powerful enhancement, boosting the reliability of CLIP in real-world applications.
Paperid:2501
Authors:Julius Von Rohrscheidt · Bastian Rieck
Abstract:
The Euler Characteristic Transform (ECT) is an efficiently computable geometrical-topological invariant that characterizes the global shape of data. In this paper, we introduce the local Euler Characteristic Transform (l-ECT), a novel extension of the ECT designed to enhance expressivity and interpretability in graph representation learning. Unlike traditional Graph Neural Networks (GNNs), which may lose critical local details through aggregation, the l-ECT provides a lossless representation of local neighborhoods. This approach addresses key limitations in GNNs by preserving nuanced local structures while maintaining global interpretability. Moreover, we construct a rotation-invariant metric based on l-ECTs for spatial alignment of data spaces. Our method demonstrates superior performance compared to standard GNNs on various benchmarking node classification tasks, while also offering theoretical guarantees of its effectiveness.
Paperid:2502
Authors:François Bachoc · Tommaso Cesari · Roberto Colomboni
Abstract:
We study the role of contextual information in the online learning problem of brokerage between traders. In this sequential problem, at each time step, two traders arrive with secret valuations about an asset they wish to trade. The learner (a broker) suggests a trading (or brokerage) price based on contextual data about the asset and the market conditions. Then, the traders reveal their willingness to buy or sell based on whether their valuations are higher or lower than the brokerage price. A trade occurs if one of the two traders decides to buy and the other to sell, i.e., if the broker's proposed price falls between the smallest and the largest of their two valuations. We design algorithms for this problem and prove optimal theoretical regret guarantees under various standard assumptions.
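The trade rule in this abstract can be sketched in a few lines (an illustration of the described mechanism only, not the authors' learning algorithm; the function name is ours):

```python
def trade_occurs(price: float, valuation_a: float, valuation_b: float) -> bool:
    """A trade happens iff the broker's posted price falls between the two
    secret valuations, so one trader is willing to buy and the other to sell."""
    lo, hi = sorted((valuation_a, valuation_b))
    return lo <= price <= hi
```

Note that the broker never observes the valuations themselves, only each trader's buy/sell indicator, which is what makes the feedback partial.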
Paperid:2503
Authors:Yonatan Sommer · Ivri Hikri · lotan amit · Nir Rosenfeld
Abstract:
When learning is used to inform decisions about humans, such as for loans, hiring, or admissions, this can incentivize users to strategically modify their features to obtain positive predictions. A key assumption is that modifications are costly, and are governed by a cost function that is exogenous and predetermined. We challenge this assumption, and assert that the deployment of a classifier is what creates costs. Our idea is simple: when users seek positive predictions, this creates demand for important features; and if features are available for purchase, then a market will form, and competition will give rise to prices. We extend the strategic classification framework to support this notion, and study learning in a setting where a classifier can induce a market for features. We present an analysis of the learning task, devise an algorithm for computing market prices, propose a differentiable learning framework, and conduct experiments to explore our novel setting and approach.
Paperid:2504
Authors:Victor Dheur · Matteo Fontana · Yorick Estievenart · Naomi Desobry · Souhaib Ben Taieb
Abstract:
Conformal prediction provides a powerful framework for constructing distribution-free prediction regions with finite-sample coverage guarantees. While extensively studied in univariate settings, its extension to multi-output problems presents additional challenges, including complex output dependencies and high computational costs, and remains relatively underexplored. In this work, we present a unified comparative study of nine conformal methods with different multivariate base models for constructing multivariate prediction regions within the same framework. This study highlights their key properties while also exploring the connections between them. Additionally, we introduce two novel classes of conformity scores for multi-output regression that generalize their univariate counterparts. These scores ensure asymptotic conditional coverage while maintaining exact finite-sample marginal coverage. One class is compatible with any generative model, offering broad applicability, while the other is computationally efficient, leveraging the properties of invertible generative models. Finally, we conduct a comprehensive empirical evaluation across 13 tabular datasets, comparing all the multi-output conformal methods explored in this work. To ensure a fair and consistent comparison, all methods are implemented within a unified code base.
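For context, the univariate split-conformal recipe that such multi-output scores generalize can be sketched as follows (a standard textbook construction, not the paper's method; parameter names are illustrative):

```python
import numpy as np

def split_conformal_interval(residuals_cal, y_pred_test, alpha=0.1):
    """Split conformal prediction for univariate regression: calibrate on
    absolute residuals from a held-out set, then return an interval whose
    finite-sample marginal coverage is at least 1 - alpha under
    exchangeability of calibration and test points."""
    n = len(residuals_cal)
    # finite-sample corrected quantile level: ceil((n + 1) * (1 - alpha)) / n
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(np.abs(residuals_cal), level, method="higher")
    return y_pred_test - q, y_pred_test + q
```

The conformity score here is the absolute residual; the paper's contribution lies in multivariate generalizations of such scores.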
Paperid:2505
Authors:Li gengluo · Huawen Shen · Yu ZHOU
Abstract:
Chinese scene text retrieval is a practical task that aims to search for images containing visual instances of a Chinese query text. This task is extremely challenging because Chinese text often features complex and diverse layouts in real-world scenes. Current efforts tend to inherit the solution for English scene text retrieval, failing to achieve satisfactory performance. In this paper, we establish a Diversified Layout benchmark for Chinese Street View Text Retrieval (DL-CSVTR), which is specifically designed to evaluate retrieval performance across various text layouts, including vertical, cross-line, and partial alignments. To address the limitations in existing methods, we propose Chinese Scene Text Retrieval CLIP (CSTR-CLIP), a novel model that integrates global visual information with multi-granularity alignment training. CSTR-CLIP applies a two-stage training process to overcome previous limitations, such as the exclusion of visual features outside the text region and reliance on single-granularity alignment, thereby enabling the model to effectively handle diverse text layouts. Experiments on the existing benchmark show that CSTR-CLIP outperforms the previous state-of-the-art model by 18.82% in accuracy while also providing faster inference speed. Further analysis on DL-CSVTR confirms the superior performance of CSTR-CLIP in handling various text layouts. The dataset and code will be publicly available to facilitate research in Chinese scene text retrieval.
Paperid:2506
Authors:Yuanchen Wu · Ke Yan · Shouhong Ding · Ziyin Zhou · Xiaoqiang Li
Abstract:
Large Vision-Language Models (LVLMs) have manifested strong visual question answering capability. However, they still struggle to align the rationale and the generated answer, leading to inconsistent reasoning and incorrect responses. To this end, this paper introduces the Self-Rationale Calibration (SRC) framework to iteratively calibrate the alignment between rationales and answers. SRC begins by employing a lightweight “rationale fine-tuning” approach, which modifies the model’s response format to require a rationale before deriving the answer, without explicit prompts. Next, SRC searches a diverse set of candidate responses from the fine-tuned LVLMs for each sample, followed by a proposed pairwise scoring strategy using a tailored scoring model, R-Scorer, to evaluate both the rationale quality and the factual consistency of candidates. Based on a confidence-weighted preference curation process, SRC decouples the alignment calibration into a preference fine-tuning manner, leading to significant improvements of LVLMs in perception, reasoning, and generalization across multiple benchmarks. Our results emphasize the rationale-oriented alignment in exploring the potential of LVLMs.
Paperid:2507
Authors:Beom Jin Kang · NamJoon Kim · Hyun Kim
Abstract:
Recently, transformer-based models have demonstrated state-of-the-art performance across various computer vision tasks, including image classification, detection, and segmentation. However, their substantial parameter count poses significant challenges for deployment in resource-constrained environments such as edge or mobile devices. Low-rank approximation (LRA) has emerged as a promising model compression technique, effectively reducing the number of parameters in transformer models by decomposing high-dimensional weight matrices into low-rank representations. Nevertheless, matrix decomposition inherently introduces information loss, often leading to a decline in model accuracy. Furthermore, existing studies on LRA largely overlook the quantization process, which is a critical step in deploying practical vision transformer (ViT) models. To address these challenges, we propose a robust LRA framework that preserves weight information after matrix decomposition and incorporates quantization tailored to LRA characteristics. First, we introduce a reparameterizable branch-based low-rank approximation (RB-LRA) method coupled with weight reconstruction to minimize information loss during matrix decomposition. Subsequently, we enhance model accuracy by integrating RB-LRA with knowledge distillation techniques. Lastly, we present an LRA-aware quantization method designed to mitigate the large outliers generated by LRA, thereby improving the robustness of the quantized model. To validate the effectiveness of our approach, we conducted extensive experiments on the ImageNet dataset using various ViT-based models. Notably, the Swin-B model with RB-LRA achieved a 31.8\% reduction in parameters and a 30.4\% reduction in GFLOPs, with only a 0.03\% drop in accuracy. Furthermore, incorporating the proposed LRA-aware quantization method reduced accuracy loss by an additional 0.83\% compared to naive quantization.
Paperid:2508
Authors:Canzhe Zhao · Yutian Cheng · Jing Dong · Baoxiang Wang · Shuai Li
Abstract:
We investigate learning approximate Nash equilibrium (NE) policy profiles in two-player zero-sum imperfect information extensive-form games (IIEFGs) with last-iterate convergence guarantees. Existing algorithms either rely on full-information feedback or provide only asymptotic convergence rates. In contrast, we focus on the bandit feedback setting, where players receive feedback solely from the rewards associated with the experienced information set and action pairs in each episode. Our proposed algorithm employs a negentropy regularizer weighted by a "virtual transition" over the information set-action space. Through a carefully designed virtual transition and leveraging the entropy regularization technique, we demonstrate finite-time last-iterate convergence to the NE at a rate of $\widetilde{\mathcal{O}}(k^{-\frac{1}{8}})$ under bandit feedback, where $k$ is the episode index. Additionally, our algorithm admits an efficient closed-form solution for approximate policy updates. Empirical evaluations across various IIEFG instances show its competitive performance compared to baseline methods.
Paperid:2509
Authors:Timo Kaiser · Thomas Norrenbrock · Bodo Rosenhahn
Abstract:
The introduction of the Segment Anything Model (SAM) has paved the way for numerous semantic segmentation applications. For several tasks, quantifying the uncertainty of SAM is of particular interest. However, the ambiguous nature of the class-agnostic foundation model SAM challenges current uncertainty quantification (UQ) approaches. This paper presents a theoretically motivated uncertainty quantification model based on a Bayesian entropy formulation jointly respecting aleatoric, epistemic, and the newly introduced task uncertainty. We use this formulation to train USAM, a lightweight post-hoc UQ method. Our model traces the root of uncertainty back to under-parameterised models, insufficient prompts or image ambiguities. Our proposed deterministic USAM demonstrates superior predictive capabilities on the SA-V, MOSE, ADE20k, DAVIS, and COCO datasets, offering a computationally cheap and easy-to-use UQ alternative that can support user-prompting, enhance semi-supervised pipelines, or balance the tradeoff between accuracy and cost efficiency.
Paperid:2510
Authors:Yunhong Lu · Qichao Wang · Hengyuan Cao · Xiaoyin Xu · Min Zhang
Abstract:
Direct Preference Optimization (DPO) aligns text-to-image (T2I) generation models with human preferences using pairwise preference data. Although substantial resources are expended in collecting and labeling datasets, a critical aspect is often neglected: preferences vary across individuals and should be represented with more granularity. To address this, we propose SmPO-Diffusion, a novel method for modeling preference distributions to improve the DPO objective, along with a numerical upper bound estimation for the diffusion optimization objective. First, we introduce a smoothed preference distribution to replace the original binary distribution. We employ a reward model to simulate human preferences and apply preference likelihood averaging to improve the DPO loss, such that the loss function approaches zero when preferences are similar. Furthermore, we utilize an inversion technique to simulate the trajectory preference distribution of the diffusion model, enabling more accurate alignment with the optimization objective. Our approach effectively mitigates issues of excessive optimization and objective misalignment present in existing methods through straightforward modifications. Experimental results demonstrate that our method achieves state-of-the-art performance in offline preference alignment, surpassing baselines in human preference evaluation tasks across various metrics, while significantly reducing the consumption of training resources.
Paperid:2511
Authors:Hongyi Ling · Haiyang Yu · Zhimeng Jiang · Na Zou · Shuiwang Ji
Abstract:
We consider explainability in equivariant graph neural networks for 3D geometric graphs. While many XAI methods have been developed for analyzing graph neural networks, they predominantly target 2D graph structures. The complex nature of 3D data and the sophisticated architectures of equivariant GNNs present unique challenges. Current XAI techniques either struggle to adapt to equivariant GNNs or fail to effectively handle positional data and evaluate the significance of geometric features adequately. To address these challenges, we introduce a novel method, known as EquiGX, which uses the Deep Taylor decomposition framework to extend the layer-wise relevance propagation rules tailored for spherical equivariant GNNs. Our approach decomposes prediction scores and back-propagates the relevance scores through each layer to the input space. Our decomposition rules provide a detailed explanation of each layer’s contribution to the network’s predictions, thereby enhancing our understanding of how geometric and positional data influence the model’s outputs. Through experiments on both synthetic and real-world datasets, our method demonstrates its capability to identify critical geometric structures and outperform alternative baselines. These results indicate that our method provides significantly enhanced explanations for equivariant GNNs.
Paperid:2512
Authors:Wenhao Zhao · Qiushui Xu · Linjie Xu · Lei Song · Jinyu Wang · Chunlai Zhou · Jiang Bian
Abstract:
Recently, incorporating knowledge from pre-trained language models (PLMs) into decision transformers (DTs) has attracted significant attention in offline reinforcement learning (RL). These PLMs perform well in RL tasks, raising an intriguing question: what kind of knowledge from PLMs has been transferred to RL to achieve such good results? This work first dives into this problem by analyzing each attention head quantitatively and identifies the Markov head, a crucial component that exists in the attention heads of PLMs. It leads to extreme attention on the last input token and performs well only in short-term environments. Furthermore, we prove that this extreme attention cannot be changed by re-training the embedding layer or by fine-tuning. Inspired by our analysis, we propose a general method, GPT-DTMA, which equips a pretrained DT with a Mixture of Attention (MoA) to enable adaptive learning and accommodate diverse attention requirements during fine-tuning. Extensive experiments demonstrate the effectiveness of GPT-DTMA: it achieves superior performance in short-term environments compared to baselines, significantly reduces the performance gap of PLMs in long-term scenarios, and the experimental results also validate our theorems.
Paperid:2513
Authors:Mustafa Burak Gurbuz · Xingyu Zheng · Constantine Dovrolis
Abstract:
As deep learning continues to be driven by ever-larger datasets, understanding which examples are most important for generalization has become a critical question. While progress in data selection continues, emerging applications require studying this problem in dynamic contexts. To bridge this gap, we pose the Incremental Data Selection (IDS) problem, where examples arrive as a continuous stream, and need to be selected without access to the full data source. In this setting, the learner must incrementally build a training dataset of predefined size while simultaneously learning the underlying task. We find that in IDS, the impact of a new sample on the model state depends fundamentally on both its geometric relationship in the feature space and its prediction error. Leveraging this insight, we propose PEAKS (Prediction Error Anchored by Kernel Similarity), an efficient data selection method tailored for IDS. Our comprehensive evaluations demonstrate that PEAKS consistently outperforms existing selection strategies. Furthermore, PEAKS yields increasingly better performance returns than random selection as training data size grows on real-world datasets.
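The insight that a sample's value depends on both its prediction error and its geometric relationship to already-selected data could be scored, for intuition, along these lines (a hypothetical sketch; the exact PEAKS formula is not given in the abstract and this combination is ours):

```python
import numpy as np

def peaks_style_score(feat_new, err_new, feats_selected, gamma=1.0):
    """Illustrative scoring in the spirit of PEAKS: favor stream samples
    with high prediction error that are dissimilar (low RBF kernel
    similarity) to already-kept samples, so redundant points score low."""
    if len(feats_selected) == 0:
        return err_new
    d2 = ((feats_selected - feat_new) ** 2).sum(axis=1)
    sim = np.exp(-gamma * d2).max()   # similarity to the nearest kept sample
    return err_new * (1.0 - sim)      # anchor the error by redundancy
```

Under this toy score, a novel high-error point outranks an exact duplicate of a kept point, matching the qualitative behavior the abstract describes.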
Paperid:2514
Authors:Carlos Stein Naves de Brito
Abstract:
Model regularization requires extensive manual tuning to balance complexity against overfitting. Cross-regularization resolves this tradeoff by computing validation gradients that directly adapt regularization parameters during training. The method splits parameter optimization: training data guides feature learning while validation data shapes complexity controls, converging provably to cross-validation optima with computational cost scaling only in the regularization dimension. When implemented through noise injection in neural networks, this approach reveals striking patterns: unexpectedly high noise tolerance and architecture-specific regularization that emerges organically during training. Beyond complexity control, the framework integrates seamlessly with data augmentation and uncertainty calibration while maintaining single-run efficiency through a simple gradient-based approach.
Paperid:2515
Authors:Cheng Tang · Zhishuai Liu · Pan Xu
Abstract:
The Robust Regularized Markov Decision Process (RRMDP) is a recently proposed framework for learning policies robust to dynamics shifts by adding a regularization term on the transition dynamics in the value function. Nevertheless, existing RRMDP methods often rely on unstructured regularization, which can lead to overly conservative policies that consider unrealistic transitions. To address these limitations, we propose a novel framework, the $d$-rectangular linear robust regularized Markov decision process ($d$-RRMDP), which introduces a linear latent structure into both transition kernels and regularization. For the offline reinforcement learning setting, where an agent learns robust policies from a pre-collected dataset in the nominal environment, we develop a family of algorithms, Robust Regularized Pessimistic Value Iteration (R2PVI), that employ linear function approximation and $f$-divergence-based regularization terms on transition kernels. Our design enables the agent to learn robust policies while maintaining computational efficiency. We provide instance-dependent upper bounds on the suboptimality gap of R2PVI policies, demonstrating that these bounds are influenced by how well the dataset covers state-action spaces visited by the optimal robust policy under robustly admissible transitions. Notably, we establish information-theoretic lower bounds for $d$-RRMDPs, which demonstrate that our approach is fundamental to learning robust policies. Finally, numerical experiments validate that R2PVI learns robust policies and exhibits superior computational efficiency compared to methods designed for $d$-DRMDPs.
Paperid:2516
Authors:Jieting Wang · ZhangZelong Zhang · Feijiang Li · Yuhua Qian · Xinyan Liang
Abstract:
Deep learning has been widely applied due to its powerful representation ability. Intuitively, the quality of this representation ability can be measured by sample similarity. Recently, an unsupervised metric was proposed to quantify the information content of similarity matrices, focusing on the discriminative power between sample similarities. However, in classification tasks, class-level discrimination is more essential than pairwise similarity. In this paper, we propose a novel loss function that measures the discriminative ability of representations by computing the Euclidean distance between the similarity matrix and the true adjacency matrix. We prove that random consistency affects the category bias and value bias of evaluation results. To mitigate the random consistency in the Euclidean distance, we derive its expectation when the permutation labels follow a uniform distribution and, based on this, provide an analytical solution for a pure Euclidean distance. Theoretical analysis shows that the pure Euclidean distance exhibits heterogeneity and unbiasedness. In addition, we provide a generalization performance bound for the pure Euclidean distance based on the exponential Orlicz norm, verifying its learnability. Experimental results show that our method significantly outperforms traditional loss functions, enhancing accuracy, $F_1$ score, and the model's ability to differentiate class structures. (The code will be released.)
Paperid:2517
Authors:Shikai Qiu · Lechao Xiao · Andrew Wilson · Jeffrey Pennington · Atish Agarwala
Abstract:
The recent trend towards large models was sparked by the observation that the compute-loss Pareto frontier follows an empirically predictable power law. We demonstrate that the full loss curves across model sizes under compute-optimal training exhibit an even more remarkable universality: they collapse to a single universal curve after a simple affine rescaling. We dub this phenomenon supercollapse, after the observation that the deviations from the universal curve are smaller than fluctuations due to stochastic optimization. We show that supercollapse occurs across a large variety of learning rate schedules, and requires accurate estimates of the compute-optimal data exponent and the irreducible loss, suggesting that supercollapse itself can be used to improve scaling predictions. We prove that training to a power-law Pareto frontier is in fact necessary for the emergence of a universal curve, and becomes sufficient for loss curves consisting of certain sums of power laws. By studying the effect of optimization noise, we show that learning rate schedules deform the loss curves in a predictable and scale-invariant way by annealing the optimization noise while preserving the collapse. This noise annealing process also explains the high degree of collapse at late times. Taken together, our findings suggest that studying loss curve universality is a valuable path to improving our quantitative understanding of the behavior of large ML systems.
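The affine-rescaling collapse can be illustrated on synthetic power-law loss curves (our toy example; the paper's actual rescaling is built from estimates of the compute-optimal data exponent and irreducible loss):

```python
import numpy as np

def rescale_curve(steps, losses, loss_inf):
    """Affine rescaling sketch: subtract the irreducible loss and
    normalize both axes, so curves of the form L(t) = L_inf + c * t^{-p}
    with the same exponent p collapse onto one universal curve."""
    excess = losses - loss_inf
    return steps / steps[-1], excess / excess[-1]

# two synthetic "model sizes" sharing exponent p = 0.5 but different prefactors
t = np.arange(1, 101, dtype=float)
x1, y1 = rescale_curve(t, 2.0 + 3.0 * t ** -0.5, loss_inf=2.0)
x2, y2 = rescale_curve(t, 2.0 + 7.0 * t ** -0.5, loss_inf=2.0)
```

After rescaling, `(x1, y1)` and `(x2, y2)` coincide exactly in this idealized case; in practice the deviations are what the supercollapse observation bounds.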
Paperid:2518
Authors:Zeming Wei · Yiwen Guo · Yisen Wang
Abstract:
Adversarial training (AT) has been considered one of the most effective methods for making deep neural networks robust against adversarial attacks, while the training mechanisms and dynamics of AT remain open research problems. In this paper, we present a novel perspective of studying AT based on class-wise feature attribution. Specifically, we identify the impact of a key family of features on AT that are shared by multiple classes, which we call cross-class features. These features are typically useful for robust classification, which we offer theoretical evidence to illustrate through a synthetic data model. We find that during the initial stage of AT, the model tends to learn more cross-class features until the best robustness checkpoint. As AT further squeezes the training robust loss and causes robust overfitting, we find that the model tends to make decisions based on more class-specific features. This hypothesis is verified over various model architectures and settings. Furthermore, our discovery can provide a unified view of two existing properties of AT: the advantage of soft-label training and robust overfitting. Overall, this discovery refines the current understanding of AT mechanisms and provides new perspectives to study them.
Paperid:2519
Authors:Tin Dizdarevic · Tobias Gessler · Ravi Hammond · Anisoara Calinescu · Jonathan Cook · Matteo Gallici · Andrei Lupu · Jakob Foerster
Abstract:
Achieving seamless coordination between AI agents and humans is crucial for real-world applications, yet it remains a significant open challenge. Hanabi is a cooperative card game featuring imperfect information, constrained communication, theory of mind requirements, and coordinated action -- making it an ideal testbed for human-AI coordination. However, its use for human-AI interaction has been limited by the challenges of human evaluation. In this work, we introduce the Ad-Hoc Human-AI Coordination Challenge (AH2AC2) to overcome the constraints of costly and difficult-to-reproduce human evaluations. We develop \textit{human proxy agents} trained on a large-scale human dataset, which serve as robust, cheap, and reproducible human-like evaluation partners in AH2AC2. To encourage the development of data-efficient methods, we open-source a dataset of 3,079 games, deliberately limiting the amount of available human gameplay data. We present baseline results for both two- and three-player Hanabi scenarios. To ensure fair evaluation, we host the proxy agents through a controlled evaluation system rather than releasing them publicly. Code and data are available as an \href{https://anonymous.4open.science/r/ah2ac2-2FDA}{anonymous repository}.
Paperid:2520
Authors:Linhao Luo · Zicheng Zhao · Reza Haffari · Yuan-Fang Li · Chen Gong · Shirui Pan
Abstract:
Large language models (LLMs) have demonstrated impressive reasoning abilities, but they still struggle with faithful reasoning due to knowledge gaps and hallucinations. To address these issues, knowledge graphs (KGs) have been utilized to enhance LLM reasoning through their structured knowledge. However, existing KG-enhanced methods, either retrieval-based or agent-based, encounter difficulties in accurately retrieving knowledge and efficiently traversing KGs at scale. In this work, we introduce graph-constrained reasoning (GCR), a novel framework that bridges structured knowledge in KGs with unstructured reasoning in LLMs. To eliminate hallucinations, GCR ensures faithful KG-grounded reasoning by integrating KG structure into the LLM decoding process through KG-Trie, a trie-based index that encodes KG reasoning paths. KG-Trie constrains the decoding process, allowing LLMs to directly reason on graphs and generate faithful reasoning paths grounded in KGs. Additionally, GCR leverages a lightweight KG-specialized LLM for graph-constrained reasoning alongside a powerful general LLM for inductive reasoning over multiple reasoning paths, resulting in accurate reasoning with zero reasoning hallucination. Extensive experiments on several KGQA benchmarks demonstrate that GCR achieves state-of-the-art performance and exhibits strong zero-shot generalizability to unseen KGs without additional training.
Paperid:2521
Authors:Tal Lancewicki · Yishay Mansour
Abstract:
We study online finite-horizon Markov Decision Processes with adversarially changing loss and aggregate bandit feedback (a.k.a. full-bandit). Under this type of feedback, the agent observes only the total loss incurred over the entire trajectory, rather than the individual losses at each intermediate step within the trajectory. We introduce the first Policy Optimization algorithms for this setting. In the known-dynamics case, we achieve the first *optimal* regret bound of $\tilde \Theta(H^2\sqrt{SAK})$, where $K$ is the number of episodes, $H$ is the episode horizon, $S$ is the number of states, and $A$ is the number of actions. In the unknown-dynamics case, we establish a regret bound of $\tilde O(H^3 S \sqrt{AK})$, significantly improving the best known result by a factor of $H^2 S^5 A^2$.
Paperid:2522
Authors:Xiaohui Chen · Yinkai Wang · JIAXING HE · Yuanqi Du · Soha Hassoun · Xiaolin Xu · Liping Liu
Abstract:
Graph generation is a critical task in numerous domains, including molecular design and social network analysis, due to its ability to model complex relationships and structured data. While most modern graph generative models utilize adjacency matrix representations, this work revisits an alternative approach that represents graphs as sequences of node and edge sets. We advocate for this approach due to its efficient encoding of graphs and propose a novel representation. Based on this representation, we introduce the Graph Generative Pretrained Transformer (G2PT), an auto-regressive model that learns graph structures via next-token prediction. To further exploit G2PT's capabilities as a general-purpose foundation model, we explore fine-tuning strategies for two downstream applications: goal-oriented generation and graph property prediction. We conduct extensive experiments across multiple datasets. Results indicate that G2PT achieves superior generative performance on both generic graph and molecule datasets. Furthermore, G2PT exhibits strong adaptability and versatility in downstream tasks, from molecular design to property prediction.
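The node-set-then-edge-set serialization that makes next-token prediction applicable to graphs can be sketched as follows (an illustration in the spirit of the described representation; the token vocabulary here is hypothetical, not G2PT's actual one):

```python
def graph_to_sequence(num_nodes, edges):
    """Serialize a graph as a flat token sequence: first the node set,
    then the edge set, so an autoregressive transformer can be trained
    with a plain next-token-prediction objective."""
    tokens = ["<bos>"]
    tokens += [f"n{i}" for i in range(num_nodes)]  # node set
    tokens.append("<sep>")
    for u, v in edges:                             # edge set, as node pairs
        tokens += [f"n{u}", f"n{v}"]
    tokens.append("<eos>")
    return tokens
```

For a 3-node path graph, `graph_to_sequence(3, [(0, 1), (1, 2)])` yields the sequence `<bos> n0 n1 n2 <sep> n0 n1 n1 n2 <eos>`, which is typically more compact than a flattened adjacency matrix for sparse graphs.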
Paperid:2523
Authors:Tobias Pielok · Bernd Bischl · David Rügamer
Abstract:
Recent years have witnessed growing interest in semi-implicit variational inference (SIVI) methods due to their ability to rapidly generate samples from highly complicated distributions. However, since the likelihood of these samples is non-trivial to estimate in high dimensions, current research focuses on finding effective SIVI training routines. While unbiased implicit variational inference (UIVI) has largely been dismissed as imprecise and computationally prohibitive because of its inner MCMC loop, we revisit this method and identify key shortcomings. In particular, we show that UIVI's MCMC loop can be effectively replaced via importance sampling and that the optimal proposal distribution can be learned stably by minimizing an expected forward Kullback–Leibler divergence without bias. Our refined approach demonstrates performance superior to or on par with state-of-the-art methods on established SIVI benchmarks.
Paperid:2524
Authors:Zikai Zhou · Qizheng Zhang · Hermann Kumbong · Kunle Olukotun
Abstract:
Fine-tuning large language models (LLMs) is increasingly costly as models scale to hundreds of billions of parameters, and even parameter-efficient fine-tuning (PEFT) methods like LoRA remain resource-intensive. We introduce LowRA, the first framework to enable LoRA fine-tuning below 2 bits per parameter with minimal performance loss. LowRA optimizes fine-grained quantization (mapping, threshold selection, and precision assignment) while leveraging efficient CUDA kernels for scalable deployment. Extensive evaluations across 4 LLMs and 4 datasets show that LowRA achieves a superior performance-precision trade-off above 2 bits and remains accurate down to 1.15 bits, reducing memory usage by up to 50\%. Our results highlight the potential of ultra-low-bit LoRA fine-tuning for resource-constrained environments.
Paperid:2525
Authors:Walid Bendada · Guillaume Salha-Galvan · Romain Hennequin · Théo Bontempelli · Thomas Bouabca · Tristan Cazenave
Abstract:
This paper introduces von Mises-Fisher exploration (vMF-exp), a scalable method for exploring large action sets in reinforcement learning problems where hyperspherical embedding vectors represent these actions. vMF-exp involves initially sampling a state embedding representation using a von Mises-Fisher distribution, then exploring this representation's nearest neighbors, which scales to virtually unlimited numbers of candidate actions. We show that, under theoretical assumptions, vMF-exp asymptotically maintains the same probability of exploring each action as Boltzmann Exploration (B-exp), a popular alternative that, nonetheless, suffers from scalability issues as it requires computing softmax values for each action. Consequently, vMF-exp serves as a scalable alternative to B-exp for exploring large action sets with hyperspherical embeddings. Experiments on simulated data, real-world public data, and the successful large-scale deployment of vMF-exp on the recommender system of a global music streaming service empirically validate the key properties of the proposed method.
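The sampling-then-nearest-neighbor scheme this abstract describes can be sketched in a few lines of NumPy. This is a hedged illustration only: the function names are invented, and the use of Wood's (1994) rejection algorithm for von Mises-Fisher sampling is an assumption, not a detail taken from the paper.

```python
import numpy as np

def sample_vmf(mu, kappa, rng):
    """Draw one sample from a von Mises-Fisher distribution on the unit
    sphere with mean direction `mu` and concentration `kappa`, using
    Wood's rejection algorithm."""
    d = mu.shape[0]
    # Rejection-sample the component w of the result along mu.
    b = (-2.0 * kappa + np.sqrt(4.0 * kappa**2 + (d - 1) ** 2)) / (d - 1)
    x0 = (1.0 - b) / (1.0 + b)
    c = kappa * x0 + (d - 1) * np.log(1.0 - x0**2)
    while True:
        z = rng.beta((d - 1) / 2.0, (d - 1) / 2.0)
        w = (1.0 - (1.0 + b) * z) / (1.0 - (1.0 - b) * z)
        if kappa * w + (d - 1) * np.log(1.0 - x0 * w) - c >= np.log(rng.uniform()):
            break
    # Uniform direction in the tangent space orthogonal to mu.
    v = rng.normal(size=d)
    v -= v.dot(mu) * mu
    v /= np.linalg.norm(v)
    return w * mu + np.sqrt(max(1.0 - w**2, 0.0)) * v

def vmf_explore(state_emb, action_embs, kappa, rng):
    """Sample a perturbed state direction, then return the index of the
    nearest action embedding (the scalable nearest-neighbor step)."""
    mu = state_emb / np.linalg.norm(state_emb)
    direction = sample_vmf(mu, kappa, rng)
    return int(np.argmax(action_embs @ direction))
```

In practice the `argmax` over all actions would itself be replaced by an approximate nearest-neighbor index, which is what makes the approach scale to very large action sets.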
Paperid:2526
Authors:Jiachen Yang · Yusong Wang · Yanmei Fang · Yunshu Dai · Fangjun Huang
Abstract:
Latent Diffusion Models (LDMs) enable fine-tuning with only a few images and have become widely used on the Internet. However, they can also be misused to generate fake images, leading to privacy violations and social risks. Existing adversarial attack methods primarily introduce noise distortions to generated images but fail to completely erase identity semantics. In this work, we identify the variance of VAE latent code as a key factor that influences image distortion. Specifically, larger variances result in stronger distortions and ultimately erase semantic information. Based on this finding, we propose a Laplace-based (LA) loss function that optimizes along the fastest variance growth direction, ensuring each optimization step is locally optimal. Additionally, we analyze the limitations of existing methods and reveal that their loss functions often fail to align gradient signs with the direction of variance growth. They also struggle to ensure efficient optimization under different variance distributions. To address these issues, we further propose a novel Lagrange Entropy-based (LE) loss function. Experimental results demonstrate that our methods achieve state-of-the-art performance on CelebA-HQ and VGGFace2. Both proposed loss functions effectively lead diffusion models to generate pure-noise images with identity semantics completely erased. Furthermore, our methods exhibit strong transferability across diverse models and efficiently complete attacks with minimal computational resources. Our work provides a practical and efficient solution for privacy protection.
Paperid:2527
Authors:Nora Schneider · Lars Lorch · Niki Kilbertus · Bernhard Schölkopf · Andreas Krause
Abstract:
We consider the problem of predicting perturbation effects via causal models. In many applications, it is a priori unknown which mechanisms of a system are modified by an external perturbation, even though the features of the perturbation are available. For example, in genomics, some properties of a drug may be known, but not their causal effects on the regulatory pathways of cells. We propose a generative intervention model (GIM) that learns to map these perturbation features to distributions over atomic interventions in a jointly estimated causal model. Contrary to prior approaches, this enables us to predict the distribution shifts of unseen perturbation features while gaining insights about their mechanistic effects in the underlying data-generating process. On synthetic data and scRNA-seq drug perturbation data, GIMs achieve robust out-of-distribution predictions on par with unstructured approaches, while effectively inferring the underlying perturbation mechanisms, often better than other causal inference methods.
Paperid:2528
Authors:Xin Chen · Yarden As · Andreas Krause
Abstract:
Large language models (LLMs) have emerged as powerful tools but pose significant safety risks through harmful outputs and vulnerability to adversarial attacks. We propose SaP (short for Safety Polytope), a geometric approach to LLM safety that learns and enforces multiple linear safety constraints directly in the model’s representation space. Our framework identifies safe and unsafe regions via the polytope’s facets, enabling both detection and correction of unsafe outputs through geometric steering. Unlike existing approaches that modify model weights, SaP operates post hoc in the representation space, preserving model capabilities while enforcing safety constraints. Experiments across multiple LLMs demonstrate that our method reduces adversarial attack success rates from 30-60% to below 3% while maintaining performance on standard tasks. Analysis of the learned polytope facets reveals natural specialization in detecting different safety concepts, providing interpretable insights into its safety mechanisms.
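The core geometric idea (a safe region defined by linear facets, with detection and steering) can be illustrated with a minimal NumPy sketch. The function names, the half-space form `A h <= b`, and the one-pass projection rule below are illustrative assumptions; the paper learns the polytope and its correction mechanism rather than using this exact rule.

```python
import numpy as np

def polytope_violations(h, A, b):
    """Indices of violated facets for a hidden state h, where the safe
    region is the polytope {h : A h <= b}."""
    return np.flatnonzero(A @ h > b)

def steer_to_polytope(h, A, b):
    """Greedy geometric steering: for each violated half-space, project h
    onto the facet's boundary hyperplane. A single pass is shown; in
    general, repeated passes may be needed since projections can
    re-violate earlier facets."""
    h = h.copy()
    for i in polytope_violations(h, A, b):
        a = A[i]
        # Orthogonal projection onto the hyperplane a . h = b_i.
        h -= (a @ h - b[i]) / (a @ a) * a
    return h
```

Detection is then just `polytope_violations(h, A, b).size > 0`, and correction moves the representation back toward the safe region without touching model weights.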
Paperid:2529
Authors:Benjamin Moseley · Helia Niaparast · Karan Singh
Abstract:
Global minimum cut is a fundamental combinatorial optimization problem with wide-ranging applications. Often in practice, these problems are solved repeatedly on families of similar or related instances. However, the de facto algorithmic approach is to solve each instance of the problem from scratch, discarding information from prior instances. In this paper, we consider how predictions informed by prior instances can be used to warm-start practical minimum cut algorithms. The paper considers the widely used Karger's algorithm and its counterpart, the Karger-Stein algorithm. Given good predictions, we show these algorithms become near-linear time and have robust performance to erroneous predictions. Both of these algorithms are randomized edge-contraction algorithms. Our natural idea is to probabilistically prioritize the contraction of edges that are unlikely to be in the minimum cut.
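The closing idea, probabilistically prioritizing the contraction of edges that are unlikely to be in the minimum cut, can be sketched as a prediction-biased variant of Karger's contraction. The `1 - pred` weighting below is a hypothetical choice for illustration; the paper's actual warm-starting scheme may weight edges differently.

```python
import random

def predicted_karger(n, edges, pred, rng):
    """One trial of Karger's edge-contraction algorithm on vertices
    0..n-1, biased by predictions: pred[i] is the predicted probability
    that edges[i] crosses the minimum cut, so low-pred ("safe") edges
    are contracted first. Returns the surviving cut edges."""
    parent = list(range(n))

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = n
    weights = [1.0 - p for p in pred]  # contract likely-safe edges first
    while components > 2:
        i = rng.choices(range(len(edges)), weights=weights)[0]
        weights[i] = 0.0  # never pick the same edge twice
        ru, rv = find(edges[i][0]), find(edges[i][1])
        if ru != rv:  # contract: merge the two super-vertices
            parent[ru] = rv
            components -= 1
    # Edges still crossing the two super-vertices form the candidate cut.
    return [e for e in edges if find(e[0]) != find(e[1])]
```

As in plain Karger, the trial is repeated many times and the smallest cut found is kept; good predictions make each trial far more likely to preserve the true minimum cut.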
Paperid:2530
Authors:Yuhao Huang · Taos Transue · Shih-Hsin Wang · William Feldman · Hong Zhang · Bao Wang
Abstract:
Conditional flow matching (CFM) stands out as an efficient, simulation-free approach for training flow-based generative models, achieving remarkable performance for data generation. However, CFM is insufficient to ensure accuracy in learning probability paths. In this paper, we introduce a new partial differential equation characterization for the error between the learned and exact probability paths, along with its solution. We show that the total variation between probability paths is bounded above by a combination of the CFM loss and an associated divergence loss. This theoretical insight leads to the design of a new objective function that simultaneously matches the flow and its divergence. Our new approach improves the performance of the flow-based generative model by a noticeable margin without significantly raising the computational cost. We showcase the advantages of this enhanced training approach over CFM on several important benchmark tasks, including generative modeling for dynamical systems, DNA sequences, and videos.
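For readers unfamiliar with the baseline being extended, here is a minimal sketch of the standard CFM regression target for the linear (optimal-transport) conditional path. This shows only the vanilla CFM loss that the abstract says is insufficient on its own; the paper's contribution is the additional divergence-matching term, which is not reproduced here.

```python
import numpy as np

def cfm_pair(x0, x1, t):
    """Linear conditional path used in optimal-transport CFM: returns the
    interpolant x_t and the conditional velocity u_t that the vector-field
    network regresses onto. x0 is noise, x1 is data, t has shape (batch,)."""
    xt = (1.0 - t)[:, None] * x0 + t[:, None] * x1
    ut = x1 - x0  # constant along the straight-line path
    return xt, ut

def cfm_loss(v_pred, ut):
    """Per-batch CFM objective: mean squared error to the target velocity.
    The paper's divergence-matching loss would be added on top of this."""
    return float(np.mean((v_pred - ut) ** 2))
```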
Paperid:2531
Authors:Ayush Jain · Alexander Swerdlow · Yuzhou Wang · Sergio Arnaud · Ada Martin · Alexander Sax · Franziska Meier · Katerina Fragkiadaki
Abstract:
Progress in 3D vision-language learning has been hindered by the scarcity of large-scale 3D datasets. We introduce UniVLG, a unified architecture for 2D and 3D vision-language understanding that bridges the gap between existing 2D-centric models and the rich 3D sensory data available in embodied systems. Our approach initializes most model weights from pre-trained 2D models and trains on both 2D and 3D vision-language data. We propose a novel language-conditioned mask decoder shared across 2D and 3D modalities to ground objects effectively in both RGB and RGB-D images, outperforming box-based approaches. To further reduce the domain gap between 2D and 3D, we incorporate 2D-to-3D lifting strategies, enabling UniVLG to utilize 2D data to enhance 3D performance. With these innovations, our model achieves state-of-the-art performance across multiple 3D vision-language grounding tasks, demonstrating the potential of transferring advances from 2D vision-language learning to the data-constrained 3D domain. Furthermore, co-training on both 2D and 3D data enhances performance across modalities without sacrificing 2D capabilities. By removing the reliance on 3D mesh reconstruction and ground-truth object proposals, UniVLG sets a new standard for realistic, embodied-aligned evaluation.
Paperid:2532
Authors:Moritz Vandenhirtz · Julia Vogt
Abstract:
Understanding the decision-making process of machine learning models provides valuable insights into the task, the data, and the reasons behind a model's failures. In this work, we propose a method that performs inherently interpretable predictions through the instance-wise sparsification of input images. To align the sparsification with human perception, we learn the masking in the space of semantically meaningful pixel regions rather than at the pixel level. Additionally, we introduce an explicit way to dynamically determine the required level of sparsity for each instance. We show empirically on semi-synthetic and natural image datasets that our inherently interpretable classifier produces more meaningful, human-understandable predictions than state-of-the-art benchmarks.
Paperid:2533
Authors:The Viet Bui · Tien Mai · Thanh Nguyen
Abstract:
Inferring reward functions from demonstrations is a key challenge in reinforcement learning (RL), particularly in multi-agent RL (MARL). The large joint state-action spaces and intricate inter-agent interactions in MARL make inferring the joint reward function especially challenging. While prior studies in single-agent settings have explored ways to recover reward functions and expert policies from human preference feedback, such studies in MARL remain limited. Existing methods typically combine two separate stages, supervised reward learning and standard MARL algorithms, leading to unstable training processes. In this work, we exploit the inherent connection between reward functions and Q functions in cooperative MARL to introduce a novel end-to-end preference-based learning framework. Our framework is supported by a carefully designed multi-agent value decomposition strategy that enhances training efficiency. Extensive experiments on two state-of-the-art benchmarks, SMAC and MAMuJoCo, using preference data generated by both rule-based and large language model approaches demonstrate that our algorithm consistently outperforms existing methods across various tasks.
Paperid:2534
Authors:Joongkyu Lee · Min-hwan Oh
Abstract:
In this paper, we propose an improved online confidence bound for multinomial logistic (MNL) models and apply this result to MNL bandits, achieving variance-dependent optimal regret. Recently, Lee & Oh (2024) established an online confidence bound for MNL models and achieved nearly minimax-optimal regret in MNL bandits. However, their results still depend on the norm-boundedness of the unknown parameter $B$ and the maximum size of possible outcomes $K$. To address this, we first derive an online confidence bound of $\mathcal{O} (\sqrt{d \log t} + B )$, which is a significant improvement over the previous bound of $\mathcal{O} (B \sqrt{d} \log t \log K )$ (Lee & Oh, 2024). This is mainly achieved by establishing tighter self-concordant properties of the MNL loss and introducing a novel intermediary term to bound the estimation error. Using this new online confidence bound, we propose a constant-time algorithm, **OFU-MNL++**, which achieves a variance-dependent regret bound of $\mathcal{O} \Big( d \log T \sqrt{ \smash[b]{\sum_{t=1}^T} \sigma_t^2 } \Big) $ for sufficiently large $T$, where $\sigma_t^2$ denotes the variance of the rewards at round $t$, $d$ is the dimension of the contexts, and $T$ is the total number of rounds. Furthermore, we introduce a Maximum Likelihood Estimation (MLE)-based algorithm, **OFU-M$^2$NL**, which achieves an anytime, $\operatorname{poly}(B)$-free regret of $\mathcal{O} \Big( d \log (BT) \sqrt{ \smash[b]{\sum_{t=1}^T} \sigma_t^2 } \Big) $.
Paperid:2535
Authors:Vladimir Braverman · Jon C. Ergun · Chen Wang · Samson Zhou
Abstract:
Hierarchical clustering (HC) is an important data analysis technique in which the goal is to recursively partition a dataset into a tree-like structure while grouping together similar data points at each level of granularity. Unfortunately, for many of the proposed HC objectives, hardness-of-approximation results pose strong barriers to efficient approximation algorithms. We consider the problem of hierarchical clustering given auxiliary information from natural oracles in the learning-augmented framework. Our main results are algorithms that, given learning-augmented oracles, compute efficient approximate HC trees for the celebrated Dasgupta's and Moseley-Wang objectives that overcome known hardness barriers.
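As background for the objective being approximated, Dasgupta's cost of a binary HC tree is the sum, over similarity-weighted pairs, of the number of leaves under the pair's least common ancestor. A small sketch (the tree encoding as nested tuples and the function name are our own conventions, not the paper's):

```python
def dasgupta_cost(tree, w):
    """Dasgupta's HC objective for a binary tree given as nested 2-tuples
    with hashable leaf ids, and w mapping frozenset({i, j}) -> similarity.
    Each pair (i, j) contributes w(i, j) * |leaves(lca(i, j))|."""
    total = 0.0

    def rec(t):
        nonlocal total
        if not isinstance(t, tuple):
            return [t]  # a leaf
        left, right = rec(t[0]), rec(t[1])
        merged = left + right
        # Pairs split at this node have this node as their LCA.
        for i in left:
            for j in right:
                total += w.get(frozenset((i, j)), 0.0) * len(merged)
        return merged

    rec(tree)
    return total
```

A good HC tree separates dissimilar points high in the tree and keeps similar points together near the leaves, which is exactly what minimizing this cost rewards.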
Paperid:2536
Authors:Ho Young Jhoo · Chung-Kil Hur · Nuno P. Lopes
Abstract:
The memory requirements of machine learning (ML) models have been growing quickly. However, the memory capacity of GPUs has not kept pace. Despite significant research on reducing the memory usage of ML models, larger models do not fit in a single device. A popular solution to the memory capacity issue is to use multiple devices in parallel. In this paper, we focus on a particular form of parallelism called pipelining, as it offers a good balance between cost and performance for many ML models. We present Pfeife, the first tool that integrates with PyTorch to provide automatic pipelining of ML models. Pfeife intercepts the execution of models and parallelizes them transparently with no manual work required. We show that Pfeife is capable of running large models that would otherwise not run due to not fitting in a single device. Moreover, Pfeife can pipeline non-sequential models such as StableDiffusion that are not supported by previous pipelining parallelism tools. Pfeife outperforms state-of-the-art tools by up to 22%.
Paperid:2537
Authors:Thibaud Southiratn · Bonil Koo · Yijingxiu Lu · Sun Kim
Abstract:
Dual-target molecule generation, which focuses on discovering compounds capable of interacting with two target proteins, has garnered significant attention due to its potential for improving therapeutic efficiency, safety and resistance mitigation. Existing approaches face two critical challenges. First, they simplify the complex dual-target optimization problem to a linear combination of individual objectives, failing to capture important trade-offs between target engagement and molecular properties. Second, they typically do not integrate synthetic planning into the generative process. This highlights a need for more appropriate objective function design and synthesis-aware methodologies tailored to the dual-target molecule generation task. In this work, we propose CombiMOTS, a Pareto Monte Carlo Tree Search (PMCTS) framework that generates dual-target molecules. CombiMOTS is designed to explore a synthesizable fragment space while employing vectorized optimization constraints to encapsulate target affinity and physicochemical properties. Extensive experiments on real-world databases demonstrate that CombiMOTS produces novel dual-target molecules with high docking scores, enhanced diversity and balanced pharmacological characteristics, showcasing its potential as a powerful tool for dual-target drug discovery.
Paperid:2538
Authors:Jeonghye Kim · Yongjae Shin · Whiyoung Jung · Sunghoon Hong · Deunsol Yoon · Youngchul Sung · Kanghoon Lee · Woohyung Lim
Abstract:
Reinforcement learning with offline data suffers from Q-value extrapolation errors. To address this issue, we first demonstrate that linear extrapolation of the Q-function beyond the data range is particularly problematic. To mitigate this, we propose guiding the gradual decrease of Q-values outside the data range, which is achieved through reward scaling with layer normalization (RS-LN) and a penalization mechanism for infeasible actions (PA). By combining RS-LN and PA, we develop a new algorithm called PARS. We evaluate PARS across a range of tasks, demonstrating superior performance compared to state-of-the-art algorithms in both offline training and online fine-tuning on the D4RL benchmark, with notable success in the challenging AntMaze Ultra task.
Paperid:2539
Authors:Pascal Junior Tikeng Notsawo · Guillaume Dumas · Guillaume Rabusseau
Abstract:
Grokking refers to a delayed generalization following overfitting when optimizing artificial neural networks with gradient-based methods. In this work, we demonstrate that grokking can be induced by regularization, either explicit or implicit. More precisely, we show that when there exists a model with a property $P$ (e.g., sparse or low-rank weights) that generalizes on the problem of interest, gradient descent with a small but non-zero regularization of $P$ (e.g., $\ell_1$ or nuclear norm regularization) results in grokking. This extends previous work showing that small non-zero weight decay induces grokking. Moreover, our analysis shows that over-parameterization by adding depth makes it possible to grok or ungrok without explicitly using regularization, which is impossible in shallow cases. We further show that the $\ell_2$ norm of the model parameters cannot be used as an indicator of grokking in a general setting in place of the regularized property $P$: the $\ell_2$ norm grows in many cases where no weight decay is used, but the model generalizes anyway. We also show that grokking can be amplified through only data selection (with any other hyperparameter fixed).
Paperid:2540
Authors:Ruipeng Zhang · Ya-Chien Chang · Sicun Gao
Abstract:
The Maximum Entropy Reinforcement Learning (MaxEnt RL) framework is a leading approach for achieving efficient learning and robust performance across many RL tasks. However, MaxEnt methods have also been shown to struggle with performance-critical control problems in practice, where non-MaxEnt algorithms can successfully learn. In this work, we analyze how the trade-off between robustness and optimality affects the performance of MaxEnt algorithms in complex control tasks: while entropy maximization enhances exploration and robustness, it can also mislead policy optimization, leading to failure in tasks that require precise, low-entropy policies. Through experiments on a variety of control problems, we concretely demonstrate this misleading effect. Our analysis leads to better understanding of how to balance reward design and entropy maximization in challenging control problems.
Paperid:2541
Authors:Sophia Tang · Yinuo Zhang · Pranam Chatterjee
Abstract:
We present PepTune, a multi-objective discrete diffusion model for the simultaneous generation and optimization of therapeutic peptide SMILES. Built on the Masked Discrete Language Model (MDLM) framework, PepTune ensures valid peptide structures with bond-dependent masking schedules and penalty-based objectives. To guide the diffusion process, we propose a Monte Carlo Tree Search (MCTS)-based strategy that balances exploration and exploitation to iteratively refine Pareto-optimal sequences. MCTS integrates classifier-based rewards with search-tree expansion, overcoming gradient estimation challenges and data sparsity. Using PepTune, we generate diverse, chemically modified peptides simultaneously optimized for multiple therapeutic properties, including target binding affinity, membrane permeability, solubility, hemolysis, and non-fouling for various disease-relevant targets. Overall, our results demonstrate that MCTS-guided masked discrete diffusion is a powerful and modular approach for multi-objective sequence design in discrete state spaces.
Paperid:2542
Authors:Yang Li · Jiale Ma · Yebin Yang · Qitian Wu · Hongyuan Zha · Junchi Yan
Abstract:
Directly predicting labels from data inputs has been a longstanding default mechanism for label learning tasks, e.g., supervised learning. Its trade-off between feature compression and label prediction is studied under the information theory framework, which typically assumes that the information content of labels is significantly less than that of data inputs, leading to model designs that prioritize compressing and extracting features from data inputs. However, recent prediction tasks often involve complex labels, exacerbating the challenge of learning mappings from learned features to high-fidelity label representations. To this end, we draw inspiration from the consistency training concept in generative consistency models and propose predictive consistency learning (PCL), a novel learning paradigm that decomposes the full label information into a progressive learning procedure, mitigating the label capture challenge. Unlike traditional approaches, PCL not only predicts labels directly from data inputs but also receives input from noise-perturbed labels as an additional reference, pursuing predictive consistency across different noise levels. It simultaneously learns the relationship between latent features and a spectrum of label information from zero to complete, which enables progressive learning for complex predictions and allows multi-step inference analogous to gradual denoising, thereby enhancing the prediction quality. Experiments on vision, text, and graph tasks show the superiority of PCL over conventional supervised training in complex label prediction tasks.
Paperid:2543
Authors:Junwei Su · shan Wu
Abstract:
Graph diffusion generative models (GDGMs) have emerged as powerful tools for generating high-quality graphs. However, their broader adoption faces challenges in \emph{scalability and size generalization}. GDGMs struggle to scale to large graphs due to their high memory requirements, as they typically operate in the full graph space, requiring the entire graph to be stored in memory during training and inference. This constraint limits their feasibility for large-scale real-world graphs. GDGMs also exhibit poor size generalization, with limited ability to generate graphs of sizes different from those in the training data, restricting their adaptability across diverse applications. To address these challenges, we propose the stochastic block graph diffusion (SBGD) model, which refines graph representations into a block graph space. This space incorporates structural priors based on real-world graph patterns, significantly reducing memory complexity and enabling scalability to large graphs. The block representation also improves size generalization by capturing fundamental graph structures. Empirical results show that SBGD achieves significant memory improvements (up to 6$\times$) while maintaining comparable or even superior graph generation performance relative to state-of-the-art methods. Furthermore, experiments demonstrate that SBGD better generalizes to unseen graph sizes. The significance of SBGD extends beyond being a scalable and effective GDGM; \emph{it also exemplifies the principle of modularization in generative modelling, offering a new avenue for exploring generative models by decomposing complex tasks into more manageable components.}
Paperid:2544
Authors:Zijun Liao · Jinbiao Chen · Debing Wang · Zizhen Zhang · Jiahai Wang
Abstract:
Neural Combinatorial Optimization (NCO) has emerged as a promising approach for NP-hard problems. However, prevailing RL-based methods suffer from low sample efficiency due to sparse rewards and underused solutions. We propose Preference Optimization for Combinatorial Optimization (POCO), a training paradigm that leverages solution preferences via objective values. It introduces: (1) an efficient preference-pair construction that better explores and exploits solutions, and (2) a novel loss function that adaptively scales gradients via objective differences, removing reliance on reward models or reference policies. Experiments on Job-Shop Scheduling (JSP), Traveling Salesman (TSP), and Flexible Job-Shop Scheduling (FJSP) show POCO outperforms state-of-the-art neural methods, substantially reducing optimality gaps with efficient inference. POCO is architecture-agnostic, enabling seamless integration with existing NCO models, and establishes preference optimization as a principled framework for combinatorial optimization.
Paperid:2545
Authors:Oleh Rybkin · Michal Nauman · Preston Fu · Charlie Snell · Pieter Abbeel · Sergey Levine · Aviral Kumar
Abstract:
Scaling data and compute is critical in modern machine learning. However, scaling also demands predictability: we want methods to not only perform well with more compute or data, but also have their performance be predictable from low compute or low data runs, without ever running the large-scale experiment. In this paper, we show the predictability of value-based off-policy deep RL. First, we show that data and compute requirements to reach a given performance level lie on a Pareto frontier, controlled by the updates-to-data (UTD) ratio. By estimating this frontier, we can extrapolate data requirements into a higher compute regime, and compute requirements into a higher data regime. Second, we determine the optimal allocation of total budget across data and compute to obtain given performance and use it to determine hyperparameters that maximize performance for a given budget. Third, this scaling behavior is enabled by first estimating predictable relationships between different hyperparameters, which is used to counteract effects of overfitting and plasticity loss unique to RL. We validate our approach using three algorithms: SAC, BRO, and PQL on DeepMind Control, OpenAI Gym, and IsaacGym, when extrapolating to higher levels of data, compute, budget, or performance.
Paperid:2546
Authors:Wenke Huang · Jian Liang · Zekun Shi · Didi Zhu · Guancheng Wan · He Li · Bo Du · Dacheng Tao · Mang Ye
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated strong generalization capabilities across diverse distributions and tasks, largely due to extensive pre-training datasets. Fine-tuning MLLMs has become a common practice to improve performance on specific downstream tasks. However, during fine-tuning, MLLMs often face the risk of forgetting knowledge acquired during pre-training, which can result in a decline in generalization abilities. To balance the trade-off between generalization and specialization, we propose measuring the parameter importance for both pre-trained and fine-tuning distributions, based on frozen pre-trained weight magnitude and accumulated fine-tuning gradient values. We further apply an importance-aware weight allocation strategy, selectively updating relatively important parameters for downstream tasks. We conduct empirical evaluations on both image captioning and visual question-answering tasks using various MLLM architectures. The comprehensive experimental analysis demonstrates the effectiveness of the proposed solution, highlighting the efficiency of the crucial modules in enhancing downstream specialization performance while mitigating generalization degradation in MLLM fine-tuning.
Paperid:2547
Authors:Aravind Rajeswaran · Arjun Majumdar · Mikael Henaff · Oleksandr Maksymets · Krishna Murthy Jatavallabhula · Mrinal Kalakrishnan · Sergio Arnaud · Paul McVay · Alexander Sax · Ada Martin · Ruslan Partsey · Franziska Meier · Nicolas Ballas · Michael Rabbat · Mahmoud Assran · Ayush Jain · Vincent-Pierre Berges · Ang Cao · Abha Gejji · Daniel Dugas · Phillip Thomas · Ishita Prasad
Abstract:
We present Locate 3D, a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp". Locate 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, Locate 3D operates directly on sensor observation streams (posed RGB-D images), enabling real-world deployment on robots and AR devices. Key to our approach is 3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds. The input to 3D-JEPA is a 3D pointcloud, featurized using 2D foundation models (CLIP, DINO). Subsequently, masked prediction in latent space is employed as a pretext task to aid the self-supervised learning of contextualized pointcloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes. Additionally, we introduce LX3D, a new dataset for 3D referential grounding, spanning multiple capture setups with over 130K annotations. This enables a systematic study of generalization capabilities as well as a stronger model.
Paperid:2548
Authors:Nicholas Carlini · Edoardo Debenedetti · Javier Rando · Milad Nasr · Florian Tramer
Abstract:
We introduce AutoAdvExBench, a benchmark to evaluate if large language models (LLMs) can autonomously exploit defenses to adversarial examples. Unlike existing (security) benchmarks that often serve as proxies for real-world tasks, AutoAdvExBench directly measures LLMs' success on tasks regularly performed by machine learning security experts. This approach offers a significant advantage: if an LLM could solve the challenges presented in AutoAdvExBench, it would immediately present practical utility for adversarial machine learning researchers. We then design a strong agent that is capable of breaking 75% of CTF-like ("homework exercise") adversarial example defenses. However, we show that this agent is only able to succeed on 13% of the real-world defenses in our benchmark, indicating the large gap between difficulty in attacking "real" code, and CTF-like code. This benchmark is publicly accessible at [redacted for review].
Paperid:2549
Authors:Roy Betser · Meir Yossef Levi · Guy Gilboa
Abstract:
Likelihood approximations for images are not trivial to compute and can be useful in many applications. We examine the use of Contrastive Language-Image Pre-training (CLIP) to assess the likelihood of images and captions. We introduce Whitened CLIP, a novel transformation of the CLIP latent space via an invertible linear operation. This transformation ensures that each feature in the embedding space has zero mean, unit standard deviation, and no correlation with all other features, resulting in an identity covariance matrix. We show that the whitened embedding statistics can be well approximated as a standard normal distribution, so the log-likelihood is estimated simply by the squared Euclidean norm in the whitened embedding space. The whitening procedure is completely training-free and performed using a pre-computed whitening matrix, hence, is very fast. We present several preliminary experiments demonstrating the properties and applicability of these likelihood scores to images and captions.
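The whitening-plus-squared-norm recipe the abstract describes can be sketched directly. This is an illustrative NumPy version under assumptions: the function names are ours, and ZCA whitening via an eigendecomposition of the sample covariance is one standard way to build the pre-computed whitening matrix; the paper may construct it differently.

```python
import numpy as np

def fit_whitening(E):
    """From embeddings E (n x d), compute the mean and a whitening matrix W
    so that W @ (e - mu) has zero mean and identity covariance (ZCA
    whitening from the eigendecomposition of the sample covariance)."""
    mu = E.mean(axis=0)
    cov = np.cov(E - mu, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    return mu, W

def log_likelihood(e, mu, W):
    """Log-likelihood under the standard-normal approximation in whitened
    space: up to an additive constant, minus half the squared Euclidean
    norm of the whitened embedding."""
    z = W @ (e - mu)
    return -0.5 * float(z @ z)
```

Once `mu` and `W` are pre-computed from a reference set of embeddings, scoring a new image or caption embedding is a single matrix-vector product and a norm, which is what makes the procedure fast.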
Paperid:2550
Authors:Stas Syrota · Yevgen Zainchkovskyy · Johnny Xi · Benjamin Bloem-Reddy · Søren Hauberg
Abstract:
Deep latent variable models learn condensed representations of data that, hopefully, reflect the inner workings of the studied phenomena. Unfortunately, these latent representations are not statistically identifiable, meaning they cannot be uniquely determined. Domain experts, therefore, need to tread carefully when interpreting them. Current solutions limit the lack of identifiability through additional constraints on the latent variable model, e.g. by requiring labeled training data, or by restricting the expressivity of the model. We change the goal: instead of identifying the latent variables, we identify relationships between them such as meaningful distances, angles, and volumes. We prove this is feasible under very mild model conditions and without additional labeled data. We empirically demonstrate that our theory results in more reliable latent distances, offering a principled path forward in extracting trustworthy conclusions from deep latent variable models.
Paperid:2551
Authors:Benedikt Böck · Andreas Oeldemann · Timo Mayer · Francesco Rossetto · Wolfgang Utschick
Abstract:
Learning the distribution of the wireless channel within a specific environment of interest is essential to exploit the full potential of machine learning (ML) for wireless communication and radar applications. Generative modeling offers a promising framework to address this problem. However, existing approaches pose unresolved challenges, including the need for high-quality training data, limited generalizability, and a lack of physical interpretability. To address these issues, we propose a model that combines the physics-related compressibility of wireless channels with sparse Bayesian generative modeling (SBGM) to learn the distribution of the underlying physical channel parameters. By leveraging the sparsity-inducing characteristics of SBGM, our method can learn from compressed observations received by an access point (AP) during default online operation. Moreover, it is physically interpretable and generalizes to arbitrary system configurations without requiring retraining.
Paperid:2552
Authors:Daniel Gomm · Zhan Qu · Michael Färber
Abstract:
Temporal Graph Neural Networks (TGNNs) are widely used to model dynamic systems where relationships and features evolve over time. Although TGNNs demonstrate strong predictive capabilities in these domains, their complex architectures pose significant challenges for explainability. Counterfactual explanation methods provide a promising solution by illustrating how modifications to input graphs can influence model predictions. To address this challenge, we present **CoDy**—**Co**unterfactual Explainer for **Dy**namic Graphs—a model-agnostic, instance-level explanation approach that identifies counterfactual subgraphs to interpret TGNN predictions. CoDy employs a search algorithm that combines Monte Carlo Tree Search with heuristic selection policies, efficiently exploring a vast search space of potential explanatory subgraphs by leveraging spatial, temporal, and local gradient information. Extensive experiments against state-of-the-art factual and counterfactual baselines demonstrate CoDy's effectiveness, achieving average $fidelity_+$ scores up to 6.11 times higher than the baselines. Our code is available at: https://anonymous.4open.science/r/CoDy-F2CF
Paperid:2553
Authors:Bowen Tan · Zheng Xu · Eric Xing · Zhiting Hu · Shanshan Wu
Abstract:
Synthetic data offers a promising path to train models while preserving data privacy. Differentially private (DP) finetuning of large language models (LLMs) as a data generator is effective, but is impractical when computation resources are limited. Meanwhile, prompt-based methods like private evolution depend heavily on manual prompts and use private information ineffectively in their filtering-based process. To overcome these limitations, we propose CTCL (Data Synthesis with Controllability and Clustering), a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale LLM finetuning. CTCL pretrains a lightweight 140M conditional generator and a clustering-based topic model on large-scale public data. To adapt to the private domain, the generator is DP-finetuned on private data for fine-grained textual information, while the topic model extracts a DP histogram representing distributional information. The DP generator then samples according to the DP histogram to synthesize a desired number of examples. Evaluation across five diverse domains demonstrates the effectiveness of our framework, particularly in the strong privacy regime. Further analysis validates the design of each framework component and highlights the scalability of our approach.
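The DP histogram component can be illustrated generically with the Laplace mechanism (a standard DP construction; the paper's exact procedure may differ, and `dp_histogram` is our hypothetical helper):

```python
import numpy as np

def dp_histogram(topic_ids, num_topics, epsilon, rng):
    """Release a differentially private histogram of topic counts via the
    Laplace mechanism. If each record contributes one count, the L1
    sensitivity is 1 and the noise scale is 1/epsilon."""
    counts = np.bincount(topic_ids, minlength=num_topics).astype(float)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=num_topics)
    noisy = np.clip(noisy, 0.0, None)   # counts cannot be negative
    return noisy / noisy.sum()          # normalize to sampling weights

rng = np.random.default_rng(0)
private_topics = rng.integers(0, 5, size=2000)  # toy "private" topic labels
weights = dp_histogram(private_topics, num_topics=5, epsilon=1.0, rng=rng)
samples = rng.choice(5, size=10, p=weights)     # sample topics to condition generation
```

The released weights can then drive the conditional generator without further touching the private data.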
Paperid:2554
Authors:Mingjie Sun · Yida Yin · Zhiqiu (Oscar) Xu · Zico Kolter · Zhuang Liu
Abstract:
In this work, we unveil and study the idiosyncrasies in Large Language Models (LLMs): unique patterns in the outputs of particular LLMs that can be used to distinguish between them. To do so, we consider a simple classification task -- given a particular output sequence, the goal is to predict the source LLM. We consider this synthetic task on several groups of LLMs, and find that simply fine-tuning existing text embedding models on the LLM outputs results in excellent classification accuracy. Notably, we achieve 97.8% accuracy on held-out validation data for a four-way classification task involving GPT-4o, Claude-3.5-Sonnet, Grok-2, and Gemini-1.5. Our further investigation reveals that these idiosyncrasies are heavily rooted in their word-level distributions. Intriguingly, these patterns persist even when the responses are rewritten, translated, or summarized via an external LLM, suggesting they are also encoded in their semantic content. We further leverage LLMs to generate detailed, open-ended descriptions of each model's idiosyncrasy. Finally, we explore the broader implications of our findings, particularly for synthetic data generation and inferring model similarity.
Paperid:2555
Authors:Guangzhi Sun · Xiao Zhan · Shutong Feng · Phil Woodland · Jose Such
Abstract:
Aligning large language models (LLMs) with human values is essential for their safe deployment and widespread adoption. Current LLM safety benchmarks often focus solely on the refusal of individual problematic queries, which overlooks the importance of the context in which the query occurs and may cause undesired refusal of queries under safe contexts, diminishing user experience. Addressing this gap, we introduce CASE-Bench, a Context-Aware SafEty Benchmark that integrates context into safety assessments of LLMs. CASE-Bench assigns distinct, formally described contexts to categorized queries based on Contextual Integrity theory. Additionally, in contrast to previous studies which mainly rely on majority voting from just a few annotators, we recruited enough annotators to ensure the detection of statistically significant differences among the experimental conditions based on power analysis. Our extensive analysis using CASE-Bench on various open-source and commercial LLMs reveals a substantial and significant influence of context on human judgments ($p<0.0001$ from a z-test), underscoring the necessity of context in safety evaluations. We also identify notable mismatches between human judgments and LLM responses, particularly in commercial models within safe contexts. Code and data used in the paper are available at https://anonymous.4open.science/r/CASEBench-D5DB.
Paperid:2556
Authors:Xuying Ning · Dongqi Fu · Tianxin Wei · Wujiang Xu · Jingrui He
Abstract:
Real-world multimodal data, such as e-commerce pages and scientific documents, usually exhibit complex structural relationships beyond traditional one-to-one mappings like image-caption pairs. Entities across modalities interact in intricate ways, with images and text forming diverse interconnections through contextual dependencies and co-references. Graphs provide powerful structural information for modeling intra-modal and inter-modal relationships. However, previous works fail to distinguish multi-hop neighbors and treat the graph as a standalone modality, which fragments the overall understanding. This limitation presents two key challenges in multimodal learning: (1) integrating structural information from multi-hop neighbors into foundational models, and (2) fusing modality-specific information in a principled manner. To address these challenges, we revisit the role of graphs in multimodal learning within the era of foundation models and propose Graph4MM, a graph-based multimodal learning framework. To be specific, we introduce Hop-Diffused Attention, which integrates multi-hop structural information into self-attention through causal masking and hop diffusion, enabling fine-grained relational modeling. Furthermore, we design MM-QFormer, a multi-mapping querying transformer for cross-modal fusion. Through theoretical and empirical analysis, we show that leveraging structures to integrate both intra- and inter-modal interactions improves multimodal understanding beyond treating them as a standalone modality. Experiments on both generative and discriminative tasks show that Graph4MM outperforms larger VLMs, LLMs, and multimodal graph baselines, achieving a 6.93% average improvement, demonstrating the effectiveness of graph integration.
Paperid:2557
Authors:Cosimo Gregucci · Bo Xiong · Daniel Hernández · Lorenzo Loconte · Pasquale Minervini · Steffen Staab · Antonio Vergari
Abstract:
Complex query answering (CQA) on knowledge graphs (KGs) is gaining momentum as a challenging reasoning task. In this paper, we show that the current benchmarks for CQA might not be as complex as we think, as the way they are built distorts our perception of progress in this field. For example, we find that in these benchmarks most queries (up to 98% for some query types) can be reduced to simpler problems, e.g., link prediction, where only one link needs to be predicted. The performance of state-of-the-art CQA models drops significantly when such models are evaluated on queries that cannot be reduced to easier types. Thus, we propose a set of more challenging benchmarks, composed of queries that require models to reason over multiple hops and better reflect the construction of real-world KGs. In a systematic empirical investigation, the new benchmarks show that current CQA methods leave much to be desired.
Paperid:2558
Authors:Yining Pan · Qiongjie Cui · Xulei Yang · Na Zhao
Abstract:
LiDAR-based 3D panoptic segmentation often struggles with the inherent sparsity of data from LiDAR sensors, which makes it challenging to accurately recognize distant or small objects. Recently, a few studies have sought to overcome this challenge by integrating LiDAR inputs with camera images, leveraging the rich and dense texture information provided by the latter. While these approaches have shown promising results, they still face challenges, such as misalignment during data augmentation and the reliance on post-processing steps. To address these issues, we propose Image-Assists-LiDAR (IAL), a novel multi-modal 3D panoptic segmentation framework. In IAL, we first introduce a modality-synchronized data augmentation strategy, PieAug, to ensure alignment between LiDAR and image inputs from the start. Next, we adopt a transformer decoder to directly predict panoptic segmentation results. To effectively fuse LiDAR and image features into tokens for the decoder, we design a Geometric-guided Token Fusion (GTF) module. Additionally, we leverage the complementary strengths of each modality as priors for query initialization through a Prior-based Query Generation (PQG) module, enhancing the decoder’s ability to generate accurate instance masks. Our IAL framework achieves state-of-the-art performance compared to previous multi-modal 3D panoptic segmentation methods on two widely used benchmarks. We will release our code upon paper acceptance.
Paperid:2559
Authors:Xiao Han · Xiangchao Li · Qing Yang
Abstract:
In this paper, we investigate deep feedforward neural networks with random weights. The input data matrix $\boldsymbol{X}$ is drawn from a Gaussian mixture model. We demonstrate that certain eigenvalues of the conjugate kernel and neural tangent kernel may lie outside the support of their limiting spectral measures in the high-dimensional regime. The existence and asymptotic positions of such isolated eigenvalues are rigorously analyzed. Furthermore, we provide a precise characterization of the entrywise limit of the projection matrix onto the eigenspace associated with these isolated eigenvalues. Our findings reveal that the eigenspace captures inherent group features present in $\boldsymbol{X}$. This study offers a quantitative analysis of how group features from the input data evolve through hidden layers in randomly weighted neural networks.
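A toy numerical illustration of the phenomenon this abstract describes, namely an isolated conjugate-kernel eigenvalue whose eigenvector aligns with the mixture's group structure (the parameters and the ReLU/centering choices are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 400, 50, 200
y = np.repeat([1.0, -1.0], n // 2)                    # two hidden groups
X = 5.0 * y[:, None] * np.eye(d)[0] + rng.normal(size=(n, d))  # Gaussian mixture rows

W = rng.normal(size=(m, d)) / np.sqrt(d)              # random first-layer weights
F = np.maximum(X @ W.T, 0.0)                          # ReLU hidden features
F = F - F.mean(axis=0)                                # center the features
K = F @ F.T / m                                       # conjugate kernel (n x n)

eigvals, eigvecs = np.linalg.eigh(K)
top_vec = eigvecs[:, -1]
corr = abs(top_vec @ y) / (np.linalg.norm(top_vec) * np.linalg.norm(y))
gap = eigvals[-1] / eigvals[-2]
# The top eigenvalue separates from the bulk, and its eigenvector
# correlates strongly with the group labels y.
print(round(gap, 1), round(corr, 2))
```

With strong enough class separation the spike is well outside the bulk, matching the qualitative claim about group features being captured by the eigenspace.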
Paperid:2560
Authors:Minqi Yu · Jinduo Liu · Junzhong Ji
Abstract:
Deep models are increasingly used to analyze brain graphs for the diagnosis and understanding of brain diseases. However, due to multi-site data aggregation and individual differences, brain graph datasets exhibit widespread distribution shifts, which impair the model’s generalization ability to the test set, thereby limiting the performance of existing methods. To address these issues, we propose a Causally Invariance-aware Augmentation for brain Graph Contrastive Learning, called CIA-GCL. This method first generates a brain graph by extracting node features based on the topological structure. Then, a learnable brain invariant subgraph is identified based on a causal decoupling approach to capture the maximum label-related invariant information with invariant learning. Around this invariant subgraph, we design a novel invariance-aware augmentation strategy to generate meaningful augmented samples for graph contrastive learning. Finally, the extracted invariant subgraph is utilized for brain disease classification, effectively mitigating distribution shifts while also identifying critical local graph structures, enhancing the model’s interpretability. Experiments on three real-world brain disease datasets demonstrate that our method achieves state-of-the-art performance, effectively generalizes to multi-site brain datasets, and provides a degree of interpretability.
Paperid:2561
Authors:Gaspard Lambrechts · Damien Ernst · Aditya Mahajan
Abstract:
In reinforcement learning for partially observable environments, many successful algorithms were developed within the asymmetric learning paradigm. This paradigm leverages additional state information available at training time for faster learning. Although the proposed learning objectives are usually theoretically sound, these methods still lack a theoretical justification for their potential benefits. We propose such a justification for asymmetric actor-critic algorithms with linear function approximators by adapting a finite-time convergence analysis to this setting. The resulting finite-time bound reveals that the asymmetric critic eliminates an error term arising from aliasing in the agent state.
Paperid:2562
Authors:Saketh Bachu · Erfan Shayegani · Rohit Lal · Trishna Chakraborty · Arindam Dutta · Chengyu Song · Yue Dong · Nael Abu-Ghazaleh · Amit Roy-Chowdhury
Abstract:
Vision-language models (VLMs) have improved significantly in their capabilities, but their complex architecture makes their safety alignment challenging. In this paper, we reveal an uneven distribution of harmful information across the intermediate layers of the image encoder and show that skipping a certain set of layers and exiting early can increase the chance of the VLM generating harmful responses. We call this the “Image enCoder Early-exiT” (ICET) vulnerability. Our experiments across three VLMs: LLaVA-1.5, LLaVA-NeXT, and Llama 3.2 show that performing early exits from the image encoder significantly increases the likelihood of generating harmful outputs. To tackle this, we propose a simple yet effective modification of the Clipped-Proximal Policy Optimization (Clip-PPO) algorithm for performing layer-wise multi-modal RLHF for VLMs. We term this Layer-Wise PPO (L-PPO). We evaluate our L-PPO algorithm across three multi-modal datasets and show that it consistently reduces the harmfulness caused by early exits.
Paperid:2563
Authors:Weijie Tu · Weijian Deng · Dylan Campbell · Yu Yao · Jiyang Zheng · Tom Gedeon · Tongliang Liu
Abstract:
Can the relative performance of a pretrained large multimodal model (LMM) be predicted without access to labels? As LMMs proliferate, it becomes increasingly important to develop efficient ways to choose between them when faced with new data or tasks. The usual approach does the equivalent of giving the models an exam and marking them. We opt to avoid marking and the associated labor of determining the ground-truth answers. Instead, we explore other signals elicited and ascertain how well the models know their own limits, evaluating the effectiveness of these signals at unsupervised model ranking. We evaluate $45$ state-of-the-art LMMs (e.g., LLaVA) across $8$ visual question answering benchmarks, analyzing how well uncertainty-based metrics can predict relative model performance. Our findings show that uncertainty scores derived from softmax distributions provide a robust and consistent basis for ranking models across various tasks. This facilitates the ranking of LMMs on unlabeled data, providing a practical approach for selecting models for diverse target domains without requiring manual annotation.
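A hedged sketch of ranking models by a softmax-derived uncertainty score, with random logits standing in for real LMM outputs (`avg_max_softmax` is our illustrative choice of score, not necessarily the metric used in the paper):

```python
import numpy as np

def avg_max_softmax(logits):
    """Average maximum softmax probability over a batch: a simple
    uncertainty-based confidence score for one model."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return p.max(axis=1).mean()

rng = np.random.default_rng(0)
# Toy stand-ins for two models' answer logits on the same unlabeled questions:
# model A produces sharp (confident) logits, model B diffuse (hesitant) ones.
logits_a = rng.normal(size=(500, 4)) * 4.0
logits_b = rng.normal(size=(500, 4)) * 0.5
scores = {"A": avg_max_softmax(logits_a), "B": avg_max_softmax(logits_b)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['A', 'B'] -- the sharper-logit model ranks first
```

No labels are used anywhere: the ranking comes entirely from each model's own output distribution.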
Paperid:2564
Authors:Quang-Duy Tran · Bao Duong · Phuoc Nguyen · Thin Nguyen
Abstract:
Telling apart the cause and effect between two random variables with purely observational data is a challenging problem that finds applications in various scientific disciplines. A key principle utilized in this task is the algorithmic Markov condition, which postulates that the joint distribution, when factorized according to the causal direction, yields a more succinct codelength compared to the anticausal direction. Previous approaches approximate these codelengths by relying on simple functions or Gaussian processes (GPs) with easily evaluable complexity, trading off model fitness against computational complexity. To overcome these limitations, we propose leveraging the variational Bayesian learning of neural networks as an interpretation of the codelengths. Consequently, we can enhance model fitness while promoting the succinctness of the codelengths, avoiding the significant computational complexity of the GP-based approaches. Extensive experiments on both synthetic and real-world benchmarks in cause-effect identification demonstrate the effectiveness of our proposed method, surpassing the overall performance of related complexity-based and structural causal model regression-based approaches.
Paperid:2565
Authors:Matthew Smart · Alberto Bietti · Anirvan Sengupta
Abstract:
We introduce in-context denoising, a task that refines the connection between attention-based architectures and dense associative memory (DAM) networks, also known as modern Hopfield networks. Using a Bayesian framework, we show theoretically and empirically that certain restricted denoising problems can be solved optimally even by a single-layer transformer. We demonstrate that a trained attention layer processes each denoising prompt by performing a single gradient descent update on a context-aware DAM energy landscape, where context tokens serve as associative memories and the query token acts as an initial state. This one-step update yields better solutions than exact retrieval of either a context token or a spurious local minimum, providing a concrete example of DAM networks extending beyond the standard retrieval paradigm. Overall, this work solidifies the link between associative memory and attention mechanisms first identified by Ramsauer et al., and demonstrates the relevance of associative memory models in the study of in-context learning.
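The one-step-update connection can be checked directly for the standard modern-Hopfield energy of Ramsauer et al.: a single unit-step gradient descent update on the DAM energy, with context tokens as memories and the query as the initial state, reproduces softmax attention over the context tokens (the toy dimensions and inverse temperature below are ours):

```python
import numpy as np

def softmax(v):
    v = v - v.max()
    return np.exp(v) / np.exp(v).sum()

def energy(q, memories, beta):
    """Modern Hopfield / DAM energy with the rows of `memories` as patterns."""
    return -np.log(np.exp(beta * memories @ q).sum()) / beta + 0.5 * q @ q

def energy_grad(q, memories, beta):
    return q - memories.T @ softmax(beta * memories @ q)

rng = np.random.default_rng(0)
memories = rng.normal(size=(6, 4))   # context tokens as stored patterns
q = rng.normal(size=4)               # query token as the initial state
beta = 2.0

# One gradient-descent step with unit step size is exactly softmax attention
# over the context tokens (the gradient's +q term cancels the current state).
one_step = q - energy_grad(q, memories, beta)
attention = memories.T @ softmax(beta * memories @ q)
print(np.allclose(one_step, attention))  # True
```

The update is the concave-convex (CCCP) step for this energy, so it also never increases the energy.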
Paperid:2566
Authors:Yifan Fang · Yifei Fang · Ruizhe Chen · Haote Xu · Xinghao Ding · Yue Huang
Abstract:
Reconstruction-based anomaly detection requires an interpretable theoretical framework to ensure the validity of the detection process. We propose a novel test to detect anomalies in structural images via a discrete Fourier transform (DFT) under a factor model framework. Our test is constructed by comparing the DFT of weighted residuals to the zero spectrum expected under the null hypothesis of no anomaly. We also briefly present the asymptotic theory of our test. Furthermore, our test can detect a class of local anomalies at a rate of $H^{-1/2}W^{-1/2}$, where $H$ and $W$ are the height and width of the image, respectively. Based on our test, we derive a framework for effective reconstruction in unsupervised anomaly detection tasks. Specifically, by applying the proposed factor-model-based framework for anomaly detection, we propose a straightforward approach called the Demeaned Fourier Sparse (DFS) module to construct masks in the Fourier domain and utilize a distribution-free sampling method similar to the bootstrap. This enables both interpretability and effectiveness in our detection approach. Our evaluation demonstrates that the factor-model-based anomaly detection approach successfully generates valid masks for reconstruction-based anomaly detection tasks using the proposed distribution-free sampling technique, and performs anomaly detection efficiently.
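As a generic illustration of testing residual spectra against a white-noise null (not the authors' exact weighted-residual statistic), one can compare the maximum normalized periodogram ordinate of a residual image against a tail quantile for the maximum of approximately exponential ordinates:

```python
import numpy as np

def max_periodogram(residual):
    """Largest normalized periodogram ordinate of a residual image. Under a
    white-noise null each non-DC ordinate is roughly Exp(1) after
    normalization, so a very large maximum signals structured anomaly energy."""
    r = residual - residual.mean()
    spec = np.abs(np.fft.fft2(r)) ** 2 / (r.size * r.var())
    return spec.ravel()[1:].max()        # drop the DC term

rng = np.random.default_rng(0)
clean = rng.normal(size=(64, 64))        # residuals with no anomaly
anomalous = clean.copy()
anomalous[20:32, 20:32] += 3.0           # a small local structural defect

n = clean.size - 1
threshold = -np.log(1.0 - 0.99 ** (1.0 / n))  # ~1% level for the max of n Exp(1)
print(max_periodogram(clean) < threshold, max_periodogram(anomalous) > threshold)
```

The localized defect concentrates energy in a few low-frequency ordinates, pushing the maximum far above the null threshold.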
Paperid:2567
Authors:Nico Meyer · Christian Ufrecht · George Yammine · Georgios Kontes · Christopher Mutschler · Daniel Scherer
Abstract:
Benchmarking and establishing proper statistical validation metrics for reinforcement learning (RL) remain ongoing challenges, with no consensus established yet. The emergence of quantum computing and its potential applications in quantum reinforcement learning (QRL) further complicate benchmarking efforts. To enable valid performance comparisons and to streamline current research in this area, we propose a novel benchmarking methodology, based on a statistical estimator for sample complexity and a definition of statistical outperformance. Furthermore, our methodology casts doubt on some previous claims regarding the superiority of QRL. We conducted experiments on a novel benchmarking environment with flexible levels of complexity. While we still identify possible advantages, our findings are more nuanced overall. We discuss the potential limitations of these results and explore their implications for empirical research on quantum advantage in QRL.
Paperid:2568
Authors:Lan Feng · Fan Nie · Yuejiang Liu · Alexandre Alahi
Abstract:
We propose TAROT, a targeted data selection framework grounded in Optimal Transport theory. Previous targeted data selection methods primarily rely on influence-based greedy heuristics to enhance domain-specific performance. While effective on limited, unimodal data (i.e., data following a single pattern), these methods struggle as target data complexity increases. Specifically, in multimodal distributions, such heuristics fail to account for multiple inherent patterns, leading to suboptimal data selection. This work identifies two primary limitations: (i) the disproportionate impact of dominant feature components in high-dimensional influence estimation, and (ii) the restrictive linear additive assumptions in greedy selection strategies. To address these challenges, TAROT incorporates whitened feature distance to mitigate dominant feature bias, offering a more reliable measure of data influence. Building on this, TAROT leverages whitened feature distance to quantify and minimize the optimal transport distance between selected data and target domains. Notably, this minimization also facilitates the estimation of optimal selection ratios. We evaluate TAROT across multiple tasks, including semantic segmentation, motion prediction, and instruction tuning. Results consistently show that TAROT outperforms state-of-the-art methods, demonstrating its versatility across various deep learning tasks. Code is available in the supplementary.
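A rough sketch of combining a whitened feature distance with an exact OT cost between equal-size point clouds (brute-force matching over tiny sets here; the paper's estimator and scale surely differ, and both helper functions are our own):

```python
import numpy as np
from itertools import permutations

def fit_whitener(F, eps=1e-6):
    """Fit a ZCA whitening transform so dominant feature components do not
    swamp the distance computation. Returns a function applying it."""
    mu = F.mean(axis=0)
    cov = np.cov(F - mu, rowvar=False) + eps * np.eye(F.shape[1])
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return lambda G: (G - mu) @ W

def ot_distance(A, B):
    """Exact OT cost between two equal-size uniform point clouds: a
    minimum-cost perfect matching (brute force; fine only for tiny n)."""
    cost = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    rows = np.arange(len(A))
    return min(cost[rows, list(p)].mean() for p in permutations(rows))

rng = np.random.default_rng(0)
target = rng.normal(size=(6, 3)) + 3.0   # target-domain features
near = rng.normal(size=(6, 3)) + 3.0     # candidate pool close to the target
far = rng.normal(size=(6, 3)) - 3.0      # candidate pool far from the target
whiten = fit_whitener(np.vstack([target, near, far]))
d_near = ot_distance(whiten(near), whiten(target))
d_far = ot_distance(whiten(far), whiten(target))
print(d_near < d_far)  # the nearby pool has the smaller transport cost
```

Selection then amounts to preferring candidate subsets with the smaller whitened OT cost to the target; practical implementations would use a scalable OT solver rather than enumeration.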
Paperid:2569
Authors:YuQing Xie · Ameya Daigavane · Mit Kotak · Tess Smidt
Abstract:
$E(3)$-equivariant neural networks have demonstrated success across a wide range of 3D modelling tasks. A fundamental operation in these networks is the tensor product, which interacts two geometric features in an equivariant manner to create new features. Due to the high computational complexity of the tensor product, significant effort has been invested to optimize the runtime of this operation. For example, Luo et al. recently proposed the Gaunt tensor product (GTP), which promises a significant speedup. In this work, we provide a careful, systematic analysis of a number of tensor product operations. In particular, we emphasize that different tensor products are not performing the same operation, and the reported speedups typically come at the cost of expressivity. We introduce measures of expressivity and interactability to characterize these differences. In addition, we show that the original implementation of GTP can be greatly simplified by directly using a spherical grid, at no cost in asymptotic runtime. This spherical grid approach is faster on our benchmarks and speeds up actual training of the MACE interatomic potential by 30%. Finally, we provide the first systematic microbenchmarks of the various tensor product operations. We find that the theoretical runtime guarantees can differ wildly from empirical performance, demonstrating the need for careful application-specific benchmarking.
Paperid:2570
Authors:Montgomery Bohde · Mrunali Manjrekar · Runzhong Wang · Shuiwang Ji · Connor Coley
Abstract:
Mass spectrometry plays a fundamental role in elucidating the structures of unknown molecules and subsequent scientific discoveries. One formulation of the structure elucidation task is the conditional de novo generation of molecular structure given a mass spectrum. Toward a more accurate and efficient scientific discovery pipeline for small molecules, we present DiffMS, a formula-restricted encoder-decoder generative network that achieves state-of-the-art performance on this task. The encoder utilizes a transformer architecture and models mass spectra domain knowledge such as peak formulae and neutral losses, and the decoder is a discrete graph diffusion model restricted by the heavy-atom composition of a known chemical formula. To develop a robust decoder that bridges latent embeddings and molecular structures, we pretrain the diffusion decoder with fingerprint-structure pairs, which are available in virtually infinite quantities, compared to structure-spectrum pairs that number in the tens of thousands. Extensive experiments on established benchmarks show that DiffMS outperforms existing models on de novo molecule generation. We provide several ablations to demonstrate the effectiveness of our diffusion and pretraining approaches and show consistent performance scaling with increasing pretraining dataset size.
Paperid:2571
Authors:Jiayi Xin · Sukwon Yun · Jie Peng · Inyoung Choi · Jenna Ballard · Tianlong Chen · Qi Long
Abstract:
Modality fusion is a cornerstone of multimodal learning, enabling information integration from diverse data sources. However, existing approaches are limited by $\textbf{(a)}$ their focus on modality correspondences, which neglects heterogeneous interactions between modalities, and $\textbf{(b)}$ the fact that they output a single multimodal prediction without offering interpretable insights into the multimodal interactions present in the data. In this work, we propose $\texttt{I$^2$MoE}$ ($\underline{I}$nterpretable Multimodal $\underline{I}$nteraction-aware $\underline{M}$ixture-$\underline{o}$f-$\underline{E}$xperts), an end-to-end MoE framework designed to enhance modality fusion by explicitly modeling diverse multimodal interactions, as well as providing interpretation on a local and global level. First, $\texttt{I$^2$MoE}$ utilizes different interaction experts with weakly supervised interaction losses to learn multimodal interactions in a data-driven way. Second, $\texttt{I$^2$MoE}$ deploys a reweighting model that assigns importance scores for the output of each interaction expert, which offers sample-level and dataset-level interpretation. Extensive evaluation of medical and general multimodal datasets shows that $\texttt{I$^2$MoE}$ is flexible enough to be combined with different fusion techniques, consistently improves task performance, and provides interpretation across various real-world scenarios.
Paperid:2572
Authors:Jiaxin Ye · Boyuan Cao · Hongming Shan
Abstract:
How much can we infer about an emotional voice solely from an expressive face? This intriguing question holds great potential for applications such as virtual character dubbing and aiding individuals with expressive language disorders. Existing face-to-speech methods offer great promise in capturing identity characteristics but struggle to generate diverse vocal styles with emotional expression. In this paper, we explore a new task, termed emotional face-to-speech, aiming to synthesize emotional speech directly from expressive facial cues. To that end, we introduce DEmoFace, a novel generative framework that leverages a discrete diffusion transformer (DiT) with curriculum learning, built upon a multi-level neural audio codec. Specifically, we propose multimodal DiT blocks to dynamically align text and speech while tailoring vocal styles based on facial emotion and identity. To enhance training efficiency and generation quality, we further introduce a coarse-to-fine curriculum learning algorithm for multi-level token processing. In addition, we develop an enhanced predictor-free guidance to handle diverse conditioning scenarios, enabling multi-conditional generation and disentangling complex attributes effectively. Extensive experimental results demonstrate that DEmoFace generates more natural and consistent speech compared to baselines, even surpassing speech-driven methods.
Paperid:2573
Authors:Jan Ole Ernst · Aniket Chatterjee · Tim Franzmeyer · Axel Kuhn
Abstract:
Quantum optimal control is concerned with the realisation of desired dynamics in quantum systems, serving as a linchpin for advancing quantum technologies and fundamental research. Analytic approaches and standard optimisation algorithms do not yield satisfactory solutions for large quantum systems, and especially not for real-world quantum systems, which are open and noisy. We devise a physics-informed Reinforcement Learning (RL) algorithm that restricts the space of possible solutions. We incorporate priors about the desired time scales of the quantum state dynamics -- as well as realistic control signal limitations -- as constraints to the RL algorithm. These physics-informed constraints additionally improve computational scalability by facilitating parallel optimisation. We evaluate our method on three broadly relevant quantum systems and incorporate real-world complications, arising from dissipation and control signal perturbations. We achieve both higher fidelities -- which exceed 0.999 across all systems -- and better robustness to time-dependent perturbations and experimental imperfections than previous methods. Lastly, we demonstrate that incorporating multi-step feedback can yield solutions robust even to strong perturbations.
Paperid:2574
Authors:Zhiyong Wang · Jiahang Sun · Mingze Kong · Jize Xie · Qinghua Hu · John C. S. Lui · Zhongxiang Dai
Abstract:
The contextual multi-armed bandit (MAB) is a widely used framework for problems requiring sequential decision-making under uncertainty, such as recommendation systems. In applications involving a large number of users, the performance of contextual MAB can be significantly improved by facilitating collaboration among multiple users. This has been achieved by the clustering of bandits (CB) methods, which adaptively group the users into different clusters and achieve collaboration by allowing the users in the same cluster to share data. However, classical CB algorithms typically rely on numerical reward feedback, which may not be practical in certain real-world applications. For instance, in recommendation systems, it is more realistic and reliable to solicit preference feedback between pairs of recommended items rather than absolute rewards. To address this limitation, we introduce the first "clustering of dueling bandit algorithms" to enable collaborative decision-making based on preference feedback. We propose two novel algorithms: (1) Clustering of Linear Dueling Bandits (COLDB), which models the user reward functions as linear functions of the context vectors, and (2) Clustering of Neural Dueling Bandits (CONDB), which uses a neural network to model complex, non-linear user reward functions. Both algorithms are supported by rigorous theoretical analyses, demonstrating that user collaboration leads to improved regret bounds. Extensive empirical evaluations on synthetic and real-world datasets further validate the effectiveness of our methods, establishing their potential in real-world applications involving multiple users with preference-based feedback.
Paperid:2575
Authors:Yongxian Wei · Anke Tang · Li Shen · Zixuan Hu · Chun Yuan · Xiaochun Cao
Abstract:
Merging multiple expert models offers a promising approach for performing multi-task learning without accessing their original data. Existing methods attempt to alleviate task conflicts by sparsifying task vectors or promoting orthogonality among them. However, they overlook the fundamental requirement of model merging: ensuring the merged model performs comparably to task-specific models on respective tasks. We find these methods inevitably discard task-specific information that, while causing conflicts, is crucial for performance. Based on our findings, we frame model merging as a constrained optimization problem ($\textit{i.e.}$, minimizing the gap between the merged model and individual models, subject to the constraint of retaining shared knowledge) and solve it via adaptive projective gradient descent. Specifically, we align the merged model with individual models by decomposing and reconstituting the loss function, alleviating conflicts through $\textit{data-free}$ optimization of task vectors. To retain shared knowledge, we optimize this objective by projecting gradients within a $\textit{shared subspace}$ spanning all tasks. Moreover, we view merging coefficients as adaptive learning rates and propose a task-aware, training-free strategy. Experiments show that our plug-and-play approach consistently outperforms previous methods, achieving state-of-the-art results across diverse architectures and tasks in both vision and NLP domains.
Paperid:2576
Authors:Katherine Tieu · Dongqi Fu · Zihao Li · Ross Maciejewski · Jingrui He
Abstract:
Accurate predictions rely on the expressive power of graph deep learning frameworks like graph neural networks and graph transformers, where a positional encoding mechanism has become indispensable in recent state-of-the-art (SOTA) works to record canonical position information. However, current positional encodings are limited in at least three aspects: (1) most positional encodings are pre-defined, fixed functions, which are inadequate for adapting to complex attributed graphs; (2) a few pioneering works propose learnable positional encodings, but these are still limited to structural information, leaving real-world time-evolving topological and feature information untouched; (3) most positional encodings must be paired with a transformer's attention mechanism to fully release their power, where dense or relational attention is often unaffordable on large-scale structured data. Hence, we study the possibility of Learnable Spatial-Temporal Positional Encoding in an effective and efficient manner and propose a simple temporal link prediction model named L-STEP. Briefly, for L-STEP, we (1) prove the proposed positional learning scheme preserves graph properties from a spatial-temporal spectral viewpoint, (2) verify that MLPs can fully exploit the expressiveness of that encoding and reach Transformers' performance, (3) vary the initial positional encoding inputs to show robustness, (4) analyze the theoretical complexity and obtain lower empirical running time than SOTA methods, and (5) demonstrate its superior temporal link prediction performance on 13 classic datasets against 10 algorithms in both transductive and inductive settings, using 3 different sampling strategies. L-STEP also obtains leading performance on the newest large-scale TGB benchmark.
Paperid:2577
Authors:Dongyang Fan · Bettina Messmer · Nikita Doikov · Martin Jaggi
Abstract:
On-device LLMs have gained increasing attention for their ability to enhance privacy and provide a personalized user experience. To facilitate private learning with scarce data, Federated Learning has become a standard approach. However, it faces challenges such as computational resource heterogeneity and data heterogeneity among end users. We propose CoMiGS ($\textbf{Co}$llaborative learning with a $\textbf{Mi}$xture of $\textbf{G}$eneralists and $\textbf{S}$pecialists), the first approach to address both challenges. A key innovation of our method is the bi-level optimization formulation of the Mixture-of-Experts learning objective, where the router is optimized using a separate validation set to ensure alignment with the target distribution. We solve our objective with alternating minimization, for which we provide a theoretical analysis. Our method shares generalist experts across users while localizing a varying number of specialist experts, thereby adapting to users’ computational resources and preserving privacy. Through extensive experiments, we show CoMiGS effectively balances general and personalized knowledge for each token generation. We demonstrate that CoMiGS remains robust against overfitting—due to the generalists' regularizing effect—while adapting to local data through specialist expertise. We open source our codebase for collaborative LLMs.
Paperid:2578
Authors:Arthur Deng · Fang Wu · Karsten Householder · K. Garcia · Brian Trippe
Abstract:
Accurate estimation of mutational effects on protein-protein binding energies is an open problem with applications in structural biology and therapeutic design. Several deep learning methods for this task have been proposed, but they underperform more classical strategies based on empirical force-fields. In response, we introduce new transfer-learning techniques that allow advances in protein sequence modeling and folding stability prediction to enhance the accuracy of binding energy estimation and refinement. In particular, we propose to parameterize binding energy prediction as the difference of the predicted folding energy of the protein complex from the sum of the folding energies of the binding partners. We show that using a pre-trained inverse-folding model as a proxy for folding energy provides strong zero-shot performance, and can be fine-tuned with (1) copious folding energy measurements and (2) more-limited binding energy measurements. The resulting predictor, StaB-ddG, has desirable thermodynamic properties and delivers more accurate predictions than previous methods for protein-protein interaction energetics.
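The parameterization described above is a simple energy difference. The following toy sketch is purely illustrative: `folding_energy` here is a hypothetical stand-in for the score of a pre-trained inverse-folding model, not the authors' code.

```python
# Illustrative sketch only: `folding_energy` is a hypothetical stand-in for
# a pre-trained inverse-folding model used as a folding-energy proxy.

def folding_energy(sequence: str) -> float:
    # Toy proxy; a real system would score the sequence with a learned model.
    return -0.1 * len(sequence)

def binding_energy(complex_seq: str, partner_a: str, partner_b: str) -> float:
    # dG_bind = dG_fold(complex) - (dG_fold(A) + dG_fold(B))
    return folding_energy(complex_seq) - (
        folding_energy(partner_a) + folding_energy(partner_b)
    )

def ddg(mutant, wildtype) -> float:
    # Mutational effect on binding: ddG = dG_bind(mutant) - dG_bind(wild type)
    return binding_energy(*mutant) - binding_energy(*wildtype)
```

Under this parameterization, any improvement to the folding-energy proxy (e.g. fine-tuning on folding measurements) transfers directly to the binding-energy estimate.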
Paperid:2579
Authors:Xin Xu · Qiyun Xu · Tong Xiao · Tianhao Chen · Yuchen Yan · Jiaxin ZHANG · shizhe diao · Can Yang · Yang Wang
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities in solving complex reasoning tasks, particularly in mathematics. However, the domain of physics reasoning presents unique challenges that have received significantly less attention. Existing benchmarks often fall short in evaluating LLMs’ abilities on the breadth and depth of undergraduate-level physics, underscoring the need for a comprehensive evaluation. To fill this gap, we introduce UGPhysics, a large-scale and diverse benchmark specifically designed to evaluate UnderGraduate-level Physics (UGPhysics) reasoning with LLMs. UGPhysics includes 5,520 undergraduate-level physics problems in both English and Chinese across 13 subjects with seven different answer types and four distinct physics reasoning skills, all rigorously screened for data leakage. Additionally, we develop a Model-Assistant Rule-based Judgment (MARJ) pipeline specifically tailored for assessing physics problems, ensuring accurate evaluation. Our evaluation of 31 leading LLMs shows that the highest overall accuracy, 49.8% (achieved by OpenAI-o1-mini), emphasizes the need for models with stronger physics reasoning skills, beyond math abilities. We hope UGPhysics, along with MARJ, will drive future advancements in AI for physics reasoning.
Paperid:2580
Authors:Aditya Desai · Shuo Yang · Alejandro Cuadron · Matei Zaharia · Joseph E Gonzalez · Ion Stoica
Abstract:
Leveraging long contexts is crucial for advanced AI systems, but attention computation poses a scalability challenge. While scaled dot-product attention (SDPA) exhibits token sparsity, i.e. only a few pivotal tokens significantly contribute to output, exploiting this sparsity remains challenging. Existing methods either suffer from quality degradation or require substantial additional resources. We show that identifying pivotal tokens is a Maximum Inner Product Search (MIPS) problem. However, existing MIPS solutions are not well-suited for SDPA, as they are not GPU-friendly and often underperform due to the separated query and key distributions. This paper introduces HashAttention, framing pivotal token identification as a recommendation problem. Given a query, HashAttention encodes keys and queries in Hamming space, capturing the required semantic similarity, using learned mapping functions. HashAttention efficiently identifies pivotal tokens for a given query using bitwise operations and computes attention using only these tokens, improving the overall attention efficiency. Trained on generic data, HashAttention reduces tokens used by up to $16\times$ with minimal quality loss, requiring only 32 bits of auxiliary memory per token. Sparsity can be further improved to $32\times$ through task-specific fine-tuning. On A100 GPU, at $32\times$ sparsity, incorporating HashAttention reduces attention latency by up to $4.3\times$ in GPT-FAST and $2.54\times$ in FlashDecode, and achieves up to $3.12\times$ higher throughput for GPT-FAST.
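As a rough illustration of the Hamming-space idea (not the paper's method: random hyperplanes stand in for the learned mapping functions), pivotal-token selection via bitwise operations might look like:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_keys, n_bits = 16, 8, 32
keys = rng.normal(size=(n_keys, d))
query = rng.normal(size=d)
planes = rng.normal(size=(n_bits, d))  # random stand-in for learned mappings

def signature(x: np.ndarray) -> int:
    # Pack the sign pattern of x against the hyperplanes into one integer,
    # i.e. a 32-bit auxiliary code per token.
    bits = planes @ x > 0
    return sum(int(b) << i for i, b in enumerate(bits))

key_sigs = [signature(k) for k in keys]
q_sig = signature(query)

# Hamming distance via XOR + popcount; attend only to the closest (pivotal) keys.
dists = [bin(s ^ q_sig).count("1") for s in key_sigs]
pivotal = np.argsort(dists)[:2]
```

Full attention would then be computed over `pivotal` only; the learned mappings in the paper are what make the bit codes track actual attention scores.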
Paperid:2581
Authors:Hongyao Tang · Johan Obando Ceron · Pablo Samuel Castro · Aaron Courville · Glen Berseth
Abstract:
Plasticity, or the ability of an agent to adapt to new tasks, environments, or distributions, is crucial for continual learning. In this paper, we study the loss of plasticity in deep continual RL through the lens of churn: network output variability induced by the data in each training batch. We demonstrate that (1) the loss of plasticity is accompanied by the exacerbation of churn due to the gradual rank decrease of the Neural Tangent Kernel (NTK) matrix; (2) reducing churn helps prevent rank collapse and adjusts the step size of regular RL gradients adaptively. Moreover, we introduce Continual Churn Approximated Reduction (CCHAIN) and demonstrate that it improves learning performance and outperforms baselines in a range of continual learning environments on the OpenAI Gym Control and ProcGen benchmarks.
Paperid:2582
Authors:Yang Zheng · Wen Li · Zhaoqiang Liu
Abstract:
Inverse problems (IPs) involve reconstructing signals from noisy observations. Traditional approaches often rely on handcrafted priors, which can fail to capture the complexity of realworld data. The advent of pre-trained generative models has introduced new paradigms, offering improved reconstructions by learning rich priors from data. Among these, diffusion models (DMs) have emerged as a powerful framework, achieving remarkable reconstruction performance across numerous IPs. However, existing DM-based methods frequently encounter issues such as heavy computational demands and suboptimal convergence. In this work, building upon the idea of the recent work DMPlug~\cite{wang2024dmplug}, we propose two novel methods, DMILO and DMILO-PGD, to address these challenges. Our first method, DMILO, employs intermediate layer optimization (ILO) to alleviate the memory burden inherent in DMPlug. Additionally, by introducing sparse deviations, we expand the range of DMs, enabling the exploration of underlying signals that may lie outside the range of the diffusion model. We further propose DMILO-PGD, which integrates ILO with projected gradient descent (PGD), thereby reducing the risk of suboptimal convergence. We provide an intuitive theoretical analysis of our approach under appropriate conditions and validate its superiority through extensive experiments on diverse image datasets, encompassing both linear and nonlinear IPs. Our results demonstrate significant performance gains over state-of-the-art methods, highlighting the effectiveness of DMILO and DMILO-PGD in addressing common challenges in DM-based IP solvers.
Paperid:2583
Authors:Lakshmi Nair · Ian Trase · J. Kim
Abstract:
We present a novel reasoning approach called Flow-of-Options (FoO), designed to address intrinsic biases in Large Language Models (LLMs). FoO enables LLMs to systematically explore a diverse range of possibilities in their reasoning, as demonstrated by an FoO-based agentic system for autonomously solving Machine Learning tasks. Our framework outperforms state-of-the-art baselines, achieving improvements of 38.2% -- 69.2% on standard data science tasks, and 37.4% -- 47.9% on therapeutic chemistry tasks. With an overall operation cost under $1 per task, our framework is well-suited for cost-sensitive applications. Beyond classification and regression tasks, we illustrate the broader applicability of our FoO-based agentic system to additional tasks such as reinforcement learning and image generation.
Paperid:2584
Authors:Junbo Yin · Chao Zha · Wenjia He · Chencheng Xu · Xin Gao
Abstract:
Existing PLMs generate protein sequences based on a single-condition constraint from a specific modality, struggling to simultaneously satisfy multiple constraints across different modalities. In this work, we introduce CFP-GEN, a novel diffusion language model for Combinatorial Functional Protein GENeration. CFP-GEN facilitates de novo protein design by integrating multimodal conditions with functional, sequence, and structural constraints. Specifically, an Annotation-Guided Feature Modulation (AGFM) module is introduced to dynamically adjust the protein feature distribution based on composable functional annotations, e.g., GO terms, IPR domains and EC numbers. Meanwhile, the Residue-Controlled Functional Encoding (RCFE) module captures residue-wise interaction to ensure more precise control. Additionally, off-the-shelf 3D structure encoders can be seamlessly integrated to impose geometric constraints. We demonstrate that CFP-GEN enables high-throughput generation of novel proteins with functionality comparable to natural proteins, while achieving a high success rate in designing multifunctional proteins.
Paperid:2585
Authors:Weikang Qiu · Zheng Huang · Haoyu Hu · Aosong Feng · Yujun Yan · ZHITAO YING
Abstract:
Decoding functional magnetic resonance imaging (fMRI) signals into text has been a key challenge in the neuroscience community, with the potential to advance brain-computer interfaces and uncover deeper insights into brain mechanisms. However, existing approaches often struggle with suboptimal predictive performance, limited task variety, and poor generalization across subjects. In response to this, we propose MindLLM, a model designed for subject-agnostic and versatile fMRI-to-text decoding. MindLLM consists of an fMRI encoder and an off-the-shelf LLM. The fMRI encoder employs a neuroscience-informed attention mechanism, which is capable of accommodating subjects with varying input shapes and thus achieves high-performance subject-agnostic decoding. Moreover, we introduce Brain Instruction Tuning (BIT), a novel approach that enhances the model’s ability to capture diverse semantic representations from fMRI signals, facilitating more versatile decoding. We evaluate MindLLM on comprehensive fMRI-to-text benchmarks. Results demonstrate that our model outperforms the baselines in downstream tasks, unseen subject generalization, and novel task adaptation. Furthermore, the attention patterns in MindLLM provide interpretable insights into its decision-making process.
Paperid:2586
Authors:Thibault de Surrel · Fabien Lotte · Sylvain Chevallier · Florian Yger
Abstract:
Circular and non-flat data distributions are prevalent across diverse domains of data science, yet their specific geometric structures often remain underutilized in machine learning frameworks. A principled approach to accounting for the underlying geometry of such data is pivotal, particularly when extending statistical models like the pervasive Gaussian distribution. In this work, we tackle these issues by focusing on the manifold of symmetric positive definite (SPD) matrices, a key object in information geometry. We introduce a non-isotropic wrapped Gaussian by leveraging the exponential map, derive theoretical properties of this distribution, and propose a maximum likelihood framework for parameter estimation. Furthermore, we reinterpret established classifiers on SPD matrices through a probabilistic lens and introduce new classifiers based on the wrapped Gaussian model. Experiments on synthetic and real-world datasets demonstrate the robustness and flexibility of this geometry-aware distribution, underscoring its potential to advance manifold-based data analysis. This work lays the groundwork for extending classical machine learning and statistical methods to more complex and structured data.
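A minimal sketch of the wrapping construction, under our own simplifying assumptions (2x2 SPD matrices, the affine-invariant exponential map Exp_P(V) = P^{1/2} exp(P^{-1/2} V P^{-1/2}) P^{1/2}, and a simple symmetric-Gaussian tangent vector rather than the paper's exact parameterization):

```python
import numpy as np

def sym_apply(A, f):
    # Apply a scalar function to a symmetric matrix via eigendecomposition.
    w, V = np.linalg.eigh(A)
    return V @ np.diag(f(w)) @ V.T

rng = np.random.default_rng(0)
P = np.array([[2.0, 0.5], [0.5, 1.0]])  # base SPD point (the "mean")
P_half = sym_apply(P, np.sqrt)
P_half_inv = np.linalg.inv(P_half)

def sample_wrapped(scale=0.3):
    # Draw a symmetric Gaussian tangent vector and wrap it onto the manifold
    # through the exponential map at P; the result is again SPD by construction.
    A = rng.normal(scale=scale, size=(2, 2))
    V = (A + A.T) / 2
    inner = sym_apply(P_half_inv @ V @ P_half_inv, np.exp)
    return P_half @ inner @ P_half

X = sample_wrapped()
```

Because the sample is the image of a tangent vector under the exponential map, it stays on the SPD manifold no matter how the tangent covariance is chosen, which is what makes a non-isotropic extension natural.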
Paperid:2587
Authors:Qian Qi
Abstract:
We establish a continuous-time framework for analyzing Deep Q-Networks (DQNs) via stochastic control and Forward-Backward Stochastic Differential Equations (FBSDEs). Considering a continuous-time Markov Decision Process (MDP) driven by a square-integrable martingale, we analyze DQN approximation properties. We show that DQNs can approximate the optimal Q-function on compact sets with arbitrary accuracy and high probability, leveraging residual network approximation theorems and large deviation bounds for the state-action process. We then analyze the convergence of a general Q-learning algorithm for training DQNs in this setting, adapting stochastic approximation theorems. Our analysis emphasizes the interplay between DQN layer count, time discretization, and the role of viscosity solutions in addressing potential non-smoothness of the optimal Q-function. This work bridges deep reinforcement learning and stochastic control, offering insights into DQNs in continuous-time settings, relevant for applications with physical systems or high-frequency data.
Paperid:2588
Authors:Gonçalo Paulo · Alex Mallen · Caden Juang · Nora Belrose
Abstract:
While the activations of neurons in deep neural networks usually do not have a simple human-understandable interpretation, sparse autoencoders (SAEs) can be used to transform these activations into a higher-dimensional latent space which can be more easily interpretable. However, SAEs can have millions of distinct latents, making it infeasible for humans to manually interpret each one. In this work, we build an open-source automated pipeline to generate and evaluate natural language interpretations for SAE latents using LLMs. We test our framework on SAEs of varying sizes, activation functions, and losses, trained on two different open-weight LLMs. We introduce five new techniques to score the quality of interpretations that are cheaper to run than the previous state of the art. One of these techniques, intervention scoring, evaluates the interpretability of the effects of intervening on a latent, which we find explains latents that are not recalled by existing methods. We propose guidelines for generating better interpretations that remain valid for a broader set of activating contexts, and discuss pitfalls with existing scoring techniques. Our code is available at https://anonymous.4open.science/r/interpreting_latents/.
Paperid:2589
Authors:Ilias Diakonikolas · Mingchen Ma · Lisheng Ren · Christos Tzamos
Abstract:
We study the task of Multiclass Linear Classification (MLC) in the distribution-free PAC model with Random Classification Noise (RCN). Specifically, the learner is given a set of labeled examples $(x, y)$, where $x$ is drawn from an unknown distribution on $R^d$ and the labels are generated by a multiclass linear classifier corrupted with RCN. That is, the label $y$ is flipped from $i$ to $j$ with probability $H_{ij}$ according to a known noise matrix $H$ with non-negative separation $\sigma := \min_{i \neq j} H_{ii}-H_{ij}$. The goal is to compute a hypothesis with small 0-1 error. For the special case of two labels, prior work has given polynomial-time algorithms achieving the optimal error. Surprisingly, little is known about the complexity of this task even for three labels. As our main contribution, we show that the complexity of MLC with RCN becomes drastically different in the presence of three or more labels. Specifically, we prove super-polynomial Statistical Query (SQ) lower bounds for this problem. In more detail, even for three labels and constant separation, we give a super-polynomial lower bound on the complexity of any SQ algorithm achieving optimal error. For a larger number of labels and smaller separation, we show a super-polynomial SQ lower bound even for the weaker goal of achieving any constant factor approximation to the optimal loss or even beating the trivial hypothesis.
Paperid:2590
Authors:George Orfanides · Tim Hoheisel · Marwa El Halabi
Abstract:
Submodular functions, defined on continuous or discrete domains, arise in numerous applications. We study the minimization of the difference of submodular (DS) functions over both domains, extending prior work restricted to set functions. We show that all functions on discrete domains and all smooth functions on continuous domains are DS. For discrete domains, we observe that DS minimization is equivalent to minimizing the difference of two convex (DC) functions, as in the set function case. We propose a novel variant of the DC Algorithm (DCA) and apply it to the resulting DC Program, obtaining comparable theoretical guarantees as in the set function case. The algorithm can be applied to continuous domains via discretization. Experiments demonstrate that our method outperforms baselines in integer compressive sensing and integer least squares.
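The reduction to a DC program can be illustrated with the textbook DCA iteration (a generic one-dimensional sketch, not the paper's variant): linearize the concave part at the current iterate, then solve the resulting convex subproblem.

```python
import numpy as np

def dca(grad_h, argmin_subproblem, x0, iters=60):
    # DCA for f = g - h with g, h convex: set y_k = grad h(x_k), then
    # x_{k+1} = argmin_x g(x) - y_k * x (a convex subproblem).
    x = x0
    for _ in range(iters):
        x = argmin_subproblem(grad_h(x))
    return x

# Nonconvex example: f(x) = x^4 - 2x^2, split as g(x) = x^4, h(x) = 2x^2.
# The subproblem argmin_x x^4 - y*x solves 4x^3 = y, i.e. x = cbrt(y/4).
x_star = dca(grad_h=lambda x: 4.0 * x,
             argmin_subproblem=lambda y: np.cbrt(y / 4.0),
             x0=0.5)
# the iterates climb toward the stationary point x = 1, a minimizer of f
```

Each DCA step decreases f, which is where the algorithm's monotonic-descent guarantee comes from; the paper's contribution is a variant of this scheme with guarantees for the DS setting.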
Paperid:2591
Authors:Zhe Wang · Jiaxin Shi · Nicolas Heess · Arthur Gretton · Michalis Titsias
Abstract:
Autoregressive models (ARMs) have become the workhorse for sequence generation tasks, since many problems can be modeled as next-token prediction. While there appears to be a natural ordering for text (i.e., left-to-right), for many data types, such as graphs, the canonical ordering is less obvious. To address this problem, we introduce a variant of ARM that generates high-dimensional data using a probabilistic ordering that is sequentially inferred from data. This model incorporates a trainable probability distribution, referred to as an order-policy, that dynamically decides the autoregressive order in a state-dependent manner. To train the model, we introduce a variational lower bound on the exact log-likelihood, which we optimize with stochastic gradient estimation. We demonstrate experimentally that our method can learn meaningful autoregressive orderings in image and graph generation. On the challenging domain of molecular graph generation, we achieve state-of-the-art results on the QM9 and ZINC250k benchmarks, evaluated using the Fréchet ChemNet Distance (FCD).
Paperid:2592
Authors:Taeyoung Yun · Kiyoung Om · Jaewoo Lee · Sujin Yun · Jinkyoo Park
Abstract:
Optimizing high-dimensional and complex black-box functions is crucial in numerous scientific applications. While Bayesian optimization (BO) is a powerful method for sample-efficient optimization, it struggles with the curse of dimensionality and scaling to thousands of evaluations. Recently, leveraging generative models to solve black-box optimization problems has emerged as a promising framework. However, those methods often underperform compared to BO methods due to limited expressivity and the difficulty of uncertainty estimation in high-dimensional spaces. To overcome these issues, we introduce \textbf{DiBO}, a novel framework for solving high-dimensional black-box optimization problems. Our method iterates between two stages. First, we train a diffusion model to capture the data distribution and an ensemble of proxies to predict function values with uncertainty quantification. Second, we cast candidate selection as a posterior inference problem to balance exploration and exploitation in high-dimensional spaces. Concretely, we fine-tune diffusion models to amortize posterior inference. Extensive experiments demonstrate that our method outperforms state-of-the-art baselines across various synthetic and real-world black-box optimization tasks. Our code is publicly available \href{https://anonymous.4open.science/r/DiBO-E486}{here}.
Paperid:2593
Authors:Dongjae Jeon · Dueun Kim · Albert No
Abstract:
In this paper, we introduce a geometric framework to analyze memorization in diffusion models through the sharpness of the log probability density. We mathematically justify a previously proposed score-difference-based memorization metric by demonstrating its effectiveness in quantifying sharpness. Additionally, we propose a novel memorization metric that captures sharpness at the initial stage of image generation in latent diffusion models, offering early insights into potential memorization. Leveraging this metric, we develop a mitigation strategy that optimizes the initial noise of the generation process using a sharpness-aware regularization term.
Paperid:2594
Authors:Xin Su · Man Luo · Kris Pan · Tien Pei Chou · Vasudev Lal · Phillip Howard
Abstract:
Multimodal retrieval-augmented generation (RAG) plays a crucial role in domains such as knowledge-based visual question answering (KB-VQA), where models should effectively integrate additional knowledge to generate a response. However, existing vision and language models (VLMs) are not inherently designed for context-augmented generation, limiting their effectiveness in such tasks. While synthetic data generation has recently gained attention for training large VLMs, its application for context-augmented generation remains underexplored. To address this gap, we introduce SKVQA, a large-scale synthetic multimodal dataset containing over 2 million visual question-answer pairs, each associated with external knowledge sources to determine the final answer. Compared to previous datasets, SKVQA exhibits 11× more unique questions, greater domain diversity, and a broader spectrum of image sources. Through human evaluations, we confirm the high quality of the generated question-answer pairs and their contextual relevance. Extensive experiments show that SKVQA serves both as a challenging benchmark for knowledge-based VQA and as an effective training resource for adapting generative multimodal models to context-augmented generation. Our results further indicate that models trained on SKVQA demonstrate enhanced generalization in both context-aware VQA and multimodal RAG settings.
Paperid:2595
Authors:Chamika Sudusinghe · Gerasimos Gerogiannis · Damitha Lenadora · Charles Block · Josep Torrellas · Charith Mendis
Abstract:
Sparse tensor programs are essential in deep learning and graph analytics, driving the need for optimized processing. To meet this demand, specialized hardware accelerators are being developed. Optimizing these programs for accelerators is challenging for two reasons: program performance is highly sensitive to variations in sparse inputs, and early-stage accelerators rely on expensive simulators. Therefore, ML-based cost models used for optimizing such programs on general-purpose hardware are often ineffective for early-stage accelerators, as they require large datasets for proper training. To this end, we introduce COGNATE, a novel framework that leverages inexpensive data samples from general-purpose hardware (e.g., CPUs) to train cost models, followed by few-shot fine-tuning on emerging hardware. COGNATE exploits the homogeneity of input features across hardware platforms while effectively mitigating heterogeneity, enabling cost model training with just 5% of the data samples needed by accelerator-specific models to achieve comparable performance. We conduct extensive experiments to demonstrate that COGNATE outperforms existing techniques, achieving average speedups of 1.47× (up to 5.46×) for SpMM and 1.39× (up to 4.22×) for SDDMM.
Paperid:2596
Authors:Zirui Liu · Jiatong Li · Yan Zhuang · Qi Liu · Shuanghong Shen · Jie Ouyang · Mingyue Cheng · Shijin Wang
Abstract:
Arena-based evaluation is a fundamental yet significant evaluation paradigm for modern AI models, especially large language models (LLMs). Existing frameworks based on the Elo rating system suffer from an inevitable instability problem due to ranking inconsistency and a lack of attention to the varying abilities of annotators. In this paper, we introduce a novel stable arena framework that addresses these issues by enhancing the Elo rating system. Specifically, we replace the iterative update method with a Maximum Likelihood Estimation (MLE) approach, m-ELO, and provide theoretical proof of the consistency and stability of the MLE approach for model ranking. Additionally, we propose am-ELO, which modifies the Elo rating's probability function to incorporate annotator abilities, enabling the simultaneous estimation of model scores and annotator reliability. Experiments demonstrate that this method ensures stability, proving that our framework offers a more robust, accurate, and stable evaluation method for LLMs.
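The MLE idea behind m-ELO can be sketched as a joint Bradley-Terry fit over all pairwise outcomes. The gradient-ascent implementation below is our own illustration (without the annotator-ability extension of am-ELO), not the paper's code.

```python
import numpy as np

def fit_mle_elo(wins, lr=0.5, steps=2000):
    # wins[i, j] = number of times model i beat model j.
    # Maximize sum_{i,j} wins[i, j] * log sigmoid(s_i - s_j) over scores s.
    n = wins.shape[0]
    s = np.zeros(n)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(s[None, :] - s[:, None]))  # p[i, j] = P(i beats j)
        grad = (wins * (1 - p) - wins.T * p).sum(axis=1)
        s += lr * grad / max(wins.sum(), 1)
        s -= s.mean()  # scores are identifiable only up to a shift
    return s

wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]])
scores = fit_mle_elo(wins)
# model 0 dominates, so scores[0] > scores[1] > scores[2]
```

Because all comparisons are fit jointly, the estimate does not depend on the order in which battles arrive, which is one source of the stability the abstract refers to.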
Paperid:2597
Authors:Tianyi Zhang · Yu Cao · Dianbo Liu
Abstract:
Federated learning (FL), aimed at leveraging vast distributed datasets, confronts a crucial challenge: the heterogeneity of data across different silos. While previous studies have explored discrete representations to enhance model generalization across minor distributional shifts, these approaches often struggle to adapt to new data silos with significantly divergent distributions. In response, we have identified that models derived from FL exhibit markedly increased uncertainty when applied to data silos with unfamiliar distributions. Consequently, we propose an innovative yet straightforward iterative framework, termed \emph{Uncertainty-Based Extensible-Codebook Federated Learning (UEFL)}. This framework dynamically maps latent features to trainable discrete vectors, assesses the uncertainty, and specifically extends the discretization dictionary or codebook for silos exhibiting high uncertainty. Our approach aims to simultaneously enhance accuracy and reduce uncertainty by explicitly addressing the diversity of data distributions, all while maintaining minimal computational overhead in environments characterized by heterogeneous data silos. Through experiments conducted on various datasets, our method has demonstrated its superiority, achieving significant improvements in accuracy (by 3\%--22.1\%) and uncertainty reduction (by 38.83\%--96.24\%), thereby outperforming contemporary state-of-the-art methods.
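A toy sketch of the extensible-codebook mechanism, under our own assumptions about its shape (not the paper's implementation): distance to the nearest code acts as the uncertainty signal, and high-uncertainty latents extend the codebook.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(4, 8))  # four initial discrete vectors

def quantize(z, threshold=3.0):
    # Map a latent to its nearest code; if even the nearest code is far
    # (high uncertainty), append z itself as a new code.
    global codebook
    d = np.linalg.norm(codebook - z, axis=1)
    if d.min() > threshold:
        codebook = np.vstack([codebook, z])
        return len(codebook) - 1
    return int(d.argmin())

i_near = quantize(codebook[1] + 0.1)   # snaps to an existing code
i_far = quantize(codebook[0] + 10.0)   # triggers codebook extension
```

In the actual framework the codes are trainable and the uncertainty estimate is more principled, but the control flow (quantize, assess, extend) follows this pattern.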
Paperid:2598
Authors:Junkang Wu · xue wang · Zhengyi Yang · Jiancan Wu · Jinyang Gao · Bolin Ding · Xiang Wang · Xiangnan He
Abstract:
Aligning large language models (LLMs) with human preferences requires balancing policy optimization with computational stability. While recent offline methods like DPO and SimPO bypass reinforcement learning’s complexity, they face critical limitations: DPO relies on static reference models that degrade with policy updates, and SimPO assumes a uniform target reward margin that ignores instance-wise preference strength. We propose AlphaDPO, an adaptive preference optimization framework that dynamically reparameterizes the reference distribution to address these issues. Our key innovation lies in an implicit reference model $\hat{\pi}_{\text{ref}} \propto U(y|x)(\pi_\theta/\pi_{\text{ref}})^\alpha$, which interpolates between policy-driven specialization and uniform exploration while enabling instance-adaptive reward margins. Theoretically, we prove AlphaDPO implicitly controls the sequential KL divergence between iterative policy updates, ensuring stability even with poorly calibrated reference models. Empirically, AlphaDPO achieves state-of-the-art performance on AlpacaEval 2 (58.7\% LC win rate) and Arena-Hard (35.7\% win rate) across Mistral2-7B, Llama3-8B, and Gemma2-9B, demonstrating robust alignment without multi-stage training. Our work establishes adaptive reference reparameterization as a principled mechanism for preference optimization.
Paperid:2599
Authors:Meilin Chen · Jian Tian · Liang Ma · Di Xie · Weijie Chen · Jiang Zhu
Abstract:
Benchmark contamination has become a significant concern in the LLM evaluation community. Previous Agents-as-an-Evaluator methods address this issue by involving agents in the generation of questions. Despite their success, the biases in Agents-as-an-Evaluator methods remain largely unexplored. In this paper, we present a theoretical formulation of evaluation bias, providing valuable insights into designing unbiased evaluation protocols. Furthermore, we identify two types of bias in Agents-as-an-Evaluator through carefully designed probing tasks on a minimal Agents-as-an-Evaluator setup. To address these issues, we propose the Unbiased Evaluator, an evaluation protocol that delivers a more comprehensive, unbiased, and interpretable assessment of LLMs. Extensive experiments reveal significant room for improvement in current LLMs. Additionally, we demonstrate that the Unbiased Evaluator not only offers strong evidence of benchmark contamination but also provides interpretable evaluation results. Code will be released soon.
Paperid:2600
Authors:Xiangxin Zhou · xiao yi · Mingyu Li · Jiahan Li · Dongyu Xue · Zaixiang Zheng · Jianzhu Ma · Quanquan Gu
Abstract:
Cyclic peptides offer inherent advantages in pharmaceuticals. For example, cyclic peptides are more resistant to enzymatic hydrolysis compared to linear peptides and usually exhibit excellent stability and affinity. Although deep generative models have achieved great success in linear peptide design, several challenges prevent the development of computational methods for designing diverse types of cyclic peptides. These challenges include the scarcity of 3D structural data on target proteins and associated cyclic peptide ligands, the geometric constraints that cyclization imposes, and the involvement of nonstandard amino acids in cyclization. To address the above challenges, we introduce CpSDE, which consists of two key components: AtomSDE, a generative structure prediction model based on harmonic SDE, and ResRouter, a residue type predictor. Utilizing a routed sampling algorithm that alternates between these two models to iteratively update sequences and structures, CpSDE facilitates the generation of cyclic peptides. By employing explicit all-atom and bond modeling, CpSDE overcomes existing data limitations and is proficient in designing a wide variety of cyclic peptides. Our experimental results demonstrate that the cyclic peptides designed by our method exhibit reliable stability and affinity.
Paperid:2601
Authors:Yujin Han · Andi Han · Wei Huang · Chaochao Lu · Difan Zou
Abstract:
Despite the remarkable success of diffusion models (DMs) in data generation, they exhibit specific failure cases with unsatisfactory outputs. We focus on one such limitation: the ability of DMs to learn hidden rules between image features. Specifically, for image data with dependent features $\mathbf{x}$ and $\mathbf{y}$ (e.g., the height of the sun ($\mathbf{x}$) and the length of the shadow ($\mathbf{y}$)), we investigate whether DMs can accurately capture the inter-feature rule $p(\mathbf{y}|\mathbf{x})$. Empirical evaluations on mainstream DMs (e.g., Stable Diffusion 3.5) reveal consistent failures, such as inconsistent lighting-shadow relationships and mismatched object-mirror reflections. Inspired by these findings, we design four synthetic tasks with strongly correlated features to assess DMs' rule-learning abilities. Extensive experiments show that while DMs can identify coarse-grained rules, they struggle with fine-grained ones. Our theoretical analysis demonstrates that DMs trained via denoising score matching (DSM) exhibit constant errors in learning hidden rules, as the DSM objective is not compatible with rule conformity. To mitigate this, we introduce a common technique, incorporating additional classifier guidance during sampling, which achieves (limited) improvements. Our analysis reveals that the subtle signals of fine-grained rules are challenging for the classifier to capture, providing insights for future exploration.
Paperid:2602
Authors:Sen Xing · Muyan Zhong · Zeqiang Lai · Liangchen Li · Jiawen Liu · Yaohui Wang · Jifeng Dai · Wenhai Wang
Abstract:
In this work, we explore a cost-effective framework for multilingual image generation. We find that, unlike models tuned on high-quality images with multilingual annotations, leveraging text encoders pre-trained on widely available, noisy Internet image-text pairs significantly enhances data efficiency in text-to-image (T2I) generation across multiple languages. Based on this insight, we introduce MuLan (Multi-Language adapter), a lightweight language adapter with fewer than 20M parameters, trained alongside a frozen text encoder and image diffusion model. Compared to previous multilingual T2I models, this framework offers: (1) Cost efficiency. Using readily accessible English data and off-the-shelf multilingual text encoders minimizes the training cost; (2) High performance. Achieving comparable generation capabilities in over 110 languages with CLIP similarity scores nearly matching those in English (39.57 for English vs. 39.61 for other languages); and (3) Broad applicability. Seamlessly integrating with compatible community tools like LoRA, LCM, ControlNet, and IP-Adapter, expanding its potential use cases.
Paperid:2603
Authors:Liang CHEN · Xueting Han · Li Shen · Jing Bai · Kam-Fai Wong
Abstract:
Harmful fine-tuning (HFT), performed directly on open-source LLMs or through Fine-tuning-as-a-Service, breaks safety alignment and poses significant threats. Existing methods aim to mitigate HFT risks by learning robust representations on alignment data or making harmful data unlearnable, but they treat each data sample equally, leaving data vulnerability patterns understudied. In this work, we reveal that certain subsets of alignment data are consistently more prone to forgetting during HFT across different fine-tuning tasks and exhibit lower robustness compared to other subsets. Inspired by these findings, we propose Vulnerability-Aware Alignment (VAA), which calculates data vulnerability, partitions data into "vulnerable" and "invulnerable" groups, and encourages balanced learning using a group distributionally robust optimization (Group DRO) framework. Specifically, VAA learns an adversarial sampler that samples examples from the hard group and then applies group-dependent adversarial perturbations to the data during training, ultimately achieving convergence to a balanced learning equilibrium. Experiments across four fine-tuning tasks demonstrate that VAA significantly reduces harmful scores while preserving downstream task performance, outperforming state-of-the-art baselines.
Paperid:2604
Authors:Enes Altinisik · Safa Messaoud · Husrev Taha Sencar · Hassan Sajjad · Sanjay Chawla
Abstract:
Adversarial Training (AT) impacts different architectures in distinct ways: vision models gain robustness but face reduced generalization, encoder-based models exhibit limited robustness improvements with minimal generalization loss, and recent work in latent-space adversarial training demonstrates that decoder-based models achieve improved robustness by applying AT across multiple layers. We provide the first explanation for these trends by leveraging the manifold conjecture: off-manifold adversarial examples (AEs) enhance robustness, while on-manifold AEs improve generalization. We show that vision and decoder-based models exhibit low intrinsic dimensionality in earlier layers (favoring off-manifold AEs), whereas encoder-based models do so in later layers (favoring on-manifold AEs). Exploiting this property, we introduce SMAAT, which improves the scalability of AT for encoder-based models by perturbing the layer with the lowest intrinsic dimensionality. This reduces the projected gradient descent (PGD) chain length required for AE generation, cutting GPU time by 25–33% while significantly boosting robustness. We validate SMAAT across multiple tasks, including text generation, sentiment classification, safety filtering, and retrieval in RAG setups, demonstrating superior robustness with comparable generalization to standard training.
Paperid:2605
Authors:Bingshan Hu · Zhiming Huang · Tianyue Zhang · Mathias Lécuyer · Nidhi Hegde
Abstract:
We address differentially private stochastic bandit problems by exploring the deep connections among Thompson Sampling with Gaussian priors, Gaussian mechanisms, and Gaussian differential privacy (GDP). We propose DP-TS-UCB, a novel parametrized private bandit algorithm that enables trading off privacy and regret. DP-TS-UCB satisfies $\tilde{O}\left(T^{0.25(1-\alpha)}\right)$-GDP and enjoys an $O\left(K\ln^{\alpha+1}(T)/\Delta\right)$ regret bound, where $\alpha \in [0,1]$ controls the trade-off between privacy and regret. Theoretically, our DP-TS-UCB relies on anti-concentration bounds of Gaussian distributions and links exploration mechanisms in Thompson Sampling-based algorithms and Upper Confidence Bound-based algorithms, which may be of independent interest.
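For intuition, the Gaussian-prior Thompson Sampling core that DP-TS-UCB builds on can be sketched as follows. This shows only the vanilla sampling step, not the private algorithm: the privacy calibration (how the noise scale depends on $\alpha$ and $T$) is omitted.

```python
import numpy as np

def gaussian_ts_step(means, counts, rng, scale=1.0):
    # Thompson Sampling with Gaussian posteriors: draw one sample per arm
    # from N(mean_k, scale^2 / n_k) and play the argmax
    std = scale / np.sqrt(np.maximum(counts, 1.0))
    return int(np.argmax(means + rng.normal(size=len(means)) * std))
```

The connection exploited by the paper is that this posterior-sampling noise is itself Gaussian, so it can double as (part of) a Gaussian privacy mechanism when its variance is inflated appropriately.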
Paperid:2606
Authors:Xuantong Liu · Shaozhe Hao · Xianbiao Qi · Tianyang Hu · JUN WANG · Rong Xiao · Yuan Yao
Abstract:
The success of large language models (LLMs) in text generation has inspired their application to image generation. However, existing methods either rely on specialized designs with inductive biases or adopt LLMs without fully exploring their potential in vision tasks. In this work, we systematically investigate the design space of LLMs for image generation and demonstrate that LLMs can achieve near state-of-the-art performance without domain-specific designs, simply by making proper choices in tokenization methods, modeling approaches, scan patterns, vocabulary design, and sampling strategies. We further analyze autoregressive models' learning and scaling behavior, revealing how larger models effectively capture more useful information than smaller ones. Additionally, we explore the inherent differences between text and image modalities, highlighting the potential of LLMs across domains. The exploration provides valuable insights to inspire more effective designs when applying LLMs to other domains. With extensive experiments, our proposed model, **ELM**, achieves an FID of 1.54 on 256$\times$256 ImageNet and an FID of 3.29 on 512$\times$512 ImageNet, demonstrating the powerful generative potential of LLMs in vision tasks.
Paperid:2607
Authors:Yijiang Li · Qingying Gao · Tianwei Zhao · Haiyun Lyu · Bingyang Wang · Haoran Sun · Robert Hawkins · Nuno Vasconcelos · Tal Golan · Dezhi Luo · Hokin Deng
Abstract:
While Multimodal Large Language Models (MLLMs) demonstrate impressive abilities over high-level perception and reasoning, their robustness in the wild still lags behind humans and exhibits diminished efficacy on simple tasks that are intuitive for humans. We examine the hypothesis that these deficiencies stem from the absence of core knowledge, the rudimentary cognitive abilities innate to humans from early childhood. To probe core knowledge representation in MLLMs, we draw from developmental cognitive sciences and develop a large-scale benchmark, the \textbf{CoreCognition dataset}, encompassing 12 core cognitive concepts. We evaluate 219 models with 10 different prompts, leading to a total of 2409 data points for analysis. Our findings reveal core knowledge deficits in early-developed core abilities, while models demonstrate human-comparable performance in high-level cognition. Moreover, we find that low-level abilities show little to no scaling, in stark contrast to high-level abilities. Finally, we introduce an evaluation technique, ``Concept Hacking'', through which we demonstrate that MLLMs do not genuinely advance toward core knowledge but instead rely on superficial patterns and shortcut learning as they scale.
Paperid:2608
Authors:Nikhil Pratap Ghanathe · Steve Wilton
Abstract:
Uncertainty quantification (UQ) provides a resource-efficient solution for on-device monitoring of tinyML models deployed remotely without access to true labels. However, existing UQ methods impose significant memory and compute demands, making them impractical for ultra-low-power, KB-sized tinyML devices. Prior work has attempted to reduce overhead by using early-exit ensembles to quantify uncertainty in a single forward pass, but these approaches still carry prohibitive costs. To address this, we propose QUTE, a novel resource-efficient early-exit-assisted ensemble architecture optimized for tinyML models. QUTE introduces additional output blocks at the final exit of the base network, distilling early-exit knowledge into these blocks to form a diverse yet lightweight ensemble. We show that QUTE delivers superior uncertainty quality on tiny models, achieving comparable performance on larger models with 59% smaller model sizes than the closest prior work. When deployed on a microcontroller, QUTE demonstrates a 31% reduction in latency on average. In addition, we show that QUTE excels at detecting accuracy-drop events, outperforming all prior works.
Paperid:2609
Authors:Yang Li · Jie Ma · Miguel Ballesteros · Yassine Benajiba · Graham Horwood
Abstract:
As large language models (LLMs) become increasingly versatile, numerous large-scale benchmarks have been developed to thoroughly assess their capabilities. These benchmarks typically consist of diverse datasets and prompts to evaluate different aspects of LLM performance. However, comprehensive evaluations on hundreds or thousands of prompts incur tremendous costs in terms of computation, money, and time. In this work, we investigate strategies to improve evaluation efficiency by selecting a subset of examples from each benchmark using a learned policy. Our approach models the dependencies across test examples, allowing accurate prediction of the evaluation outcomes for the remaining examples based on the outcomes of the selected ones. Consequently, we only need to acquire the actual evaluation outcomes for the selected subset. We rigorously explore various subset selection policies and introduce a novel RL-based policy that leverages the captured dependencies. Empirical results demonstrate that our approach significantly reduces the number of evaluation prompts required while maintaining accurate performance estimates compared to previous methods.
Paperid:2610
Authors:Kyowoon Lee · Jaesik Choi
Abstract:
Recent advances in diffusion-based generative modeling have demonstrated significant promise in tackling long-horizon, sparse-reward tasks by leveraging offline datasets. While these approaches have achieved promising results, their reliability remains inconsistent due to the inherent stochastic risk of producing infeasible trajectories, limiting their applicability in safety-critical applications. We identify that the primary cause of these failures is inaccurate guidance during the sampling procedure, and demonstrate the existence of manifold deviation by deriving a lower bound on the guidance gap. To address this challenge, we propose Local Manifold Approximation and Projection (LoMAP), a training-free method that projects the guided sample onto a low-rank subspace approximated from offline datasets, preventing infeasible trajectory generation. We validate our approach on standard offline reinforcement learning benchmarks that involve challenging long-horizon planning. Furthermore, we show that, as a standalone module, LoMAP can be incorporated into the hierarchical diffusion planner, providing further performance enhancements.
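A toy version of the projection step, assuming the "manifold" is approximated globally by the top principal components of the offline data (the paper's local, per-sample approximation is not reproduced here):

```python
import numpy as np

def lomap_project(x, offline_data, rank):
    # fit a rank-r affine subspace to the offline dataset via SVD and
    # project the guided sample onto it, pulling it back toward feasibility
    mu = offline_data.mean(axis=0)
    Vt = np.linalg.svd(offline_data - mu, full_matrices=False)[2][:rank]
    return mu + (x - mu) @ Vt.T @ Vt
```

In a guided sampler, a step like this would be applied to the intermediate sample after each guidance update, so the trajectory never drifts far off the data subspace.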
Paperid:2611
Authors:Jon Saad-Falcon · Adrian Lafuente · Shlok Natarajan · Nahum Maru · Hristo Todorov · Etash Guha · Estefany Kelly Buchanan · Mayee Chen · Neel Guha · Christopher Re · Azalia Mirhoseini
Abstract:
Inference-time techniques, such as repeated sampling or iterative revisions, are emerging as powerful ways to enhance large language models (LLMs) at test time. However, best practices for developing systems that combine these techniques remain underdeveloped due to our limited understanding of the utility of each technique across models and tasks, the interactions between them, and the massive search space for combining them. To address these challenges, we introduce Archon, a modular and automated framework for optimizing the process of selecting and combining inference-time techniques and LLMs. Given a compute budget and a set of available LLMs, Archon explores a large design space to discover optimized configurations tailored to target benchmarks. It can design custom or general-purpose architectures that advance the Pareto frontier of accuracy vs. maximum token budget compared to top-performing baselines. Across instruction-following, reasoning, and coding tasks, we show that Archon can leverage additional inference compute budget to design systems that outperform frontier models such as OpenAI’s o1, GPT-4o, and Claude 3.5 Sonnet by an average of 15.1%.
Paperid:2612
Authors:HyunGi Kim · Jisoo Mok · Dong Jun Lee · Jaihyun Lew · Sungjae Sungjae · Sungroh Yoon
Abstract:
Utilizing the complex inter-variable causal relationships within multivariate time-series provides a promising avenue toward more robust and reliable multivariate time-series anomaly detection (MTSAD) but remains an underexplored area of research. This paper proposes Causality-Aware contrastive learning for RObust multivariate Time-Series (CAROTS), a novel MTSAD pipeline that incorporates the notion of causality into contrastive learning. CAROTS employs two data augmentors to obtain causality-preserving and -disturbing samples that serve as a wide range of normal variations and synthetic anomalies, respectively. With causality-preserving and -disturbing samples as positives and negatives, CAROTS performs contrastive learning to train an encoder whose latent space separates normal and abnormal samples based on causality. Moreover, CAROTS introduces a similarity-filtered one-class contrastive loss that encourages the contrastive learning process to gradually incorporate more semantically diverse samples with common causal relationships. Extensive experiments on five real-world and two synthetic datasets validate that the integration of causal relationships endows CAROTS with improved MTSAD capabilities.
Paperid:2613
Authors:Shengju Yu · Siwei Wang · Xinwang Liu · Yiu-ming Cheung · En Zhu
Abstract:
Despite remarkable advances, existing incomplete multi-view clustering (IMC) methods typically leverage either perspective-shared or perspective-specific determinants to encode cluster representations, but not both. To address this limitation, we introduce a BACDL algorithm designed to explicitly capture both concurrently, thereby exploiting heterogeneous data more effectively. It bifurcates feature clusters and further separates them to enlarge the discrimination. With distribution learning, it successfully couples view guidance into feature clusters to alleviate dimension inconsistency. Then, building on the principle that samples in one common cluster share similar marginal and conditional distributions, it unifies the association between feature clusters and sample clusters to bridge all views. Thereafter, all incomplete sample clusters are reordered and mapped to a common one to formulate the clustering embedding. Last, the overall linear overhead endows it with a resource-efficient characteristic.
Paperid:2614
Authors:Jonathan Scott · Christoph Lampert · David Saulpic
Abstract:
Clustering is a cornerstone of data analysis that is particularly suited to identifying coherent subgroups or substructures in unlabeled data, which is generated continuously in large amounts these days. However, in many cases traditional clustering methods are not applicable, because data are increasingly being produced and stored in a distributed way, e.g. on edge devices, and privacy concerns prevent them from being transferred to a central server. To address this challenge, we present a new algorithm for $k$-means clustering that is fully federated as well as differentially private. Our approach leverages (potentially small and out-of-distribution) server-side data to overcome the primary challenge of differentially private clustering methods: the need for a good initialization. Combining our initialization with a simple federated DP-Lloyds algorithm, we obtain an algorithm that achieves excellent results on synthetic and real-world benchmark tasks. We also provide a theoretical analysis of our method that provides bounds on the convergence speed and cluster identification success.
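A single noisy Lloyd update, the central-model building block behind a DP-Lloyds scheme, might look like the sketch below. The Laplace noise scales are illustrative placeholders assuming bounded data, not a calibrated privacy accounting, and the federated aggregation across clients is not shown.

```python
import numpy as np

def dp_lloyd_step(X, centers, eps, rng):
    # one differentially private Lloyd update: perturb per-cluster point
    # counts and coordinate sums with Laplace noise before recomputing means
    assign = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
    new = centers.copy()
    for j in range(len(centers)):
        pts = X[assign == j]
        count = max(1.0, len(pts) + rng.laplace(scale=2.0 / eps))
        total = pts.sum(axis=0) + rng.laplace(scale=2.0 / eps, size=X.shape[1])
        new[j] = total / count
    return new
```

Because each iteration spends privacy budget, a good initialization (the paper's focus) directly reduces how many noisy updates are needed.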
Paperid:2615
Authors:Talal Widatalla · Richard Shuai · Brian Hie · Po-Ssu Huang
Abstract:
Leading deep learning-based methods for fixed-backbone protein sequence design do not model protein sidechain conformation during sequence generation, despite the large role the three-dimensional arrangement of sidechain atoms plays in protein conformation, stability, and overall protein function. Instead, these models implicitly reason about crucial sidechain interactions based solely on backbone geometry and amino-acid sequence. To address this, we present FAMPNN (Full-Atom MPNN), a sequence design method that explicitly models both sequence identity and sidechain conformation for each residue, where the per-token distribution of a residue’s discrete amino acid identity and its continuous sidechain conformation are learned with a combined categorical cross-entropy and diffusion loss objective. We demonstrate that learning these distributions jointly is a highly synergistic task that improves sequence recovery while achieving state-of-the-art sidechain packing. Furthermore, benefits from explicit full-atom modeling generalize from sequence recovery to practical protein design applications, such as zero-shot prediction of experimental binding and stability measurements.
Paperid:2616
Authors:Paulius Rauba · Qiyao Wei · M van der Schaar
Abstract:
We address the challenge of measuring how input perturbations impact large language model (LLM) outputs, a fundamental task for model reliability and post-hoc interpretability. A key obstacle in this domain is disentangling meaningful changes in model responses from the intrinsic stochasticity of LLM outputs. To overcome this, we introduce Distribution-Based Perturbation Analysis (DBPA), a framework that reformulates LLM perturbation analysis as a frequentist hypothesis testing problem. DBPA constructs empirical null and alternative output distributions within a low-dimensional semantic similarity space via Monte Carlo sampling. Comparisons of Monte Carlo estimates in the new space enable tractable frequentist inference without relying on restrictive distributional assumptions. The framework is (i) model-agnostic, (ii) supports the evaluation of arbitrary input perturbations on any black-box LLM, (iii) yields interpretable p-values, (iv) supports multiple perturbation testing via controlled error rates, and (v) provides scalar effect sizes for any chosen similarity or distance metric. We demonstrate the effectiveness of DBPA in evaluating perturbation impacts across multiple case studies.
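The frequentist comparison can be sketched as a permutation test on similarity scores. This generic two-sample test stands in for DBPA's actual procedure; the inputs `null_sims` and `alt_sims` (similarities of baseline vs. perturbed-input outputs to reference outputs) are hypothetical.

```python
import numpy as np

def dbpa_pvalue(null_sims, alt_sims, n_perm=2000, seed=0):
    # permutation test on the shift in mean semantic similarity between
    # outputs under the original input and under the perturbed input
    rng = np.random.default_rng(seed)
    obs = abs(null_sims.mean() - alt_sims.mean())
    pooled = np.concatenate([null_sims, alt_sims])
    n, hits = len(null_sims), 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        hits += abs(pooled[:n].mean() - pooled[n:].mean()) >= obs
    return (hits + 1) / (n_perm + 1)
```

Because the test operates on sampled similarity scores rather than raw text, it stays model-agnostic and makes no parametric assumption about the output distribution.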
Paperid:2617
Authors:Peter Halmos · Julian Gold · Xinhao Liu · Benjamin Raphael
Abstract:
Optimal transport (OT) has enjoyed great success in machine learning as a principled way to align datasets via a least-cost correspondence, with its success driven in large part by the efficiency of the Sinkhorn algorithm. However, the Sinkhorn algorithm and subsequent approaches have quadratic space complexity, limiting their scalability to larger datasets. Low-rank OT achieves linear space complexity, but by definition cannot compute bijective correspondences. We prove that when a Monge map $T$ exists, the factors of an optimal low-rank coupling co-cluster each point $\mathbf{x} \in \mathsf{X}$ with its image under the Monge map. We derive an algorithm, _hierarchical refinement_ (`HR-OT`), that leverages this invariant, dynamically constructing a multiscale partition of each dataset using low-rank OT subproblems, culminating in a bijective transport plan. `HR-OT` uses _linear_ space and has runtime ranging from log-linear to quadratic. Thus, `HR-OT` retains the space advantage of low-rank OT while overcoming its resolution limitations. We demonstrate `HR-OT`'s advantages on several datasets, including datasets containing over a million points, extending full-rank OT to datasets previously out of Sinkhorn's reach.
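A toy illustration of the refinement recursion, substituting principal-axis bisection for the low-rank OT subproblems (so this is only a sketch of the multiscale idea, not `HR-OT` itself). It assumes both sets have the same power-of-two size.

```python
import numpy as np

def hr_match(X, Y, ix=None, iy=None):
    # recursively bisect both point sets along their principal axis and pair
    # the halves by centroid distance, yielding a bijection without ever
    # materializing an n x n coupling matrix
    if ix is None:
        ix, iy = np.arange(len(X)), np.arange(len(Y))
    if len(ix) == 1:
        return [(int(ix[0]), int(iy[0]))]
    def split(P, idx):
        centered = P[idx] - P[idx].mean(0)
        axis = np.linalg.svd(centered, full_matrices=False)[2][0]
        order = idx[np.argsort(P[idx] @ axis)]
        half = len(order) // 2
        return order[:half], order[half:]
    x1, x2 = split(X, ix)
    y1, y2 = split(Y, iy)
    def d(a, b):
        return np.linalg.norm(X[a].mean(0) - Y[b].mean(0))
    if d(x1, y1) + d(x2, y2) > d(x1, y2) + d(x2, y1):
        y1, y2 = y2, y1  # crossed pairing is cheaper, swap the halves
    return hr_match(X, Y, x1, y1) + hr_match(X, Y, x2, y2)
```

Each level stores only the partition, which is what keeps the memory footprint linear; in the real algorithm the split-and-pair step is a low-rank OT subproblem rather than a geometric heuristic.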
Paperid:2618
Authors:Vincent Cohen-Addad · Silvio Lattanzi · Simon Meierhans
Abstract:
We study the offline active learning problem on graphs. In this problem, one seeks to select k vertices whose labels are best suited for predicting the labels of all the other vertices in the graph. Guillory and Bilmes (2009) introduced a natural theoretical model motivated by a label smoothness assumption. Prior to our work, algorithms with theoretical guarantees were only known for restricted graph types such as trees (Cesa-Bianchi et al., 2010), despite the model's simplicity. We present the first O(log n)-resource-augmented algorithm for general weighted graphs. To complement our algorithm, we show constant hardness of approximation.
Paperid:2619
Authors:Yue Wang · Qizhou Wang · Feng Liu · Wei Huang · Yali Du · Xiaojiang Du · Bo Han
Abstract:
Large language model (LLM) unlearning has demonstrated its essential role in removing privacy- and copyright-related responses, which is crucial for legal and safe applications. However, the pursuit of complete unlearning often comes with substantial costs due to compromises in general functionality, leading to a notorious trade-off between unlearning and retention. This motivates us to explore enhanced unlearning schemes that can mitigate this trade-off. Specifically, we propose Gradient Rectified Unlearning (GRU), an improved framework that regulates the directions of gradient updates during the unlearning procedure so that their side effects on other, unrelated responses are minimized. GRU is easy and general to implement, demonstrating practical effectiveness across a variety of well-established unlearning benchmarks.
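One plausible reading of gradient rectification is a PCGrad-style conflict projection; this is an assumption about the mechanism for illustration, not the paper's exact update rule.

```python
import numpy as np

def rectify_gradient(g_unlearn, g_retain):
    # if the unlearning gradient conflicts with the retention gradient
    # (negative inner product), remove the conflicting component so the
    # unlearning step no longer pushes against retained behavior
    dot = float(g_unlearn @ g_retain)
    if dot < 0:
        g_unlearn = g_unlearn - (dot / float(g_retain @ g_retain)) * g_retain
    return g_unlearn
```

After rectification the update direction is guaranteed to have a non-negative inner product with the retention gradient, which is one way to bound side effects on unrelated responses.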
Paperid:2620
Authors:Spencer Hutchinson · Mahnoosh Alizadeh
Abstract:
In this work, we study online convex optimization with a fixed constraint function $g : \mathbb{R}^d \rightarrow \mathbb{R}$. Prior work on this problem has shown $O(\sqrt{T})$ regret and cumulative constraint satisfaction $\sum_{t=1}^{T} g(x_t) \leq 0$, while only accessing the constraint value and subgradient at the played actions, $g(x_t), \partial g(x_t)$. Using the same constraint information, we show a stronger guarantee of anytime constraint satisfaction $g(x_t) \leq 0 \ \forall t \in [T]$, and matching $O(\sqrt{T})$ regret guarantees. These contributions are thanks to our approach of using Polyak feasibility steps to ensure constraint satisfaction without sacrificing regret. Specifically, after each step of online gradient descent, our algorithm applies a subgradient descent step on the constraint function, where the step size is chosen according to the celebrated Polyak step size. We further validate this approach with numerical experiments.
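A minimal sketch of the loop described above. For an affine constraint a single Polyak step lands exactly on the boundary; for general convex $g$ this toy version is illustrative only and does not reproduce the paper's guarantees.

```python
import numpy as np

def ogd_with_polyak_feasibility(grad_f, g, subgrad_g, x0, eta, T):
    # online gradient descent; after each step, if the constraint is
    # violated, take one Polyak-step subgradient move toward feasibility
    x, played = np.asarray(x0, float).copy(), []
    for _ in range(T):
        x = x - eta * grad_f(x)
        gx = g(x)
        if gx > 0:
            s = subgrad_g(x)
            x = x - (gx / float(s @ s)) * s   # Polyak step size: g(x) / ||s||^2
        played.append(x.copy())
    return np.array(played)
```

The key point is that feasibility is restored using only the same first-order constraint information ($g(x_t)$ and a subgradient) that prior work already assumed.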
Paperid:2621
Authors:Bo Zhao · Nima Dehmamy · Robin Walters · Rose Yu
Abstract:
Neural network minima are often connected by curves along which train and test loss remain nearly constant, a phenomenon known as mode connectivity. While this property has enabled applications such as model merging and fine-tuning, its theoretical explanation remains unclear. We propose a new approach to exploring the connectedness of minima using parameter space symmetry. By linking the topology of symmetry groups to that of the minima, we derive the number of connected components of the minima of linear networks and show that skip connections reduce this number. We then examine when mode connectivity and linear mode connectivity hold or fail, using parameter symmetries which account for a significant part of the minimum. Finally, we provide explicit expressions for connecting curves in the minima induced by symmetry. Using the curvature of these curves, we derive conditions under which linear mode connectivity approximately holds. Our findings highlight the role of continuous symmetries in understanding the neural network loss landscape.
Paperid:2622
Authors:Damjan Kalajdzievski
Abstract:
The field of mechanistic interpretability in pre-trained transformer models has demonstrated substantial evidence supporting the ``linear representation hypothesis'', which is the idea that high-level concepts are encoded as vectors in the space of activations of a model. Studies also show that model generation behavior can be steered toward a given concept by adding the concept's vector to the corresponding activations. We show how to leverage these properties to build a form of logical implication into models, enabling transparent and interpretable adjustments that induce a chosen generation behavior in response to the presence of any given concept. Our method, Logical Implication Model Steering (LIMS), unlocks new hand-engineered reasoning capabilities by integrating neuro-symbolic logic into pre-trained transformer models.
Paperid:2623
Authors:Zhendong Wang · Max Li · Ajay Mandlekar · Zhenjia Xu · Jiaojiao Fan · Yashraj Narang · Jim Fan · Yuke Zhu · Yogesh Balaji · Mingyuan Zhou · Ming-Yu Liu · Yu Zeng
Abstract:
Diffusion models, praised for their success in generative tasks, are increasingly being applied to robotics, demonstrating exceptional performance in behavior cloning. However, their slow generation process, stemming from iterative denoising steps, poses a challenge for real-time applications in resource-constrained robotics setups and dynamically changing environments. In this paper, we introduce the One-Step Diffusion Policy, a novel approach that distills knowledge from pre-trained diffusion policies into a single-step action generator, significantly accelerating response times for robotic control tasks. We ensure the distilled generator closely aligns with the original policy distribution by minimizing the Kullback-Leibler (KL) divergence along the diffusion chain, requiring only $2\%$-$10\%$ additional pre-training cost for convergence. We evaluate our method on 6 challenging simulation tasks as well as 4 self-designed real-world tasks using the Franka robot. The results demonstrate that our approach not only achieves state-of-the-art success rates but also delivers an order-of-magnitude improvement in inference speed, boosting action prediction frequency from 1.5 Hz to 62 Hz, establishing its potential for dynamic and computationally constrained robotic applications. A video demo is provided \href{https://drive.google.com/file/d/1eIa11gw6DwYKG9CKERy41bjE1ruklRtT/view?usp=sharing}{\textit{here}}, and the code will be publicly available soon.
Paperid:2624
Authors:Niclas Boehmer · Sara Fish · Ariel Procaccia
Abstract:
A key task in certain democratic processes is to produce a concise slate of statements that proportionally represents the full spectrum of user opinions. This task is similar to committee elections, but unlike traditional settings, the candidate set comprises all possible statements of varying lengths, and so it can only be accessed through specific queries. Combining social choice and large language models, prior work has approached this challenge through a framework of generative social choice. We extend the framework in two fundamental ways, providing theoretical guarantees even in the face of approximately optimal queries and a budget limit on the overall length of the slate. Using GPT-4o to implement queries, we showcase our approach on datasets related to city improvement measures and drug reviews, demonstrating its effectiveness in generating representative slates from unstructured user opinions.
Paperid:2625
Authors:wenzhi Feng · Xutao Li · Zhe Wu · Yunming Ye · Yaowei Wang · Kenghong Lin · Demin Yu
Abstract:
Most current precipitation nowcasting methods aim to capture the underlying spatiotemporal dynamics of precipitation systems by minimizing the mean square error (MSE). However, these methods often neglect effective constraints on the data distribution, leading to unsatisfactory prediction accuracy and image quality, especially for long forecast sequences. To address this limitation, we propose a precipitation nowcasting model incorporating perceptual constraints. This model reformulates precipitation nowcasting as a posterior MSE problem under such constraints. Specifically, we first obtain the posterior mean sequences of precipitation forecasts using a precipitation estimator. Subsequently, we construct the transmission between distributions using rectified flow. To enhance the focus on distant frames, we design a frame sampling strategy that gradually increases the corresponding weights. We theoretically demonstrate the reliability of our solution, and experimental results on two publicly available radar datasets demonstrate that our model is effective and outperforms current state-of-the-art models.
Paperid:2626
Authors:Georg Bökman · David Nordström · Fredrik Kahl
Abstract:
Incorporating geometric invariance into neural networks enhances parameter efficiency but typically increases computational costs. This paper introduces new equivariant neural networks that preserve symmetry while maintaining a comparable number of floating-point operations (FLOPs) per parameter to standard non-equivariant networks. We focus on horizontal mirroring (flopping) invariance, common in many computer vision tasks. The main idea is to parametrize the feature spaces in terms of mirror-symmetric and mirror-antisymmetric features, i.e., irreps of the flopping group. This decomposes the linear layers into block-diagonal form, requiring half the number of FLOPs. Our approach reduces both FLOPs and wall-clock time, providing a practical solution for efficient, scalable symmetry-aware architectures.
Paperid:2627
Authors:Akash Kumar
Abstract:
The success of deep networks is crucially attributed to their ability to capture latent features within a representation space. In this work, we investigate whether the underlying learned features of a model can be efficiently retrieved through feedback from an agent, such as a large language model (LLM), in the form of relative triplet comparisons. These features may represent various constructs, including dictionaries in LLMs or components of a covariance matrix of Mahalanobis distances. We analyze the feedback complexity associated with learning a feature matrix in sparse settings. Our results establish tight bounds when the agent is permitted to construct activations and demonstrate strong upper bounds in sparse scenarios when the agent's feedback is limited to distributional information. We validate our theoretical findings through experiments on two distinct applications: feature recovery from Recursive Feature Machine-trained models and dictionary extraction from sparse autoencoders trained on Large Language Models.
Paperid:2628
Authors:Eray Erturk · Fahad Kamran · Salar Abbaspourazad · Sean Jewell · Harsh Sharma · Yujie Li · Sinead A Williamson · Nicholas Foti · Joseph Futoma
Abstract:
Wearable devices record physiological and behavioral signals that can improve health predictions. While foundation models are increasingly used for such predictions, they have been primarily applied to low-level sensor data, despite behavioral data often being more informative due to their alignment with physiologically relevant timescales and quantities. We develop foundation models of such behavioral signals using over 2.5B hours of wearable data from 162K individuals, systematically optimizing architectures and tokenization strategies for this unique dataset. Evaluated on 57 health-related tasks, our model shows strong performance across diverse real-world applications including individual-level classification and time-varying health state prediction. The model excels in behavior-driven tasks like sleep prediction, and improves further when combined with representations of raw sensor data. These results underscore the importance of tailoring foundation model design to wearables and demonstrate the potential to enable new health applications.
Paperid:2629
Authors:Xiwen Chen · Wenhui Zhu · Peijie Qiu · Hao Wang · Huayu Li · ZIHAN LI · Yalin Wang · Aristeidis Sotiras · Abolfazl Razi
Abstract:
Analyzing time series data is crucial to a wide spectrum of applications, including economics, online marketplaces, and human healthcare. In particular, time series classification plays an indispensable role in segmenting different phases in stock markets, predicting customer behavior, and classifying worker actions and engagement levels. These aspects contribute significantly to the advancement of automated decision-making and system optimization in real-world applications. However, there is broad consensus that time series data often suffers from domain shifts between training and test sets, which dramatically degrade classification performance. Despite the success of (reversible) instance normalization in handling domain shifts for time series regression tasks, its performance in classification is unsatisfactory. In this paper, we propose $\textit{FIC-TSC}$, a training framework for time series classification that leverages Fisher information as a constraint. We theoretically and empirically show that this is an efficient and effective solution to guide the model to converge toward flatter minima, which enhances its generalizability to distribution shifts. We rigorously evaluate our method on 30 UEA multivariate and 85 UCR univariate datasets. Our empirical results demonstrate the superiority of the proposed method over 14 recent state-of-the-art methods.
Paperid:2630
Authors:Min-Yeong Park · Won-Jeong Lee · Seong Tae Kim · Gyeong-Moon Park
Abstract:
Recently, forecasting future abnormal events has emerged as an important scenario to address real-world needs. However, the solution of predicting specific future time points when anomalies will occur, known as Anomaly Prediction (AP), remains underexplored. Existing methods dealing with time series data fail in AP, focusing only on immediate anomalies or failing to provide precise predictions for future anomalies. To address AP, we propose a novel framework called Anomaly to Prompt (A2P), comprised of Anomaly-Aware Forecasting (AAF) and Synthetic Anomaly Prompting (SAP). To enable the forecasting model to forecast abnormal time points, we adopt a strategy to learn the relationships of anomalies. For the robust detection of anomalies, our proposed SAP introduces a learnable Anomaly Prompt Pool (APP) that simulates diverse anomaly patterns using signal-adaptive prompts. Comprehensive experiments on multiple real-world datasets demonstrate the superiority of A2P over state-of-the-art methods, showcasing its ability to predict future anomalies.
Paperid:2631
Authors:Guangyi Liu · Suzan Iloglu · Michael Caldara · Joseph Durham · Michael Zavlanos
Abstract:
In robotic warehouses, the destination-to-chute mapping problem is crucial for efficient package sorting. Often, however, this problem is complicated by uncertain and dynamic package induction rates, which can lead to increased package recirculation. To tackle this challenge, we introduce a Distributionally Robust Multi-Agent Reinforcement Learning (DRMARL) framework that learns a destination-to-chute mapping policy that is resilient to adversarial variations in induction rates. Specifically, DRMARL relies on group distributionally robust optimization (DRO) to learn a policy that performs well not only on average but also on each individual subpopulation of induction rates within the group, capturing, for example, different seasonality or operation modes of the system. This approach is then combined with a novel contextual bandit-based predictor of the worst-case induction distribution for each state-action pair, significantly reducing the cost of exploration and thereby increasing the learning efficiency and scalability of our framework. Extensive simulations demonstrate that DRMARL achieves robust chute mapping in the presence of varying induction distributions, reducing package recirculation by an average of 80% in the simulation scenario.
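The group DRO reweighting at the heart of DRMARL can be illustrated with a minimal sketch. This is not the authors' implementation: the exponentiated-gradient update and the toy per-group losses below are illustrative assumptions about how the worst-performing subpopulation gets upweighted.

```python
import numpy as np

def group_dro_weights(group_losses, eta=0.5):
    """One exponentiated-gradient step of group DRO: upweight the groups
    (here, induction-rate subpopulations) with the highest current loss,
    so training focuses on the worst case rather than the average."""
    w = np.exp(eta * np.asarray(group_losses, dtype=float))
    return w / w.sum()

# Hypothetical per-subpopulation policy losses (e.g. different seasons)
losses = [0.2, 1.5, 0.7]
w = group_dro_weights(losses)
print(w)  # heaviest weight lands on the worst group
```

In a full training loop these weights would scale each group's gradient contribution before the next policy update.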
Paperid:2632
Authors:Yuchen Zhang · Jian Zhou
Abstract:
Inverse generation problems, such as denoising without ground-truth observations, are a critical challenge in many scientific inquiries and real-world applications. While recent advances in generative models like diffusion models, conditional flow matching, and consistency models have achieved impressive results by casting generation as a denoising problem, they cannot be directly used for inverse generation without access to clean data. Here we introduce Inverse Flow (IF), a novel framework that enables using these generative models for inverse generation problems, including denoising without ground truth. Inverse Flow can be flexibly applied to nearly any continuous noise distribution and allows complex dependencies. We propose two algorithms for learning Inverse Flows, Inverse Flow Matching (IFM) and Inverse Consistency Model (ICM). Notably, to derive the computationally efficient, simulation-free inverse consistency model objective, we generalize consistency training to any forward diffusion process or conditional flow, which has applications beyond denoising. We demonstrate the effectiveness of IF on synthetic and real datasets, outperforming prior approaches while enabling noise distributions that previous methods cannot support. Finally, we showcase applications of our techniques to fluorescence microscopy and single-cell genomics data, highlighting IF's utility in scientific problems. Overall, this work expands the applications of powerful generative models to inverse generation problems.
Paperid:2633
Authors:Mohammad Jalali · Bahar Dibaei Nia · Farzan Farnia
Abstract:
Feature embedding models have been widely applied to many downstream tasks across various domains. While several standard feature embeddings have been developed in the literature, comparisons of these embeddings have largely focused on their numerical performance in classificationrelated downstream applications. However, an interpretable comparison of different embeddings requires identifying and analyzing mismatches between similar sample groups clustered within the embedding spaces. In this work, we propose the Spectral Pairwise Embedding Comparison (SPEC) framework to compare embeddings and identify their differences in clustering a benchmark dataset. Our approach examines the kernel matrices derived from two embeddings and leverages the eigendecomposition of the differential kernel matrix to detect sample clusters that are captured differently by the two embeddings. We present a scalable implementation of this kernel-based approach, with computational complexity that grows linearly with the sample size. Furthermore, we introduce an optimization problem using this framework to align two embeddings, ensuring that clusters detected by one embedding are also captured by the other. Finally, we provide numerical results showcasing the application of the SPEC framework to compare and align standard embeddings on large-scale datasets such as ImageNet and MS-COCO.
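A minimal sketch of the kernel-based comparison described above, using RBF kernels and synthetic 2D "embeddings" for illustration. This is not the authors' SPEC implementation (which additionally provides a scalable variant); the kernel choice and data are assumptions.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Pairwise squared distances -> RBF kernel matrix
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def spec_mismatch(emb_a, emb_b, gamma=1.0):
    """Eigendecompose the differential kernel K_A - K_B; large-magnitude
    entries of the leading eigenvectors flag samples whose cluster
    structure differs between the two embeddings."""
    K_diff = rbf_kernel(emb_a, gamma) - rbf_kernel(emb_b, gamma)
    eigvals, eigvecs = np.linalg.eigh(K_diff)  # symmetric -> eigh
    order = np.argsort(-np.abs(eigvals))       # sort by magnitude
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)
# Embedding A separates two groups; embedding B collapses them
emb_a = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])
emb_b = rng.normal(0, 0.1, (40, 2))
vals, vecs = spec_mismatch(emb_a, emb_b)
print(vals[0])  # dominant differential eigenvalue
```

The dominant eigenvector concentrates on the samples that embedding A clusters apart but embedding B merges.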
Paperid:2634
Authors:Riccardo De Santi · Marin Vlastelica · Ya-Ping Hsieh · Zebang Shen · Niao He · Andreas Krause
Abstract:
Exploration is critical for solving real-world decision-making problems such as scientific discovery, where the objective is to generate truly novel designs rather than mimic existing data distributions. In this work, we address the challenge of leveraging the representational power of generative models for exploration without relying on explicit uncertainty quantification. We introduce a novel framework that casts exploration as entropy maximization over the approximate data manifold implicitly defined by a pre-trained diffusion model. Then, we present a novel principle for exploration based on density estimation, a problem well-known to be challenging in practice. To overcome this issue and render this method truly scalable, we leverage a fundamental connection between the entropy of the density induced by a diffusion model and its score function. Building on this, we develop an algorithm based on mirror descent that solves the exploration problem as sequential fine-tuning of a pre-trained diffusion model. We prove its convergence to the optimal exploratory diffusion model under realistic assumptions by leveraging recent understanding of mirror flows. Finally, we empirically evaluate our approach on both synthetic and high-dimensional text-to-image diffusion, demonstrating promising results.
Paperid:2635
Authors:Alessandro Palma · Sergei Rybakov · Leon Hetzel · Stephan Günnemann · Fabian Theis
Abstract:
Latent space interpolations are a powerful tool for navigating deep generative models, especially in applied contexts where they capture system trajectories on the manifolds of high-dimensional data. An example is single-cell RNA sequencing (scRNA-seq), where existing methods model cellular state transitions as latent space interpolations with variational autoencoders, often assuming linear shifts and Euclidean geometry. However, unless explicitly enforced, linear interpolations in the latent space may not correspond to geodesic paths on the data manifold, limiting methods that assume Euclidean geometry in the representation space. We introduce FlatVI, a novel training framework that enforces Euclidean geometry in the latent manifold of discrete-likelihood variational autoencoders, specifically tailored for modelling single-cell count data. By regularising straight lines in the latent space to approximate geodesic interpolations on the decoded single-cell manifold, FlatVI enhances compatibility with downstream analyses that assume Euclidean latent geometry. Results on synthetic data validate our approach theoretically. At the same time, applications to temporally resolved scRNA-seq datasets demonstrate improved reconstruction of cellular trajectories and more accurate inference of biologically meaningful velocity fields.
Paperid:2636
Authors:Zhi Zheng · Zhuoliang Xie · Zhenkun Wang · Bryan Hooi
Abstract:
Handcrafting heuristics for solving complex optimization tasks (e.g., route planning and task allocation) is a common practice but requires extensive domain knowledge. Recently, Large Language Model (LLM)-based automatic heuristic design (AHD) methods have shown promise in generating high-quality heuristics without manual interventions. Existing LLM-based AHD methods employ a population to maintain a fixed number of top-performing LLM-generated heuristics and introduce evolutionary computation (EC) to iteratively enhance the population. However, these population-based procedures cannot fully develop the potential of each heuristic and are prone to converge into local optima. To more comprehensively explore the space of heuristics, this paper proposes to use Monte Carlo Tree Search (MCTS) for LLM-based heuristic evolution. The proposed MCTS-AHD method organizes all LLM-generated heuristics in a tree structure and can better develop the potential of temporarily underperforming heuristics. In experiments, MCTS-AHD delivers significantly higher-quality heuristics on various complex tasks. Our code is available.
Paperid:2637
Authors:Linda Lu · Ayush Sekhari · Karthik Sridharan
Abstract:
Machine unlearning addresses the problem of updating a machine learning model/system trained on a dataset $S$ so that the influence of a set of deletion requests $U \subseteq S$ on the unlearned model is minimized. The gold standard definition of unlearning demands that the updated model after deletion be nearly identical to the model obtained by retraining. This definition is designed for a worst-case attacker (one who can recover not only the unlearned model but also $S \setminus U$). Such a stringent definition has made developing efficient unlearning algorithms challenging. Such strong attackers are also unrealistic. In this work, we propose a new definition, *system-aware unlearning*, which aims to provide unlearning guarantees against an attacker that can at best only gain access to the data stored in the system for learning/unlearning requests and not $S \setminus U$. With this new definition, we use the simple intuition that if a system can store less to make its learning/unlearning updates, it can be more secure and update more efficiently against a system-aware attacker. We present an exact system-aware unlearning algorithm for linear classification using a selective sampling based approach. We theoretically analyze the tradeoffs between deletion capacity, accuracy, memory, and computation time and also show that our method can be practically efficient in handling deletion requests.
Paperid:2638
Authors:Yinbin Han · Meisam Razaviyayn · Renyuan Xu
Abstract:
Diffusion models have emerged as powerful tools for generative modeling, demonstrating exceptional capability in capturing target data distributions from large datasets. However, fine-tuning these massive models for specific downstream tasks, constraints, and human preferences remains a critical challenge. While recent advances have leveraged reinforcement learning algorithms to tackle this problem, much of the progress has been empirical, with limited theoretical understanding. To bridge this gap, we propose a stochastic control framework for fine-tuning diffusion models. Building on denoising diffusion probabilistic models as the pre-trained reference dynamics, our approach integrates linear dynamics control with Kullback–Leibler regularization. We establish the well-posedness and regularity of the stochastic control problem and develop a policy iteration algorithm (PI-FT) for numerical solution. We show that PI-FT achieves global convergence at a linear rate. Unlike existing work that assumes regularities throughout training, we prove that the control and value sequences generated by the algorithm maintain the desirable regularity. Finally, we study extensions of our framework to parametric settings for efficient implementation.
Paperid:2639
Authors:Shikang Liu · Chuyang Wei · Xiren Zhou · Huanhuan Chen
Abstract:
Analyzing inherent temporal dynamics is a critical pathway for time series classification, where Reservoir Computing (RC) exhibits effectiveness and high efficiency. However, typical RC considers recursive updates from adjacent states, struggling with long-term dependencies. In response, this paper proposes a Spectral-Aware Reservoir Computing framework (SARC), incorporating spectral insights to enhance long-term dependency modeling. Prominent frequencies are initially extracted to reveal explicit or implicit cyclical patterns. For each prominent frequency, SARC further integrates a Frequency-informed Reservoir Network (FreqRes) to adequately capture both sequential and cyclical dynamics, thereby deriving effective dynamic features. Synthesizing these features across various frequencies, SARC offers a multi-scale analysis of temporal dynamics and improves the modeling of long-term dependencies. Experiments on public datasets demonstrate that SARC achieves state-of-the-art results, while maintaining high efficiency compared to existing methods.
Paperid:2640
Authors:Shirley Wu · Michel Galley · Baolin Peng · Hao Cheng · Gavin Li · Yao Dou · Weixin Cai · James Zou · Jure Leskovec · Jianfeng Gao
Abstract:
Large Language Models are typically trained with next-turn rewards, limiting their ability to optimize for long-term interaction. As a result, they often respond passively to ambiguous or open-ended user requests, failing to help users reach their ultimate intents and leading to inefficient conversations. To address these limitations, we introduce CollabLLM, a novel and general training framework that enhances multiturn human-LLM collaboration. Its key innovation is a collaborative simulation that estimates the long-term contribution of responses using Multiturn-aware Rewards. By reinforcement fine-tuning on these rewards, CollabLLM goes beyond responding to user requests, and actively uncovers user intent and offers insightful suggestions, a key step towards more human-centered AI. We also devise a multiturn interaction benchmark with three challenging tasks, such as document creation. CollabLLM significantly outperforms our baselines, with on average 18.5% higher task performance and 46.3% improved interactivity as rated by LLM judges. Finally, we conduct a large user study with 201 judges, where CollabLLM increases user satisfaction by 17.6% and reduces the time users spend by 10.4%.
Paperid:2641
Authors:Zhaorun Chen · Mintong Kang · Bo Li
Abstract:
LLM agents have demonstrated wide adoption in various applications, leveraging their ability to access sensitive data and make autonomous decisions. However, they remain highly vulnerable to malicious instructions and attacks, which can result in severe consequences such as privacy breaches and financial losses. More critically, existing guardrails for LLMs are not applicable due to the complex and dynamic nature of LLM agents. To tackle these challenges, we propose ShieldAgent, the first LLM-based guardrail agent designed to enforce explicit safety policy compliance of the action sequences of other (multi-modal) LLM agents via automated probabilistic reasoning. Specifically, ShieldAgent constructs a structured safety model by extracting verifiable rules from policy documents and organizing them into action-conditioned probabilistic circuits. During inference, it first localizes relevant rule circuits w.r.t. the proposed action, then generates a shielding plan with operations supported by a rich tool library, and produces executable code for formal verification. Then it performs probabilistic inference to assign safety labels and report rule violations. Recognizing the lack of guardrail benchmarks for LLM agents, we introduce ShieldAgent-Web, a dataset with 2K safety-related instructions across 7 risk categories and 6 web environments, each paired with risky trajectories generated under two SOTA attacks. Experiments show that ShieldAgent achieves SOTA performance on ShieldAgent-Web and existing benchmarks, outperforming prior methods by 11.3% on average with a high rule recall of 90.1%. Additionally, ShieldAgent reduces API queries by 64.7% and inference time by 58.2%, demonstrating its high precision and efficiency in safeguarding LLM agents.
Paperid:2642
Authors:Jaeyeon Kim · Kulin Shah · Vasilis Kontonis · Sham Kakade · Sitan Chen
Abstract:
In recent years, masked diffusion models (MDMs) have emerged as a promising alternative approach for generative modeling over discrete domains. Compared to autoregressive models (ARMs), MDMs trade off complexity at training time with flexibility at inference time. At training time, they must learn to solve an exponentially large number of infilling problems, but at inference time, they can decode tokens in essentially arbitrary order. In this work we closely examine these two competing effects. On the training front, we theoretically and empirically demonstrate that MDMs indeed train on computationally intractable subproblems compared to their autoregressive counterparts. On the inference front, we show that a suitable strategy for adaptively choosing the token decoding order significantly enhances the capabilities of MDMs, allowing them to sidestep hard subproblems. On logic puzzles like Sudoku, we show that adaptive inference can boost solving accuracy in pretrained MDMs from $<7$\% to $\approx 90$\%, even outperforming ARMs that were explicitly trained via teacher forcing to learn the right order of decoding.
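The adaptive decoding-order idea can be sketched as follows, with a toy stand-in for the masked diffusion model. The confidence-greedy rule and the toy model are illustrative assumptions, not the paper's exact strategy.

```python
import numpy as np

MASK = -1

def adaptive_decode(probs_fn, seq):
    """Greedy adaptive order: repeatedly fill the masked position where
    the model is most confident, instead of a fixed left-to-right order.
    Re-querying after each fill lets easy constraints (e.g. forced
    Sudoku cells) resolve before hard ones."""
    seq = seq.copy()
    while (seq == MASK).any():
        probs = probs_fn(seq)              # (L, V) per-position probabilities
        masked = np.where(seq == MASK)[0]
        conf = probs[masked].max(axis=1)   # confidence at each masked slot
        pos = masked[conf.argmax()]        # most confident slot first
        seq[pos] = probs[pos].argmax()
    return seq

# Toy "model": position i prefers token i, with confidence growing in i
def toy_model(seq):
    L = V = len(seq)
    probs = np.full((L, V), 1.0 / V)
    for i in range(L):
        probs[i, i] += (i + 1) / (V + 1)
        probs[i] /= probs[i].sum()
    return probs

out = adaptive_decode(toy_model, np.full(5, MASK))
print(out)  # fills the most confident (rightmost) positions first
```

Here the toy model is static, so the order only changes which slot is committed first; in a real MDM each commitment reshapes the remaining conditional distribution.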
Paperid:2643
Authors:Haotian Fu · Yixiang Sun · Michael L. Littman · George Konidaris
Abstract:
We propose DRAGO, a novel approach for continual model-based reinforcement learning aimed at improving the incremental development of world models across a sequence of tasks that differ in their reward functions but not the state space or dynamics. DRAGO comprises two key components: Synthetic Experience Rehearsal, which leverages generative models to create synthetic experiences from past tasks, allowing the agent to reinforce previously learned dynamics without storing data, and Regaining Memories Through Exploration, which introduces an intrinsic reward mechanism to guide the agent toward revisiting relevant states from prior tasks. Together, these components enable the agent to maintain a comprehensive and continually developing world model, facilitating more effective learning and adaptation across diverse environments. Empirical evaluations demonstrate that DRAGO is able to preserve knowledge across tasks, achieving superior performance in various continual learning scenarios.
Paperid:2644
Authors:Cheng Jin · Chutao Liu · Zhenyu Xiao · Yuantao Gu
Abstract:
Classifier-free guidance (CFG) has emerged as a pivotal advancement in text-to-image latent diffusion models, establishing itself as a cornerstone technique for achieving high-quality image synthesis. However, under high guidance weights, where text-image alignment is significantly enhanced, CFG also leads to pronounced color distortions in the generated images. We identify that these distortions stem from the amplification of sample norms in the latent space. We present a theoretical framework that elucidates the mechanisms of norm amplification and anomalous diffusion phenomena induced by classifier-free guidance. Leveraging our theoretical insights and the latent space structure, we propose an Angle Domain Guidance (ADG) algorithm. ADG constrains magnitude variations while optimizing angular alignment, thereby mitigating color distortions while preserving the enhanced text-image alignment achieved at higher guidance weights. Experimental results demonstrate that ADG significantly outperforms existing methods, generating images that not only maintain superior text alignment but also exhibit improved color fidelity and better alignment with human perceptual preferences.
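A simplified sketch of separating magnitude from angle in guidance, as an illustration of the constraint described above rather than the paper's actual ADG algorithm: standard CFG extrapolates the noise prediction (which can inflate norms at high weights), while the angle-domain variant below keeps the guided direction but pins the norm to the conditional prediction's.

```python
import numpy as np

def cfg(eps_uncond, eps_cond, w):
    # Standard classifier-free guidance: extrapolate toward the condition
    return eps_uncond + w * (eps_cond - eps_uncond)

def angle_domain_sketch(eps_uncond, eps_cond, w):
    """Illustrative: apply guidance only to the direction (angle),
    constraining the magnitude to avoid norm amplification."""
    guided = cfg(eps_uncond, eps_cond, w)
    target_norm = np.linalg.norm(eps_cond)
    return guided * (target_norm / (np.linalg.norm(guided) + 1e-12))

rng = np.random.default_rng(0)
eu, ec = rng.normal(size=16), rng.normal(size=16)
g_cfg = cfg(eu, ec, w=7.5)
g_adg = angle_domain_sketch(eu, ec, w=7.5)
print(np.linalg.norm(g_cfg), np.linalg.norm(g_adg))
```

The two outputs share a direction; only the norm handling differs.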
Paperid:2645
Authors:Roman Bachmann · Jesse Allardice · David Mizrahi · Enrico Fini · Oguzhan Fatih Kar · Elmira Amirloo · Alaaeldin Ali · Amir Zamir · Afshin Dehghan
Abstract:
Image tokenization has enabled major advances in autoregressive image generation by providing compressed, discrete representations that are more efficient to process than raw pixels. While traditional approaches use 2D grid tokenization, recent methods like TiTok have shown that 1D tokenization can achieve high generation quality by eliminating grid redundancies. However, these methods typically use a fixed number of tokens and thus cannot adapt to an image's inherent complexity. We introduce FlexTok, a tokenizer that projects 2D images into variable-length, ordered 1D token sequences. For example, a 256x256 image can be resampled into anywhere from 1 to 256 discrete tokens, hierarchically and semantically compressing its information. By training a rectified flow model as the decoder and using nested dropout, FlexTok produces plausible reconstructions regardless of the chosen token sequence length. We evaluate our approach in an autoregressive generation setting using a simple GPT-style Transformer. On ImageNet, this approach achieves an FID<2 across 8 to 128 tokens, outperforming TiTok and matching state-of-the-art methods with far fewer tokens. We further extend the model to support text-conditioned image generation and examine how FlexTok relates to traditional 2D tokenization. A key finding is that FlexTok enables next-token prediction to describe images in a coarse-to-fine "visual vocabulary", and that the number of tokens to generate depends on the complexity of the generation task.
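The nested-dropout trick mentioned above can be sketched as follows (illustrative only; the function name and token representation are assumptions). Truncating to a random-length prefix during training forces earlier tokens to carry the coarse, global structure while later ones refine detail, which is what lets the decoder reconstruct from any chosen length.

```python
import numpy as np

def nested_dropout_prefix(tokens, rng):
    """Keep only a random-length prefix of the ordered 1D token sequence,
    so the tokenizer learns a coarse-to-fine token ordering."""
    k = int(rng.integers(1, len(tokens) + 1))  # sample a truncation length
    return tokens[:k]

rng = np.random.default_rng(0)
tokens = np.arange(256)                 # an ordered 1D token sequence
prefix = nested_dropout_prefix(tokens, rng)
print(len(prefix))                      # variable length, between 1 and 256
```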
Paperid:2646
Authors:Rui Yang · Hanyang (Jeremy) Chen · Junyu Zhang · Mark Zhao · Cheng Qian · Kangrui Wang · Qineng Wang · Teja Koripella · Marziyeh Movahedi · Manling Li · Heng Ji · Huan Zhang · Tong Zhang
Abstract:
Leveraging Multimodal Large Language Models (MLLMs) to create embodied agents offers a promising avenue for tackling real-world tasks. While language-centric embodied agents have garnered substantial attention, MLLM-based embodied agents remain underexplored due to the lack of comprehensive evaluation frameworks. To bridge this gap, we introduce EmbodiedBench, an extensive benchmark designed to evaluate vision-driven embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing tasks across four environments, ranging from high-level semantic tasks (e.g., household) to low-level tasks involving atomic actions (e.g., navigation and manipulation); and (2) six meticulously curated subsets evaluating essential agent capabilities like common sense reasoning, complex instruction following, spatial awareness, visual perception, and long-term planning. Through extensive experiments, we evaluate 13 leading proprietary and open-source MLLMs within EmbodiedBench. Our findings reveal that MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only $28.9\%$ on average. EmbodiedBench provides a standardized, multifaceted evaluation platform that not only highlights existing challenges but also offers valuable insights for advancing MLLM-based embodied agents.
Paperid:2647
Authors:Ankit Bhardwaj · Amar Phanishayee · Deepak Narayanan · Ryan Stutsman
Abstract:
In this paper, we investigate how to push the performance limits of serving Deep Neural Network (DNN) models on CPU-based servers. Specifically, we observe that while intra-operator parallelism across multiple threads is an effective way to reduce inference latency, it provides diminishing returns. Our primary insight is that instead of running a single instance of a model with all available threads on a server, running multiple instances each with smaller batch sizes and fewer threads for intra-op parallelism can provide lower inference latency. However, the right configuration is hard to determine manually since it is workload- (DNN model and batch size used by the serving system) and deployment-dependent (number of CPU cores on server). We present Packrat, a new serving system for online inference that, given a model and batch size ($B$), algorithmically picks the optimal number of instances ($i$), the number of threads each should be allocated ($t$), and the batch sizes each should operate on ($b$) that minimizes latency. Packrat is built as an extension to TorchServe and supports online reconfigurations to avoid serving downtime. Averaged across a range of batch sizes, Packrat improves inference latency by 1.43× to 1.83× on a range of commonly used DNNs.
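The configuration choice can be sketched as a brute-force enumeration over instance counts. This is illustrative only: Packrat's actual search algorithm and latency model are not given in the abstract, and the toy latency function below (diminishing returns from threads, parallel instances) is an assumption.

```python
def best_config(total_threads, batch_size, latency_fn):
    """Enumerate (instances i, threads-per-instance t, per-instance
    batch b) partitions of the server and pick the configuration that
    minimizes predicted latency."""
    best = None
    for i in range(1, total_threads + 1):
        t = total_threads // i          # threads left per instance
        if t == 0:
            break
        b = -(-batch_size // i)         # ceil division: split the batch
        lat = latency_fn(i, t, b)
        if best is None or lat < best[0]:
            best = (lat, i, t, b)
    return best

# Toy latency model: per-instance work shrinks sublinearly with threads
toy = lambda i, t, b: b * (1.0 / (t ** 0.5) + 0.05 * t)
print(best_config(16, 32, toy))
```

A real system would replace `toy` with profiled latencies for the actual DNN on the target server.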
Paperid:2648
Authors:Tianyu Gao · Alexander Wettig · Luxi He · Yihe Dong · Sadhika Malladi · Danqi Chen
Abstract:
The vast diversity of styles, domains, and quality levels present in language model pretraining corpora is essential in developing general model capabilities, but efficiently learning and deploying the correct behaviors exemplified in each of these heterogeneous data sources is challenging. To address this, we propose a new method, termed Metadata Conditioning then Cooldown (MeCo), to incorporate additional learning cues during pre-training. MeCo first provides metadata (e.g., URLs like en.wikipedia.org) alongside the text during training and later uses a cooldown phase with only the standard text, thereby enabling the model to function normally even without metadata. MeCo significantly accelerates pre-training across different model scales (600M to 8B parameters) and training sources (C4, RefinedWeb, and DCLM). For instance, a 1.6B language model trained with MeCo matches the downstream task performance of standard pre-training while using 33% less data. Additionally, MeCo enables us to steer language models by conditioning the inference prompt on either real or fabricated metadata that encodes the desired properties of the output: for example, prepending wikipedia.org to reduce harmful generations or factquizmaster.com (fabricated) to improve common knowledge task performance. We also demonstrate that MeCo is compatible with different types of metadata, such as model-generated topics. MeCo is remarkably simple, adds no computational overhead, and demonstrates promise in producing more capable and steerable language models.
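The metadata-conditioning recipe is simple enough to sketch as data preparation; the exact delimiter between metadata and text is an assumption not specified in the abstract.

```python
def meco_format(example, phase):
    """Prepend source metadata during the main pre-training phase and
    drop it during the cooldown phase, so the model also functions
    normally without metadata."""
    if phase == "pretrain":
        return f"{example['url']}\n\n{example['text']}"
    return example["text"]              # cooldown: plain text only

def steer_prompt(prompt, metadata):
    # Inference-time steering: condition on real or fabricated metadata
    return f"{metadata}\n\n{prompt}"

ex = {"url": "en.wikipedia.org", "text": "The Eiffel Tower is in Paris."}
print(meco_format(ex, "pretrain"))
print(steer_prompt("Who built the Eiffel Tower?", "wikipedia.org"))
```

Steering with a fabricated domain (e.g. factquizmaster.com) works the same way: the metadata string simply encodes the desired output properties.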
Paperid:2649
Authors:Junbin Liu · Wing-Kin Ma · Farzan Farnia
Abstract:
Multilayer matrix factorization (MMF) has recently emerged as a more general model of, and potentially a more powerful approach than, the classic matrix factorization. This paper considers MMF under a probabilistic formulation, and our focus is on inference methods under variational inference. The challenge in this context lies in determining a variational process that leads to a computationally efficient and accurate approximation of the maximum likelihood inference. One example is the variational autoencoder (VAE), which uses neural networks for the variational process. In this work, we take insight from variational diffusion models in the context of generative models to develop variational inference for MMF. We propose a dimension-reducing diffusion process that provides a new way to interact with the layered structures of the MMF model. Experimental results demonstrate that the proposed diffusion variational inference method yields better performance than some existing methods such as the VAE.
Paperid:2650
Authors:Dake Bu · Wei Huang · Andi Han · Atsushi Nitanda · Taiji Suzuki · Qingfu Zhang · Hau-San Wong
Abstract:
In-context learning (ICL) has garnered significant attention for its ability to grasp functions and tasks from demonstrations. Recent empirical studies suggest the presence of a Task/Function Vector in the latent geometry of large language models (LLMs) during ICL. Merullo et al. (2024) showed that LLMs leverage this vector alongside the residual stream for Word2Vec-like vector arithmetic, solving factual-recall ICL tasks. Additionally, recent work empirically highlighted the key role of Question-Answer data in enhancing factual-recall capabilities. Despite these insights, a theoretical explanation remains elusive. To move one step forward, this work provides a theoretical framework building on empirically grounded hierarchical concept modeling. We develop an optimization theory, showing how nonlinear residual transformers trained via gradient descent on cross-entropy loss perform factual-recall ICL tasks via task vector arithmetic. We prove 0-1 loss convergence and highlight their superior generalization capabilities, adeptly handling concept recombinations and data shifts. These findings underscore the advantages of transformers over static word embedding methods. Empirical simulations corroborate our theoretical insights.
Paperid:2651
Authors:MINGJIA YIN · Junwei Pan · Hao Wang · Ximei Wang · Shangyu Zhang · Jie Jiang · Defu Lian · Enhong Chen
Abstract:
Click-Through Rate (CTR) prediction models estimate the probability of users clicking on items based on feature interactions, inherently following a discriminative paradigm. However, this paradigm is prone to embedding dimensional collapse and information redundancy due to limitations of vanilla feature embeddings. This motivates us to reformulate it into a generative paradigm that generates new feature embeddings. Unlike sequential recommendation, which naturally fits a generative "next-item prediction" paradigm, it is hard to formulate CTR models in this paradigm without an explicit feature order. Therefore, we propose a novel Supervised Feature Generation framework for CTR models, shifting from the discriminative "feature interaction" paradigm to the generative "feature generation" paradigm. Specifically, we predict each feature embedding based on the concatenation of all feature embeddings. Besides, this paradigm naturally accommodates a supervised binary cross-entropy loss to indicate whether the sample is positive or negative. The framework can reformulate nearly every existing CTR model and brings significant performance lifts. Moreover, it produces less-collapsed and redundancy-reduced feature embeddings, thereby mitigating the inherent limitations of the discriminative paradigm.
Paperid:2652
Authors:Yushuai Li · Hengyu Liu · Torben Pedersen · Yuqiang He · Kim Larsen · Lu Chen · Christian Jensen · Jiachen Xu · TIANYI LI
Abstract:
The MuZero reinforcement learning method has achieved superhuman performance at games, and advances that enable MuZero to contend with complex actions now enable use of MuZero-class methods in real-world decision-making applications. However, some real-world applications are susceptible to state perturbations caused by malicious attacks and noisy sensors. To enhance the robustness of MuZero-class methods to state perturbations, we propose RobustZero, the first MuZero-class method that is $\underline{robust}$ to worst-case and random-case state perturbations, with $\underline{zero}$ prior knowledge of the environment’s dynamics. We present a training framework for RobustZero that features a self-supervised representation network, targeting the generation of a consistent initial hidden state, which is key to obtaining consistent policies before and after state perturbations, and a unique loss function that facilitates robustness. We present an adaptive adjustment mechanism to enable model updates, enhancing robustness to both worst-case and random-case state perturbations. Experiments on two classical control environments, three energy system environments, and three transportation environments demonstrate that RobustZero can outperform state-of-the-art methods at defending against state perturbations.
Paperid:2653
Authors:Jinbo Wang · Mingze Wang · Zhanpeng Zhou · Junchi Yan · Weinan E · Lei Wu
Abstract:
Transformers have become the cornerstone of modern AI. Unlike traditional architectures, transformers exhibit a distinctive characteristic: diverse types of building blocks, such as embedding layers, normalization layers, self-attention mechanisms, and point-wise feed-forward networks, work collaboratively. Understanding the disparities and interactions among these blocks is therefore important. In this paper, we uncover a clear **sharpness disparity** across these blocks, which intriguingly emerges early in training and persists throughout the training process. Building on this insight, we propose a novel **Blockwise Learning Rate (LR)** strategy to accelerate large language model (LLM) pre-training. Specifically, by integrating Blockwise LR into AdamW, we consistently achieve lower terminal loss and nearly $2\times$ speedup compared to vanilla AdamW. This improvement is demonstrated across GPT-2 and LLaMA models, with model sizes ranging from 0.12B to 1.1B and datasets including OpenWebText and MiniPile. Finally, we incorporate Blockwise LR into Adam-mini (Zhang et al., 2024), a recently proposed memory-efficient variant of Adam, achieving a combined $2\times$ speedup and $2\times$ memory savings. These results underscore the potential of leveraging the sharpness disparity principle to improve LLM training.
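Mechanically, a blockwise LR amounts to per-block optimizer parameter groups. A minimal sketch: the multipliers below are placeholders (the paper derives the appropriate ratios from the measured sharpness disparity, not from these values), and the naming convention in `block_of` is assumed.

```python
# Placeholder per-block LR multipliers -- not the paper's tuned values.
BLOCK_LR_SCALE = {
    "embedding": 1.0,   # assumed: flatter blocks tolerate the base LR
    "attention": 0.5,
    "mlp": 0.5,
    "norm": 0.25,       # assumed: sharper blocks get a smaller LR
}

def block_of(param_name):
    """Map a parameter name to a block type (naming convention assumed)."""
    for key in BLOCK_LR_SCALE:
        if key in param_name:
            return key
    return "mlp"  # fallback for unrecognized names

def build_param_groups(named_params, base_lr):
    """Return optimizer-style param groups, one per block type, each with
    its own scaled learning rate."""
    groups = {}
    for name, param in named_params:
        blk = block_of(name)
        g = groups.setdefault(blk, {"params": [], "lr": base_lr * BLOCK_LR_SCALE[blk]})
        g["params"].append(param)
    return list(groups.values())
```

The resulting list could be handed directly to an optimizer that accepts per-group learning rates, such as AdamW.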
Paperid:2654
Authors:Jinhang Chai · Elynn Chen · Lin Yang
Abstract:
To bridge the gap between empirical success and theoretical understanding in transfer reinforcement learning (RL), we study a principled approach with provable performance guarantees. We introduce a novel composite MDP framework where high-dimensional transition dynamics are modeled as the sum of a low-rank component representing shared structure and a sparse component capturing task-specific variations. This relaxes the common assumption of purely low-rank transition models, allowing for more realistic scenarios where tasks share core dynamics but maintain individual variations. We introduce UCB-TQL (Upper Confidence Bound Transfer Q-Learning), designed for transfer RL scenarios where multiple tasks share core linear MDP dynamics but diverge along sparse dimensions. When applying UCB-TQL to a target task after training on a source task with sufficient trajectories, we achieve a regret bound of $\tilde{\mathcal{O}}(\sqrt{eH^5N})$ that scales independently of the ambient dimension. Here, $N$ represents the number of trajectories in the target task, while $e$ quantifies the sparse differences between tasks. This result demonstrates substantial improvement over single task RL by effectively leveraging their structural similarities. Our theoretical analysis provides rigorous guarantees for how UCB-TQL simultaneously exploits shared dynamics while adapting to task-specific variations.
Paperid:2655
Authors:Toby Boyne · Jose Pablo Folch · Robert Lee · Behrang Shafei · Ruth Misener
Abstract:
We perform Bayesian optimization using a Gaussian process perspective on Bayesian Additive Regression Trees (BART). Our BART Kernel (BARK) uses tree agreement to define a posterior over piecewise-constant functions, and we explore the space of tree kernels using a Markov chain Monte Carlo approach. Where BART only samples functions, the resulting BARK model obtains samples of Gaussian processes defining distributions over functions, which allow us to build acquisition functions for Bayesian optimization. Our tree-based approach enables global optimization over the surrogate, even for mixed-feature spaces. Moreover, where many previous tree-based kernels provide uncertainty quantification over function values, our sampling scheme captures uncertainty over the tree structure itself. Our experiments show the strong performance of BARK on both synthetic and applied benchmarks, due to the combination of our fully Bayesian surrogate and the optimization procedure.
Paperid:2656
Authors:Francesco Bacchiocchi · Jiarui Gan · Matteo Castiglioni · Alberto Marchesi · Nicola Gatti
Abstract:
Principal-agent problems model scenarios where a principal aims at incentivizing an agent to take costly, unobservable actions through the provision of payments. Such interactions are ubiquitous in several real-world applications, ranging from blockchain to the delegation of machine learning tasks. In this paper, we initiate the study of hidden-action principal-agent problems under approximate best responses, in which the agent may select any action that is not too much suboptimal given the principal's payment scheme (a.k.a. contract). Our main result is a polynomial-time algorithm to compute an optimal contract under approximate best responses. This is perhaps surprising, as computing an optimal commitment under approximate best responses is known to be computationally intractable in Stackelberg games. We also investigate the learnability of contracts under approximate best responses, by providing a no-regret learning algorithm for a natural application scenario where the principal does not know anything about the environment.
Paperid:2657
Authors:Gaurush Hiranandani · Haolun Wu · Subhojyoti Mukherjee · Sanmi Koyejo
Abstract:
Many commercial Large Language Models (LLMs) are often closed-source, limiting developers to prompt tuning for aligning content generation with specific applications. While these models currently do not provide access to token logits, we argue that if such access were available, it would enable more powerful adaptation techniques beyond prompt engineering. In this paper, we propose a token-level probability reweighting framework that, given access to logits and a small amount of task-specific data, can effectively steer black-box LLMs toward application-specific content generation. Our approach views next-token prediction through the lens of supervised classification. We show that aligning black-box LLMs with task-specific data can be formulated as a label noise correction problem, leading to the \emph{Plugin} model -- an autoregressive probability reweighting model that operates solely on logits. We provide theoretical justification for why reweighting logits alone is sufficient for task adaptation. Extensive experiments with multiple datasets, LLMs, and reweighting models demonstrate the effectiveness of our method, advocating for broader access to token logits in closed-source models.
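The logit-reweighting idea can be illustrated in a few lines. A minimal sketch, assuming per-token log-weights have already been fit on task-specific data (the placeholders and function names below are illustrative, not from the paper):

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reweight_next_token(logits, log_weights):
    """Sketch of autoregressive probability reweighting: the black-box
    model's next-token logits are shifted by learned per-token log-weights
    before the softmax, i.e. p'(y|x) is proportional to p(y|x) * w(y)."""
    return softmax([l + w for l, w in zip(logits, log_weights)])
```

For example, a log-weight of log 2 on one token doubles its unnormalized probability mass relative to the base model, before renormalization.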
Paperid:2658
Authors:Shuo He · Zhifang Zhang · Feng Liu · Roy Lee · Bo An · Lei Feng
Abstract:
We present a comprehensive empirical study on how backdoor attacks affect CLIP by analyzing the representations of backdoor images. Specifically, based on the methodology of representation decomposing, image representations can be decomposed into a sum of representations across individual image patches, attention heads (AHs), and multi-layer perceptrons (MLPs) in different model layers. By examining the effect of backdoor attacks on model components, we have the following empirical findings. (1) Different backdoor attacks would infect different model components, i.e., local patch-based backdoor attacks mainly affect AHs, while global perturbation-based backdoor attacks mainly affect MLPs. (2) Infected AHs are centered on the last layer, while infected MLPs are decentralized on several late layers. (3) Not all AHs in the last layer are infected and even some AHs could still maintain the original property-specific roles (e.g., "color" and "location"). These observations motivate us to defend against backdoor attacks by detecting infected AHs, repairing their representations, or filtering backdoor samples with too many infected AHs, in the inference stage. Experimental results validate our empirical findings and demonstrate the effectiveness of the defense methods.
Paperid:2659
Authors:Ruiyu Wang · Yu Yuan · Shizhao Sun · Jiang Bian
Abstract:
Creating Computer-Aided Design (CAD) models requires significant expertise and effort. Text-to-CAD, which converts textual descriptions into CAD parametric sequences, is crucial in streamlining this process. Recent studies have utilized ground-truth parametric sequences, known as sequential signals, as supervision to achieve this goal. However, CAD models are inherently multimodal, comprising parametric sequences and corresponding rendered visual objects. Besides, the rendering process from parametric sequences to visual objects is many-to-one. Therefore, both sequential and visual signals are critical for effective training. In this work, we introduce CADFusion, a framework that uses Large Language Models (LLMs) as the backbone and alternates between two training stages: the sequential learning (SL) stage and the visual feedback (VF) stage. In the SL stage, we train LLMs using ground-truth parametric sequences, enabling the generation of logically coherent parametric sequences. In the VF stage, we reward parametric sequences that render into visually preferred objects and penalize those that do not, allowing LLMs to learn how rendered visual objects are perceived and evaluated. These two stages alternate throughout the training, ensuring balanced learning and preserving the benefits of both signals. Experiments demonstrate that CADFusion significantly improves performance, both qualitatively and quantitatively.
Paperid:2660
Authors:Xiaojun Xu · jinghan jia · Yuanshun Yao · Yang Liu · Hang Li
Abstract:
We propose an imperceptible multi-bit text watermark embedded by paraphrasing with LLMs. We fine-tune a pair of LLM paraphrasers that are designed to behave differently so that their paraphrasing difference, reflected in the text semantics, can be identified by a trained decoder. To embed our multi-bit watermark, we use the two paraphrasers alternately to encode a pre-defined binary code at the sentence level. Then we use a text classifier as the decoder to decode each bit of the watermark. Through extensive experiments, we show that our watermarks can achieve over 99.99% detection AUC with small (1.1B) text paraphrasers while keeping the semantic information of the original sentence. More importantly, our pipeline is robust under word substitution and sentence paraphrasing perturbations and generalizes well to out-of-distribution data. We also show the stealthiness of our watermark with LLM-based evaluation.
Paperid:2661
Authors:Chen Jin · Ryutaro Tanno · Amrutha Saseendran · Tom Diethe · Philip Teare
Abstract:
We introduce Lavender, a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging the image understanding of Stable Diffusion models. Lavender directly aligns the core transformer attention in VLMs with the attention maps used by Stable Diffusion during SFT, instead of adapting separate encoders. This precise alignment enriches the model’s visual understanding and substantially improves text generation quality. Lavender requires just 0.13 million training examples—2.5\% of typical large-scale SFT datasets—and fine-tunes on standard hardware (8 GPUs) in a single day. It consistently improves SoTA VLMs (e.g., Llama-3.2-11B, MiniCPM-Llama3-v2.5), achieving up to 30\% gains and a 68\% boost on challenging out-of-distribution tasks like WorldMedQA. By efficiently transferring the visual expertise of image generators with minimal supervision, Lavender provides a scalable solution for building more accurate and efficient vision-language systems. All code, training data, and pretrained models will be released to the community.
Paperid:2662
Authors:Tuan Dam · Pascal Stenger · Lukas Schneider · Joni Pajarinen · Carlo D'Eramo · Odalric-Ambrym Maillard
Abstract:
This paper introduces a novel backup strategy for Monte Carlo Tree Search (MCTS) tailored for highly stochastic and partially observable Markov decision processes. We adopt a probabilistic approach, modeling both value and action-value nodes as Gaussian distributions, to introduce a novel backup operator that computes value nodes as the Wasserstein barycenter of their action-value children nodes; thus, propagating the uncertainty of the estimate across the tree to the root node. We study our novel backup operator when using a novel combination of $L^1$-Wasserstein barycenter with $\alpha$-divergence, by drawing a crucial connection to the generalized mean backup operator. We complement our probabilistic backup operator with two sampling strategies, based on optimistic selection and Thompson sampling, obtaining our Wasserstein MCTS algorithm. We provide theoretical guarantees of asymptotic convergence of $\mathcal{O}(n^{-1/2})$, with $n$ as the number of visited trajectories, to the optimal policy and an empirical evaluation on several stochastic and partially observable environments, where our approach outperforms well-known related baselines.
Paperid:2663
Authors:Daiqing Wu · Dongbao Yang · Sicheng Zhao · Yu ZHOU · Can Ma
Abstract:
The advancements in Multimodal Large Language Models (MLLMs) have enabled various multimodal tasks to be addressed under a zero-shot paradigm. This paradigm sidesteps the cost of model fine-tuning, emerging as a dominant trend in practical application. Nevertheless, Multimodal Sentiment Analysis (MSA), a pivotal challenge in the quest for general artificial intelligence, fails to accommodate this convenience. The zero-shot paradigm exhibits undesirable performance on MSA, casting doubt on whether MLLMs can perceive sentiments as competently as supervised models. By extending the zero-shot paradigm to In-Context Learning (ICL) and conducting an in-depth study on configuring demonstrations, we validate that MLLMs indeed possess such capability. Specifically, three key factors that cover demonstrations' retrieval, presentation, and distribution are comprehensively investigated and optimized. A sentimental predictive bias inherent in MLLMs is also discovered and later effectively counteracted. By complementing each other, the devised strategies for three factors result in average accuracy improvements of 15.9% on six MSA datasets against the zero-shot paradigm and 11.2% against the random ICL baseline.
Paperid:2664
Authors:Ye Mao · Weixun Luo · Junpeng Jing · Anlan Qiu · Krystian Mikolajczyk
Abstract:
The rise of vision-language foundation models marks an advancement in bridging the gap between human and machine capabilities in 3D scene reasoning. Existing 3D reasoning benchmarks assume real-time scene accessibility, which is impractical due to the high cost of frequent scene updates. To this end, we introduce Hypothetical 3D Reasoning, namely Hypo3D, a benchmark designed to evaluate models' ability to reason without access to real-time scene data. Models need to imagine the scene state based on a provided change description before reasoning. Hypo3D is formulated as a 3D Visual Question Answering (VQA) benchmark, comprising 7,727 context changes across 700 indoor scenes, resulting in 14,885 question-answer pairs. An anchor-based world frame is established for all scenes, ensuring consistent reference to a global frame for directional terms in context changes and QAs. Extensive experiments show that state-of-the-art foundation models struggle to reason effectively in hypothetically changed scenes. This reveals a substantial performance gap compared to humans, particularly in scenarios involving movement changes and directional reasoning. Even when the context change is irrelevant to the question, models often incorrectly adjust their answers. The benchmark will be publicly released.
Paperid:2665
Authors:Haoyu Wei · Runzhe Wan · Lei Shi · Rui Song
Abstract:
Many real-world bandit applications are characterized by sparse rewards, which can significantly hinder learning efficiency. Leveraging problem-specific structures for careful distribution modeling is recognized as essential for improving estimation efficiency in statistics. However, this approach remains under-explored in the context of bandits. To address this gap, we initiate the study of zero-inflated bandits, where the reward is modeled using a classic semi-parametric distribution known as the zero-inflated distribution. We develop algorithms based on the Upper Confidence Bound and Thompson Sampling frameworks for this specific structure. The superior empirical performance of these methods is demonstrated through extensive numerical studies.
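The zero-inflated structure can be made concrete with a small UCB-style sketch: each arm's reward is modeled as Z · Y, with Z ~ Bernoulli(p) indicating whether the reward is nonzero and Y its nonzero magnitude, and the two parts are estimated separately. The bonus shape and constants below are generic UCB choices for illustration, not the paper's algorithm.

```python
import math
import random

class ZeroInflatedUCB:
    """Illustrative UCB rule for zero-inflated rewards: estimate the
    nonzero-probability p and the nonzero magnitude mu separately, and
    be optimistic about their product."""
    def __init__(self, n_arms):
        self.n = [0] * n_arms          # pulls per arm
        self.nz = [0] * n_arms         # nonzero rewards seen per arm
        self.sum_y = [0.0] * n_arms    # sum of nonzero magnitudes
        self.t = 0

    def select(self):
        self.t += 1
        for a in range(len(self.n)):
            if self.n[a] == 0:
                return a               # pull each arm once first
        def ucb(a):
            bonus = math.sqrt(2 * math.log(self.t) / self.n[a])
            p_hat = self.nz[a] / self.n[a]
            mu_hat = self.sum_y[a] / self.nz[a] if self.nz[a] else 0.0
            return p_hat * mu_hat + bonus
        return max(range(len(self.n)), key=ucb)

    def update(self, arm, reward):
        self.n[arm] += 1
        if reward != 0:
            self.nz[arm] += 1
            self.sum_y[arm] += reward
```

On a toy two-arm instance where one arm's nonzero-probability is much higher, the rule concentrates its pulls on the better arm.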
Paperid:2666
Authors:Yuchang Zhu · Huizhe Zhang · Bingzhe Wu · Jintang Li · Zibin Zheng · Peilin Zhao · Liang Chen · Yatao Bian
Abstract:
Large language models (LLMs) are widely adopted to generate synthetic datasets for various natural language processing (NLP) tasks, such as text classification and summarization. However, accurately measuring the diversity of these synthetic datasets—an aspect crucial for robust model performance—remains a significant challenge. In this paper, we introduce DCScore, a novel method for measuring synthetic dataset diversity from a classification perspective. Specifically, DCScore formulates diversity evaluation as a sample classification task, leveraging mutual relationships among samples. We further provide theoretical verification of the diversity-related axioms satisfied by DCScore, highlighting its role as a principled diversity evaluation method. Experimental results on synthetic datasets reveal that DCScore shows a stronger correlation with multiple diversity pseudo-truths of evaluated datasets, underscoring its effectiveness. Moreover, both empirical and theoretical evidence demonstrate that DCScore substantially reduces computational costs compared to existing approaches. Our code is available via: https://anonymous.4open.science/r/DCScore-ICML.
Paperid:2667
Authors:Bingyi Kang · Yang Yue · Rui Lu · Zhijie Lin · Yang Zhao · Kaixin Wang · Gao Huang · Jiashi Feng
Abstract:
Scaling video generation models is believed to be promising in building world models that adhere to fundamental physical laws. However, whether these models can discover physical laws purely from vision can be questioned. A world model learning the true law should give predictions robust to nuances and correctly extrapolate on unseen scenarios. In this work, we evaluate across three key scenarios: in-distribution, out-of-distribution, and combinatorial generalization. We developed a 2D simulation testbed for object movement and collisions to generate videos deterministically governed by one or more classical mechanics laws. We focus on the scaling behavior of training diffusion-based video generation models to predict object movements based on initial frames. Our scaling experiments show perfect generalization within the distribution, measurable scaling behavior for combinatorial generalization, but failure in out-of-distribution scenarios. Further experiments reveal two key insights about the generalization mechanisms of these models: (1) the models fail to abstract general physical rules and instead exhibit "case-based" generalization behavior, i.e., mimicking the closest training example; (2) when generalizing to new cases, models are observed to prioritize different factors when referencing training data: color $>$ size $>$ velocity $>$ shape. Our study suggests that scaling alone is insufficient for video generation models to uncover fundamental physical laws.
Paperid:2668
Authors:Dongxiu Liu · Haoyi Niu · Zhihao Wang · Jinliang Zheng · Yinan Zheng · Zhonghong Ou · Jianming HU · Jianxiong Li · Xianyuan Zhan
Abstract:
Current robotic planning methods often rely on predicting multi-frame images with full pixel details. While this fine-grained approach can serve as a generic world model, it introduces two significant challenges for downstream policy learning: substantial computational costs that hinder real-time deployment, and accumulated inaccuracies that can mislead action extraction. Planning with coarse-grained subgoals partially alleviates efficiency issues. However, their forward planning schemes can still result in off-task predictions due to accumulation errors, leading to misalignment with long-term goals. This raises a critical question: can robotic planning be both efficient and accurate enough for real-time control in long-horizon, multi-stage tasks? To address this, we propose a Backward Planning scheme in Latent space (LBP), which begins by grounding the task into final latent goals, followed by recursively predicting intermediate subgoals closer to the current state. The grounded final goal enables backward subgoal planning to always remain aware of task completion, facilitating on-task prediction along the entire planning horizon. The subgoal-conditioned policy incorporates a learnable token to summarize the subgoal sequences and determines how each subgoal guides action extraction. Through extensive simulation and real-robot long-horizon experiments, we show that LBP outperforms existing fine-grained and forward planning methods, achieving SOTA performance. Project page: https://lbp-authors.github.io.
Paperid:2669
Authors:Qi Yang · Chenghao Zhang · Lubin Fan · Kun Ding · Jieping Ye · Shiming Xiang
Abstract:
Recent advancements in Large Vision Language Models (LVLMs) have significantly improved performance in Visual Question Answering (VQA) tasks through multimodal Retrieval-Augmented Generation (RAG). However, existing methods still face challenges, such as the scarcity of knowledge with reasoning examples and erratic responses from retrieved knowledge. To address these issues, in this study, we propose a multimodal RAG framework, termed RCTS, which enhances LVLMs by constructing a Reasoning Context-enriched knowledge base and a Tree Search re-ranking method. Specifically, we introduce a self-consistent evaluation mechanism to enrich the knowledge base with intrinsic reasoning patterns. We further propose a Monte Carlo Tree Search with Heuristic Rewards (MCTS-HR) to prioritize the most relevant examples. This ensures that LVLMs can leverage high-quality contextual reasoning for better and more consistent responses. Extensive experiments demonstrate that our framework achieves state-of-the-art performance on multiple VQA datasets, significantly outperforming In-Context Learning (ICL) and Vanilla-RAG methods. It highlights the effectiveness of our knowledge base and re-ranking method in improving LVLMs.
Paperid:2670
Authors:Albert Gong · Kamilė Stankevičiūtė · Chao Wan · Anmol Kabra · Raphael Thesmar · Johann Lee · Julius Klenke · Carla Gomes · Kilian Weinberger
Abstract:
High-quality benchmarks are essential for evaluating reasoning and retrieval capabilities of large language models (LLMs). However, curating datasets for this purpose is not a permanent solution as they are prone to data leakage and inflated performance results. To address these challenges, we propose PhantomWiki: a pipeline to generate unique, factually consistent document corpora with diverse question-answer pairs. Unlike prior work, PhantomWiki is neither a fixed dataset, nor is it based on any existing data. Instead, a new PhantomWiki instance is generated on demand for each evaluation. We vary the question difficulty and corpus size to disentangle reasoning and retrieval capabilities, respectively, and find that PhantomWiki datasets are surprisingly challenging for frontier LLMs. Thus, we contribute a scalable and data leakage-resistant framework for disentangled evaluation of reasoning, retrieval, and tool-use abilities.
Paperid:2671
Authors:Kapilan Balagopalan · Tuan Nguyen · Yao Zhao · Kwang-Sung Jun
Abstract:
The best arm identification problem requires identifying the best alternative (i.e., arm) in active experimentation using the smallest number of experiments (i.e., arm pulls), which is crucial for cost-efficient and timely decision-making processes. In the fixed confidence setting, an algorithm must stop data-dependently and return the estimated best arm with a correctness guarantee. Since this stopping time is random, we desire its distribution to have light tails. Unfortunately, many existing studies focus on high probability or in expectation bounds on the stopping time, which allow heavy tails and, for high probability bounds, even never stopping at all. We first prove that this never-stopping event can indeed happen for some popular algorithms. Motivated by this, we propose algorithms that provably enjoy an exponential-tailed stopping time, which improves upon the polynomial tail bound reported by Kalyanakrishnan et al. (2012). The first algorithm is based on a fixed budget algorithm called Sequential Halving along with a doubling trick. The second algorithm is a meta algorithm that takes in any fixed confidence algorithm with a high probability stopping guarantee and turns it into one that enjoys an exponential-tailed stopping time. Our results imply that there is much more to be desired for contemporary fixed confidence algorithms.
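The first construction mentioned above, Sequential Halving wrapped in a doubling trick, can be sketched as follows. The agreement-based stopping rule here is a deliberate simplification for illustration; the paper's algorithm uses a proper correctness certificate, and all names are assumptions.

```python
import math

def sequential_halving(pull, n_arms, budget):
    """Fixed-budget Sequential Halving: pull the surviving arms equally,
    drop the worse half each round, until one arm remains."""
    arms = list(range(n_arms))
    rounds = math.ceil(math.log2(n_arms))
    for _ in range(rounds):
        per_arm = max(1, budget // (len(arms) * rounds))
        means = {a: sum(pull(a) for _ in range(per_arm)) / per_arm for a in arms}
        arms = sorted(arms, key=means.get, reverse=True)[: max(1, len(arms) // 2)]
    return arms[0]

def doubling_best_arm(pull, n_arms, n_repeats=5, budget0=64):
    """Doubling-trick sketch: rerun the fixed-budget routine with
    geometrically growing budgets, stopping once consecutive runs agree
    (an illustrative stand-in for the paper's stopping certificate)."""
    budget, last = budget0, None
    for _ in range(n_repeats):
        best = sequential_halving(pull, n_arms, budget)
        if best == last:
            return best
        last, budget = best, budget * 2
    return last
```

Because each restart doubles the budget, the total cost is dominated by the last run, which is the standard argument behind doubling-trick constructions.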
Paperid:2672
Authors:Ryan McKenna · Yangsibo Huang · Amer Sinha · Borja de Balle Pigem · Zachary Charles · Christopher A. Choquette Choo · Badih Ghazi · Georgios Kaissis · Ravi Kumar · Ruibo Liu · Da Yu · Chiyuan Zhang
Abstract:
Scaling laws have emerged as important components of large language model (LLM) training as they can predict performance gains through scale, and provide guidance on important hyperparameter choices that would otherwise be expensive. LLMs also rely on large, high-quality training datasets, like those sourced from (sometimes sensitive) user data. Training models on this sensitive user data requires careful privacy protections like differential privacy (DP). However, the dynamics of DP training are significantly different, and consequently their scaling laws are not yet fully understood. In this work, we establish scaling laws that accurately model the intricacies of DP LLM training, providing a complete picture of the compute-privacy-utility tradeoffs and the optimal training configurations in many settings.
Paperid:2673
Authors:Guofeng Ding · Yiding Lu · Peng Hu · Mouxing Yang · Yijie Lin · Xi Peng
Abstract:
Text-to-visual retrieval often struggles with semantic redundancy and granularity mismatches between textual queries and visual content. Unlike existing methods that address these challenges during training, we propose Visual Abstraction (VISA), a test-time approach that enhances retrieval by transforming visual content into textual descriptions using off-the-shelf large models. The generated text descriptions, with their dense semantics, naturally filter out low-level redundant visual information. To further address granularity issues, VISA incorporates a question-answering process, enhancing the text description with the specific granularity information requested by the user. Extensive experiments demonstrate that VISA achieves state-of-the-art performance in text-to-image and text-to-video retrieval across both short- and long-context queries, offering a plug-and-play enhancement to existing retrieval systems. The code will be publicly released on GitHub upon acceptance.
Paperid:2674
Authors:Xinyan Liang · Ruijie Sang · Yuhua Qian · Qian Guo · Feijiang Li · Liang Du
Abstract:
Automatic Modulation Classification (AMC) serves as a foundational pillar for cognitive radio systems, enabling critical functionalities including dynamic spectrum allocation, non-cooperative signal surveillance, and adaptive waveform optimization. However, practical deployment of AMC faces a fundamental challenge: prediction ambiguity arising from intrinsic similarity among modulation schemes and exacerbated under low signal-to-noise ratio (SNR) conditions. This phenomenon manifests as near-identical probability distributions across confusable modulation types, significantly degrading classification reliability. To address this, we propose Fuzzy Regularization-enhanced AMC (FR-AMC), a novel framework that integrates uncertainty quantification into the classification pipeline. The proposed FR has three features: (1) it explicitly models prediction ambiguity during backpropagation, (2) it dynamically reweights samples through adaptive loss scaling, and (3) it encourages margin maximization between confusable modulation clusters. Experimental results on benchmark datasets demonstrate that FR achieves superior classification accuracy and robustness compared to competing methods, making it a promising solution for real-world spectrum management and communication applications.
Paperid:2675
Authors:Emanuel Ben Baruch · Adam Botach · Igor Kviatkovsky · Manoj Aggarwal · Gerard Medioni
Abstract:
With the increasing size of datasets used for training neural networks, data pruning has gained traction in recent years. However, most current data pruning algorithms are limited in their ability to preserve accuracy compared to models trained on the full data, especially in high pruning regimes. In this paper, we explore the application of data pruning while incorporating knowledge distillation (KD) when training on a pruned subset. That is, rather than relying solely on ground-truth labels, we also use the soft predictions from a teacher network pre-trained on the complete data. By integrating KD into training, we demonstrate significant improvement across datasets, pruning methods, and all pruning fractions. We first establish a theoretical motivation for employing self-distillation to improve training on pruned data. Then, we empirically make a compelling and highly practical observation: using KD, simple random pruning is comparable or superior to sophisticated pruning methods across all pruning regimes. On ImageNet, for example, we achieve superior accuracy despite training on a random subset of only 50% of the data. Additionally, we demonstrate a crucial connection between the pruning factor and the optimal knowledge distillation weight. This helps mitigate the impact of samples with noisy labels and low-quality images retained by typical pruning algorithms. Finally, we make an intriguing observation: when using lower pruning fractions, larger teachers lead to accuracy degradation, while surprisingly, employing teachers with a smaller capacity than the student's may improve results. Our code will be made available.
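The training objective described above, hard-label cross-entropy combined with distillation against a teacher's soft predictions, can be sketched in a few lines of NumPy. This assumes the standard Hinton-style KD loss with temperature `T` and mixing weight `alpha`; the paper's exact weighting scheme may differ:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    """(1 - alpha) * cross-entropy on hard labels + alpha * T^2 * KL(teacher || student)."""
    n = len(labels)
    p_student = softmax(student_logits)  # temperature 1 for the hard-label CE term
    ce = -np.log(p_student[np.arange(n), labels] + 1e-12).mean()
    p_t = softmax(teacher_logits, T)     # softened teacher targets
    p_s = softmax(student_logits, T)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1).mean()
    return (1 - alpha) * ce + alpha * (T ** 2) * kl
```

With `alpha = 0` this reduces to ordinary supervised training on the pruned subset; the paper's observation is that a nonzero teacher term makes even random pruning competitive.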
Paperid:2676
Authors:Weiyu CHEN · James Kwok
Abstract:
Model merging, which combines multiple models into a single model, has gained popularity in recent years. By efficiently integrating the capabilities of various models, merging significantly reduces the parameter count and memory usage. However, current methods can only produce a single merged model. This necessitates a performance trade-off due to conflicts among the various models, and the resultant one-size-fits-all model may not align with the preferences of different users who may prioritize certain models over others. To address this issue, we propose preference-aware model merging, and formulate it as a multi-objective optimization problem in which the performance of the merged model on each base model's task is treated as an objective. In a single merging process, the proposed parameter-efficient structure generates a Pareto set of merged models, each representing a Pareto-optimal solution for a preference. Users can then select merged models tailored to their preferences from this learned Pareto set. Experimental results demonstrate that the proposed Pareto Merging produces diverse trade-off models and achieves higher test accuracy than state-of-the-art merging baselines.
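As a toy illustration of preference-conditioned merging, the simplest baseline is a convex combination of base-model parameters under a user preference vector. This sketch omits the paper's parameter-efficient Pareto-set structure entirely and only shows how a preference on the simplex selects one merged model:

```python
import numpy as np

def merge_models(param_list, preference):
    """Merge flattened parameter vectors by a user preference.

    param_list: list of 1-D arrays of identical shape, one per base model.
    preference: nonnegative weights, normalized here to sum to 1.
    """
    w = np.asarray(preference, dtype=float)
    w = w / w.sum()
    return sum(wi * p for wi, p in zip(w, param_list))
```

Sweeping `preference` over the simplex traces out a family of merged models; the paper's contribution is learning a structure that maps any such preference to a Pareto-optimal merge in a single training run, rather than naively interpolating weights as above.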
Paperid:2677
Authors:Yuan Xin · Dingfan Chen · Michael Backes · Xiao Zhang
Abstract:
As machine learning models are deployed in critical applications, robustness against adversarial perturbations is crucial. While numerous defensive algorithms have been proposed to counter such attacks, they typically assume that all adversarial transformations are equally important, an assumption that rarely aligns with real-world applications. To address this, we study the problem of robust learning against adversarial perturbations under cost-sensitive scenarios, where the potential harm of different types of misclassifications is encoded in a cost matrix. Our solution introduces a provably robust learning algorithm to certify and optimize for cost-sensitive robustness, building on the scalable certification framework of randomized smoothing. Specifically, we formalize the definition of the cost-sensitive certified radius, and propose a novel adaptation of the standard certification algorithm to generate tight robustness certificates tailored to any cost matrix. In addition, we design a robust training method that improves certified cost-sensitive robustness without compromising model accuracy. Extensive experiments on benchmark datasets, including challenging ones unsolvable by existing methods, demonstrate the effectiveness of our certification algorithm and training method across various cost-sensitive scenarios.
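The randomized-smoothing certificate this work builds on can be sketched as follows. This is a minimal version of the standard (cost-insensitive) certified radius of Cohen et al., where `p_a` and `p_b` are confidence bounds on the smoothed classifier's top-class and runner-up probabilities; the paper's cost-sensitive radius adapts this idea so that only misclassifications deemed costly by the cost matrix count against the certificate:

```python
from statistics import NormalDist

def certified_radius(p_a, p_b, sigma):
    """L2 certified radius of a Gaussian-smoothed classifier (standard form):
    sigma/2 * (Phi^-1(p_a) - Phi^-1(p_b)), where p_a lower-bounds the
    top-class probability and p_b upper-bounds the runner-up probability."""
    if p_a <= p_b:
        return 0.0  # no certificate can be issued
    inv = NormalDist().inv_cdf
    return sigma / 2.0 * (inv(p_a) - inv(p_b))
```

The radius grows with both the smoothing noise `sigma` and the gap between `p_a` and `p_b`; a cost-sensitive variant would take the maximum over `p_b` restricted to the costly target classes.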
Paperid:2678
Authors:Gaoyue Zhou · Hengkai Pan · Yann LeCun · Lerrel Pinto
Abstract:
The ability to predict future outcomes given control actions is fundamental for physical reasoning. However, such predictive models, often called world models, remain challenging to learn and are typically developed for task-specific solutions with online policy learning. To unlock world models' true potential, we argue that they should 1) be trainable on offline, pre-collected trajectories, 2) support test-time behavior optimization, and 3) facilitate task-agnostic reasoning. To this end, we present DINO World Model (DINO-WM), a new method to model visual dynamics without reconstructing the visual world. DINO-WM leverages spatial patch features pre-trained with DINOv2, enabling it to learn from offline behavioral trajectories by predicting future patch features. This allows DINO-WM to achieve observational goals through action sequence optimization, facilitating task-agnostic planning by treating goal features as prediction targets. We demonstrate that DINO-WM achieves zero-shot behavioral solutions at test time on six environments without expert demonstrations, reward modeling, or pre-learned inverse models, outperforming prior state-of-the-art work across diverse task families such as arbitrarily configured mazes, push manipulation with varied object shapes, and multi-particle scenarios.
Paperid:2679
Authors:Minghao Fu · Guo-Hua Wang · Liangfu Cao · Qing-Guo Chen · Zhao Xu · Weihua Luo · Kaifu Zhang
Abstract:
Diffusion models have emerged as a dominant approach for text-to-image generation. Key components such as human preference alignment and classifier-free guidance play a crucial role in ensuring generation quality. However, their independent application in current text-to-image models continues to face significant challenges in achieving strong text-image alignment, high generation quality, and consistency with human aesthetic standards. In this work, we, for the first time, explore facilitating the collaboration of human preference alignment and test-time sampling to unlock the potential of text-to-image models. Consequently, we introduce CHATS (Combining Human-Aligned optimization and Test-time Sampling), a novel generative framework that separately models the preferred and dispreferred distributions and employs a proxy-prompt-based sampling strategy to utilize the useful information contained in both distributions. We observe that CHATS exhibits exceptional data efficiency, achieving strong performance with only a small, high-quality fine-tuning dataset. Extensive experiments demonstrate that CHATS surpasses traditional preference alignment methods, setting a new state of the art across various standard benchmarks.
Paperid:2680
Authors:Rickmer Schulte · David Rügamer · Thomas Nagler
Abstract:
There is growing interest in extending average treatment effect (ATE) estimation to incorporate non-tabular data, such as images and text, which may act as sources of confounding. Neglecting these effects risks biased results and flawed scientific conclusions. However, non-tabular data cannot be handled using standard statistical methods, necessitating sophisticated feature extractors. In this work, we investigate how latent features from pre-trained neural networks can be leveraged to adjust for sources of confounding. We formalize conditions under which these latent features enable valid adjustment and statistical inference in ATE estimation, especially in the double machine learning context. We discuss critical challenges inherent to latent feature learning and downstream parameter estimation arising from the high dimensionality and non-identifiability of representations. Common structural assumptions for obtaining fast convergence rates with additive or sparse linear models are shown to be unrealistic for latent features. We argue, however, that neural networks are largely insensitive to these issues. In particular, we show that neural networks can achieve fast convergence rates by adapting to intrinsic notions of sparsity and dimension of the learning problem.
Paperid:2681
Authors:Qi Wang · Yuan Mi · Wang Haoyun · Yi Zhang · Ruizhi Chengze · Hongsheng Liu · Ji-Rong Wen · Hao Sun
Abstract:
Solving partial differential equations (PDEs) with numerical methods is computationally expensive when accurate solutions are needed, since fine grids and small time steps are required. Machine learning can accelerate this process, but struggles with weak generalizability, interpretability, and data dependency, and suffers in long-term prediction. To this end, we propose a PDE-embedded network with multiscale time stepping (MultiPDENet), which fuses numerical schemes with machine learning for accelerated simulation of flows. In particular, we design a convolutional filter based on the structure of finite-difference stencils, with a small number of parameters to optimize, which estimates the equivalent form of the spatial derivative on a coarse grid to minimize the equation's residual. A Physics Block with a 4th-order Runge-Kutta integrator at the fine time scale is established that embeds the structure of the PDEs to guide the prediction. To alleviate the curse of temporal error accumulation in long-term prediction, we introduce a multiscale time integration approach, where a neural network corrects the prediction error at a coarse time scale. Experiments across various PDE systems, including the Navier-Stokes equations, demonstrate that MultiPDENet can accurately predict long-term spatiotemporal dynamics, even given small and incomplete training data, e.g., spatiotemporally down-sampled datasets. MultiPDENet achieves state-of-the-art performance compared with other neural baseline models, with a clear speedup over classical numerical methods.
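The two numerical ingredients named above, a finite-difference stencil for spatial derivatives and a classical 4th-order Runge-Kutta time integrator, can be sketched on a 1D heat equation. This is a plain numerical illustration of the Physics Block's building blocks, not the learned MultiPDENet filter:

```python
import numpy as np

def laplacian_1d(u, dx):
    # second-order central finite-difference stencil, periodic boundary
    return (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2

def rk4_step(u, dt, rhs):
    # classical 4th-order Runge-Kutta time integrator
    k1 = rhs(u)
    k2 = rhs(u + 0.5 * dt * k1)
    k3 = rhs(u + 0.5 * dt * k2)
    k4 = rhs(u + dt * k3)
    return u + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# heat equation u_t = nu * u_xx on a periodic grid
nu, dx, dt = 0.1, 0.1, 0.01
x = np.arange(0.0, 1.0, dx)
u = np.sin(2 * np.pi * x)
for _ in range(100):
    u = rk4_step(u, dt, lambda v: nu * laplacian_1d(v, dx))
```

MultiPDENet replaces the fixed stencil coefficients with a small number of learnable parameters and wraps a corrective network around the coarse time scale; the hand-coded version above is the non-learned baseline it starts from.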
Paperid:2682
Authors:David Vigouroux · Joseba Dalmau · Louis Béthune · Victor Boutin
Abstract:
Although Artificial Neural Networks (ANNs) have achieved remarkable success across various tasks, they still suffer from limited generalization. We hypothesize that this limitation arises from the traditional sample-based (0-dimensional) regularization used in ANNs. To overcome this, we introduce Deep Sturm-Liouville (DSL), a novel function approximator that enables continuous 1D regularization along field lines in the input space by integrating the Sturm-Liouville Theorem (SLT) into the deep learning framework. DSL defines field lines traversing the input space, along which a Sturm-Liouville problem is solved to generate orthogonal basis functions, enforcing implicit regularization thanks to the desirable properties of SLT. These basis functions are linearly combined to construct the DSL approximator. Both the vector field and basis functions are parameterized by neural networks and learned jointly. We demonstrate that the DSL formulation naturally arises when solving a Rank-1 Parabolic Eigenvalue Problem. DSL is trained efficiently using stochastic gradient descent via implicit differentiation and achieves competitive performance on diverse multivariate datasets, including high-dimensional image datasets such as MNIST and CIFAR-10.
Paperid:2683
Authors:Sathvik Chereddy · John Femiani
Abstract:
We present SketchDNN, a generative model for synthesizing CAD sketches by jointly modeling continuous parameters and discrete class labels through a continuous-discrete diffusion process. A key innovation is our Gaussian-Softmax diffusion for discrete variables, which perturbs logits with Gaussian noise and projects them onto the probability simplex via a softmax transformation, enabling permutation-invariant denoising and superposition of class labels. This formulation addresses the challenges of heterogeneous primitive parameterizations and unordered primitives in CAD sketches. Our approach significantly improves generation quality, reducing Fréchet Inception Distance (FID) from 16.04 to 7.80 and negative log-likelihood (NLL) from 84.8 to 81.33, establishing a new state of the art in CAD sketch generation on the SketchGraphs dataset.
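The forward perturbation of the Gaussian-Softmax diffusion can be sketched directly from its description: add Gaussian noise to the class logits, then project onto the probability simplex via softmax. A minimal NumPy version (the noise schedule and the denoiser are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stability shift
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gaussian_softmax_noise(class_labels, num_classes, sigma):
    """Forward perturbation: one-hot logits + Gaussian noise, projected
    onto the probability simplex via softmax."""
    logits = np.eye(num_classes)[class_labels]  # one-hot logits, shape (n, num_classes)
    noisy = logits + sigma * rng.standard_normal(logits.shape)
    return softmax(noisy)
```

Because softmax maps any real vector to the simplex, the noised labels remain valid categorical distributions at every noise level, which is what permits superposition of class labels during denoising.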
Paperid:2684
Authors:Weichen Li · Albert Jan · Baishakhi Ray · Junfeng Yang · Chengzhi Mao · Kexin Pei
Abstract:
Code editing is a foundational task in software development, where its effectiveness depends on whether it introduces desired code property changes without changing the original code's intended functionality. Existing approaches often formulate code editing as an implicit end-to-end task, omitting the fact that code-editing procedures inherently consist of discrete and explicit steps, and thus suffer from suboptimal performance and a lack of robustness and generalization. We introduce EditLord, a code editing framework that makes the code transformation steps explicit. Our key insight is to employ a language model (LM) as an inductive learner to extract code editing rules from the training code pairs as concise meta-rule sets. Such rule sets are manifested for each training sample to augment it for fine-tuning, or to assist in prompting-based and iterative code editing. EditLord outperforms the state of the art by an average of 22.7% in editing performance and 58.1% in robustness, while achieving 20.2% higher functional correctness, across critical software engineering and security applications, LM models, and editing modes.
Paperid:2685
Authors:zhemin Li · Hongxia Wang · Liyuan Ma · Yaoyun Zeng · 晓龙 韩
Abstract:
Implicit Neural Representations (INRs) rely heavily on architectural choices for good generalization. Developing theoretically grounded approaches for architecture design remains an active area of research. Via theoretical analysis of the infinite-width limit, we establish a methodology that characterizes INR generalization by means of kernel alignment. We first formulate the optimal kernel that minimizes pointwise expected squared error, then demonstrate that the Neural Tangent Kernel of the composed function (INR with input encoding) can approximate any positive semidefinite dot-product kernel through input feature-mapping adjustments. Building upon these insights, we propose a Kernel Alignment Regularizer (KAR) that naturally integrates with existing INR systems to enhance kernel alignment. We further develop Plug-in Encoding for Aligned Kernels (PEAK) to refine INR models with KAR using learnable input encoding. This work contributes to the ongoing research efforts in bridging theory and practice for principled INR architecture design.
Paperid:2686
Authors:Filippo Lazzati · Alberto Maria Metelli
Abstract:
Although it is well-known that humans commonly engage in *risk-sensitive* behaviors in the presence of stochasticity, most Inverse Reinforcement Learning (IRL) models assume a *risk-neutral* agent. Such models thus $(i)$ introduce model misspecification and $(ii)$ do not permit direct inference of the risk attitude of the observed agent, which can be useful in many applications. In this paper, we propose a novel model of behavior to cope with these issues. By allowing for risk sensitivity, our model alleviates $(i)$, and by explicitly representing risk attitudes through (learnable) *utility* functions, it solves $(ii)$. Then, we characterize the partial identifiability of an agent's utility under the new model and note that demonstrations from multiple environments mitigate the problem. We devise two provably-efficient algorithms for learning utilities in a finite-data regime, and we conclude with some proof-of-concept experiments to validate *both* our model and our algorithms.
Paperid:2687
Authors:Xiangqi Li · Libo Huang · Zhulin An · Weilun Feng · Chuanguang Yang · Boyu Diao · Fei Wang · Yongjun Xu
Abstract:
3D few-shot class-incremental learning (FSCIL) aims to learn new point cloud categories from limited samples while preventing the forgetting of previously learned categories. This research area significantly enhances the capabilities of self-driving vehicles and computer vision systems. Existing 3D FSCIL approaches primarily utilize multimodal pre-trained models to extract semantic features, depending heavily on meticulously designed high-quality prompts and fine-tuning strategies. To reduce this dependence, this paper proposes a novel method for 3D FSCIL with Embedded Geometric features (3D-FLEG). Specifically, 3D-FLEG develops a point cloud geometric feature extraction module to capture category-related geometric characteristics. To address the modality heterogeneity issues that arise from integrating geometric and text features, 3D-FLEG introduces a geometric feature embedding module. By augmenting text prompts with spatial geometric features through these modules, 3D-FLEG can learn robust representations of new categories even with limited samples, while mitigating forgetting of the previously learned categories. Experiments conducted on several publicly available 3D point cloud datasets, including ModelNet, ShapeNet, ScanObjectNN, and CO3D, demonstrate 3D-FLEG's superiority over existing state-of-the-art 3D FSCIL methods.
Paperid:2688
Authors:Jiacheng Cheng · Xiwen Yao · Xiang Yuan · Junwei Han
Abstract:
The substantial computational demands of detection transformers (DETRs) hinder their deployment in resource-constrained scenarios, with the encoder consistently emerging as a critical bottleneck. A promising solution lies in reducing token redundancy within the encoder. However, existing methods perform static sparsification while ignoring the varying importance of tokens across different levels and encoder blocks for object detection, leading to suboptimal sparsification and performance degradation. In this paper, we propose Dynamic DETR (Dynamic token aggregation for DEtection TRansformers), a novel strategy that leverages inherent importance distribution to control token density and performs multi-level token sparsification. Within each stage, we apply a proximal aggregation paradigm for low-level tokens to maintain spatial integrity, and a holistic strategy for high-level tokens to capture broader contextual information. Furthermore, we propose center-distance regularization to align the distribution of tokens throughout the sparsification process, thereby facilitating representation consistency and effectively preserving critical object-specific patterns. Extensive experiments on canonical DETR models demonstrate that Dynamic DETR is broadly applicable across various models and consistently outperforms existing token sparsification methods.
Paperid:2689
Authors:Shuai Zhang · Chuan Zhou · Yang Liu · PENG ZHANG · Xixun Lin · Shirui Pan
Abstract:
Anomaly detection in continuous-time event sequences is an essential and challenging task in reliable and safety-critical applications. Existing approaches concentrate on developing a superior test statistic, but fail to provide guarantees on the false positive rate (FPR), thus lacking reliability in practical deployment. In this paper, we propose CADES (Conformal Anomaly Detection in Event Sequences), a novel test procedure based on conformal inference for the studied task with finite-sample FPR control. Specifically, by using the time-rescaling theorem, we design two powerful non-conformity scores tailored to event sequences, which exhibit complementary sensitivities to different abnormal patterns. CADES combines these scores with Bonferroni correction to leverage their respective strengths and addresses non-identifiability issues of existing approaches. Theoretically, we prove the validity of CADES and further provide strong guarantees on calibration-conditional FPR control. Experimental results on synthetic and real-world datasets, across various kinds of anomalies, demonstrate that CADES outperforms state-of-the-art methods while maintaining FPR control.
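The conformal machinery underlying CADES can be sketched generically: a split-conformal p-value for each non-conformity score, combined with a Bonferroni correction over the two scores. The time-rescaling-based scores themselves are omitted; any scalar non-conformity score (larger = more anomalous) can be plugged in:

```python
def conformal_p_value(cal_scores, test_score):
    """Split-conformal p-value: rank of the test non-conformity score
    among the calibration scores (larger score = more anomalous)."""
    n = len(cal_scores)
    ge = sum(1 for s in cal_scores if s >= test_score)
    return (1 + ge) / (n + 1)

def bonferroni_detect(cal_scores_1, cal_scores_2, s1, s2, alpha=0.05):
    """Flag an anomaly if either conformal p-value clears the
    Bonferroni-corrected level alpha/2, controlling the FPR at alpha."""
    p1 = conformal_p_value(cal_scores_1, s1)
    p2 = conformal_p_value(cal_scores_2, s2)
    return min(p1, p2) <= alpha / 2
```

The finite-sample FPR guarantee comes from exchangeability of the calibration and test scores under the null; the Bonferroni split lets the two scores cover complementary anomaly types without inflating the overall FPR.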
Paperid:2690
Authors:Jiayi Pan · Xingyao Wang · Graham Neubig · Navdeep Jaitly · Heng Ji · Alane Suhr · Yizhe Zhang
Abstract:
We present SWE-Gym, the first environment for training real-world software engineering (SWE) agents. SWE-Gym contains 2,438 real-world Python task instances, each comprising a codebase with an executable runtime environment, unit tests, and a task specified in natural language. We use SWE-Gym to train language-model-based SWE agents, achieving up to 19% absolute gains in resolve rate on the popular SWE-Bench Verified and Lite test sets. We also experiment with inference-time scaling through verifiers trained on agent trajectories sampled from SWE-Gym. When combined with our fine-tuned SWE agents, we achieve 32.0% and 26.0% on SWE-Bench Verified and Lite, respectively, reflecting a new state of the art for open-weight SWE agents. To facilitate further research, we publicly release SWE-Gym, models, and agent trajectories.
Paperid:2691
Authors:Xiaoyi Dong · Jian Cheng · Xi Zhang
Abstract:
The Soft Actor-Critic (SAC) algorithm with a Gaussian policy has become a mainstream implementation for realizing the Maximum Entropy Reinforcement Learning (MaxEnt RL) objective, which incorporates entropy maximization to encourage exploration and enhance policy robustness. While the Gaussian policy performs well on simpler tasks, its exploration capacity and potential performance in complex multi-goal RL environments are limited by its inherent unimodality. In this paper, we employ the diffusion model, a powerful generative model capable of capturing complex multimodal distributions, as the policy representation to fulfill the MaxEnt RL objective, developing a method named MaxEnt RL with Diffusion Policy (MaxEntDP). Our method enables efficient exploration and brings the policy closer to the optimal MaxEnt policy. Experimental results on Mujoco benchmarks show that MaxEntDP outperforms the Gaussian policy and other generative models within the MaxEnt RL framework, and performs comparably to other state-of-the-art diffusion-based online RL algorithms. Our code is available at https://github.com/diffusionyes/MaxEntDP.
Paperid:2692
Authors:Hansheng Chen · Kai Zhang · Hao Tan · Zexiang Xu · Fujun Luan · Leonidas Guibas · Gordon Wetzstein · Sai Bi
Abstract:
Diffusion models approximate the denoising distribution as a Gaussian and predict its mean, whereas flow matching models reparameterize the Gaussian mean as flow velocity. However, they underperform in few-step sampling due to discretization error and tend to produce over-saturated colors under classifier-free guidance (CFG). To address these limitations, we propose a novel Gaussian mixture flow matching (GMFlow) model: instead of predicting the mean, GMFlow predicts dynamic Gaussian mixture (GM) parameters to capture a multi-modal flow velocity distribution, which can be learned with a KL divergence loss. We demonstrate that GMFlow generalizes previous diffusion and flow matching models, where a single Gaussian is learned with an $L_2$ denoising loss. For inference, we derive GM-SDE/ODE solvers that leverage analytic denoising distributions and velocity fields for precise few-step sampling. Furthermore, we introduce a novel probabilistic guidance scheme that mitigates the over-saturation issues of CFG and improves image generation quality. Extensive experiments demonstrate that GMFlow consistently outperforms flow matching baselines in generation quality, achieving a Precision of 0.942 with only 6 sampling steps on ImageNet 256 $\times$ 256. We will release the code upon publication.
Paperid:2693
Authors:Guancheng Wan · Zewen Liu · Xiaojun Shan · Max Lau · B. Aditya Prakash · Wei Jin
Abstract:
Effective epidemic forecasting is critical for public health strategies and efficient medical resource allocation, especially in the face of rapidly spreading infectious diseases. However, existing deep learning methods often overlook the dynamic nature of epidemics and fail to account for the specific mechanisms of disease transmission. In response to these challenges, we introduce an innovative end-to-end framework called Epidemiology-Aware Neural ODE with Continuous Disease Transmission Graph (EARTH) in this paper. To learn continuous and regional disease transmission patterns, we first propose EANO, which seamlessly integrates the neural ODE approach with the epidemic mechanism, considering the complex spatial spread process during epidemic evolution. Additionally, we introduce GLTG to model global infection trends and leverage these signals to guide local transmission dynamically. To accommodate both the global coherence of epidemic trends and the local nuances of epidemic transmission patterns, we build a cross-attention approach to fuse the most meaningful information for forecasting. Through the smooth synergy of both components, EARTH offers a more robust and flexible approach to understanding and predicting the spread of infectious diseases. Extensive experiments show EARTH's superior performance in forecasting real-world epidemics compared to state-of-the-art methods. The code is available at https://anonymous.4open.science/r/EARTH.
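The epidemic mechanism that EANO embeds into a neural ODE can be illustrated with the classic SIR compartmental model, integrated here with a plain explicit Euler step; EARTH itself learns the transmission dynamics rather than fixing `beta` and `gamma` as below:

```python
import numpy as np

def sir_rhs(state, beta, gamma):
    # classic SIR mechanism: S' = -beta*S*I, I' = beta*S*I - gamma*I, R' = gamma*I
    S, I, R = state
    new_inf = beta * S * I
    return np.array([-new_inf, new_inf - gamma * I, gamma * I])

def simulate_sir(S0, I0, R0, beta, gamma, dt=0.1, steps=1000):
    state = np.array([S0, I0, R0], dtype=float)
    traj = [state.copy()]
    for _ in range(steps):
        state = state + dt * sir_rhs(state, beta, gamma)  # explicit Euler step
        traj.append(state.copy())
    return np.array(traj)

traj = simulate_sir(S0=0.99, I0=0.01, R0=0.0, beta=0.5, gamma=0.1)
```

A neural-ODE variant replaces `sir_rhs` (or parts of it) with a learned network while keeping the compartmental structure, which is the kind of mechanism-aware modeling the abstract describes.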
Paperid:2694
Authors:Simone Bombari · Marco Mondelli
Abstract:
Learning models have been shown to rely on spurious correlations between non-predictive features and the associated labels in the training data, with negative implications on robustness, bias, and fairness. In this work, we provide a statistical characterization of this phenomenon for high-dimensional regression, when the data contains a predictive *core* feature $x$ and a *spurious* feature $y$. Specifically, we quantify the amount of spurious correlations $\mathcal C$ learned via linear regression, in terms of the data covariance and the strength $\lambda$ of the ridge regularization. As a consequence, we first capture the simplicity of $y$ through the spectrum of its covariance, and its correlation with $x$ through the Schur complement of the full data covariance. Next, we prove a trade-off between $\mathcal C$ and the in-distribution test loss $\mathcal L$, by showing that the value of $\lambda$ that minimizes $\mathcal L$ lies in an interval where $\mathcal C$ is increasing. Finally, we investigate the effects of over-parameterization via the random features model, by showing its equivalence to regularized linear regression. Our theoretical results are supported by numerical experiments on Gaussian, Color-MNIST, and CIFAR-10 datasets.
Paperid:2695
Authors:Ruinan Jin · Xiao Li · Yaoliang Yu · Baoxiang Wang
Abstract:
Adaptive moment estimation (Adam) is a cornerstone optimization algorithm in deep learning, widely recognized for its flexibility with adaptive learning rates and efficiency in handling large-scale data. However, despite its practical success, the theoretical understanding of Adam's convergence has been constrained by stringent assumptions, such as almost surely bounded stochastic gradients or uniformly bounded gradients, which are more restrictive than those typically required for analyzing stochastic gradient descent (SGD). In this paper, we introduce a novel and comprehensive framework for analyzing the convergence properties of Adam. This framework offers a versatile approach to establishing Adam's convergence. Specifically, we prove that Adam achieves asymptotic (last-iterate) convergence in both the almost sure sense and the $L_1$ sense under the relaxed assumptions typically used for SGD, namely $L$-smoothness and the ABC inequality. Meanwhile, under the same assumptions, we show that Adam attains non-asymptotic sample complexity bounds similar to those of SGD.
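For reference, the Adam iteration analyzed here is the standard one: exponential moving averages of the gradient and its square, with bias correction. A minimal NumPy sketch on a toy quadratic:

```python
import numpy as np

def adam_minimize(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Plain Adam: first/second moment estimates with bias correction."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)  # first-moment (mean) estimate
    v = np.zeros_like(x)  # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

# minimize f(x) = ||x - 3||^2, whose gradient is 2 * (x - 3)
x_star = adam_minimize(lambda x: 2 * (x - 3.0), np.zeros(2))
```

The convergence results in the paper concern exactly this update under stochastic gradients, with $L$-smoothness and the ABC inequality replacing the bounded-gradient assumptions of earlier analyses.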
Paperid:2696
Authors:Seokhun Park · Insung Kong · yongchan Choi · Chanmoo Park · Yongdai Kim
Abstract:
Interpretability for machine learning models is becoming more and more important as machine learning models become more complex. The functional ANOVA model, which decomposes a high-dimensional function into a sum of lower-dimensional functions (commonly referred to as components), is one of the most popular tools for interpretable AI, and recently, various neural networks have been developed for estimating each component in the functional ANOVA model. However, such neural networks are highly unstable when estimating each component since the components themselves are not uniquely defined. That is, there are multiple functional ANOVA decompositions for a given function. In this paper, we propose a novel neural network which guarantees a unique functional ANOVA decomposition and thus is able to estimate each component stably and accurately. We call our proposed neural network ANOVA Tensor Product Neural Network (ANOVA-TPNN), since it is motivated by the tensor product basis expansion. Theoretically, we prove that ANOVA-TPNN can approximate any smooth function well. Empirically, we show that ANOVA-TPNN provides much more stable estimation of each component, and thus much more stable interpretation, than existing neural networks when training data and initial values of the model parameters vary.
Paperid:2697
Authors:Chungpa Lee · Sehee Lim · Kibok Lee · Jy-yong Sohn
Abstract:
Contrastive Learning (CL) has emerged as a powerful self-supervised learning framework that learns representations by bringing similar views of the same instance (positive pairs) closer while pushing views of different instances (negative pairs) apart in the embedding space. This paper analyzes existing CL objectives through the lens of embedding pair similarities. In full-batch CL settings, we theoretically prove that when negative-pair similarities fall below a certain threshold, the perfect alignment of positive pairs cannot be achieved. We mathematically show that this misalignment issue, present in a class of CL losses, can be mitigated by incorporating within-view negative pairs that are typically overlooked. Furthermore, our analysis in mini-batch training scenarios reveals that negative pairs within the same mini-batch exhibit stronger separation compared to pairs across different batches, increasing the variance of negative-pair similarities as batch sizes decrease. To address this issue, we introduce the Variance-Reduction for Negative-pair similarity (VRN) loss term. Our empirical results show that integrating VRN into existing CL loss functions improves the performance of CL methods, especially in small-batch training scenarios.
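A toy version of the idea can be sketched as an InfoNCE-style loss plus a penalty on the variance of negative-pair similarities. The exact form of the paper's VRN term is not given in the abstract, so the `lam * neg.var()` penalty below is an assumption for illustration only:

```python
import numpy as np

def normalize(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def info_nce_with_vrn(z1, z2, tau=0.5, lam=0.1):
    """InfoNCE over a batch of positive pairs (z1[i], z2[i]) plus a
    variance penalty on negative-pair similarities (toy VRN-style term)."""
    z1, z2 = normalize(z1), normalize(z2)
    sim = z1 @ z2.T / tau                    # (n, n) similarity logits
    n = sim.shape[0]
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    nce = (logsumexp - np.diag(sim)).mean()  # -log softmax of the positives
    neg = sim[~np.eye(n, dtype=bool)]        # off-diagonal entries = negative pairs
    return nce + lam * neg.var()
```

Shrinking the spread of negative-pair similarities is the mechanism the abstract attributes to VRN for small batches, where that variance is largest.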
Paperid:2698
Authors:Yilun Zhou · Austin Xu · PeiFeng Wang · Caiming Xiong · Shafiq Joty
Abstract:
Scaling test-time computation, or affording a generator large language model (LLM) extra compute during inference, typically employs the help of external non-generative evaluators (i.e., reward models). Concurrently, LLM-judges, models trained to generate evaluations and critiques (explanations) in natural language, are becoming increasingly popular in automatic evaluation. Despite judges' empirical successes, their effectiveness as evaluators in test-time scaling settings is largely unknown. In this paper, we introduce the Judge Evaluation for Test-Time Scaling (JETTS) benchmark, which evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings: response reranking, step-level beam search, and critique-based response refinement. We evaluate 10 different judge models (7B-70B parameters) for 8 different base generator models (6.7B-72B parameters). Our benchmark shows that while judges are competitive with outcome reward models in reranking, they are consistently worse than process reward models in beam search procedures. Furthermore, although natural language critiques are unique to LLM-judges, they are currently ineffective in guiding the generator towards better responses. We will open-source JETTS and maintain a leaderboard website after the anonymous review period.
Paperid:2699
Authors:Xuening Feng · Zhaohui Jiang · Timo Kaufmann · Eyke Hüllermeier · Paul Weng · Yifei Zhu
Abstract:
Learning human objectives from preference feedback has significantly advanced reinforcement learning (RL) in domains with hard-to-formalize objectives. However, traditional methods based on pairwise trajectory comparisons face notable challenges, including the difficulty of comparing trajectories with subtle differences and the limitation of conveying only ordinal information, which precludes direct inference of preference strength. In this paper, we introduce a novel distinguishability query, allowing humans to express preference strength by comparing two pairs of trajectories. Labelers first indicate which pair is easier to compare, then provide preference feedback only on the easier pair. Our proposed query type directly captures preference strength and is expected to reduce the cognitive load on the labeler. We further connect this query to cardinal utility and difference relations, and develop an efficient query selection scheme to achieve a better trade-off between query informativeness and ease of labeling. Experimental results demonstrate the potential of our method for faster, data-efficient learning and improved user-friendliness in RLHF benchmarks, particularly in classical control settings where preference strength is critical for expected utility maximization.
Paperid:2700
Authors:Niclas Dern · John Cunningham · Geoff Pleiss
Abstract:
Classic ensembles generalize better than any single component model. In contrast, recent empirical studies find that modern ensembles of (overparameterized) neural networks may not provide any inherent generalization advantage over single, larger neural networks. This paper clarifies how modern overparameterized ensembles differ from their classic underparameterized counterparts, using ensembles of random feature (RF) regressors as a basis for developing theory. In contrast to the underparameterized regime, where ensembling typically induces regularization and increases generalization, we prove with minimal assumptions that infinite ensembles of overparameterized RF regressors become pointwise equivalent to (single) infinite-width RF regressors, and finite-width ensembles rapidly converge to single models with the same parameter budget. These results, which are exact for ridgeless models and approximate for small ridge penalties, imply that overparameterized ensembles and single large models exhibit nearly identical generalization. We further characterize the predictive variance amongst ensemble members, demonstrating that it quantifies the expected effects of increasing capacity rather than capturing any conventional notion of uncertainty. Our results challenge common assumptions about the advantages of ensembles in overparameterized settings, prompting a reconsideration of how well intuitions from underparameterized ensembles transfer to deep ensembles and the overparameterized regime.
Paperid:2701
Authors:Gaole Dai · Chun-Kai Fan · Yiming Tang · Zhi Zhang · Yuan Zhang · Yulu Gan · Qizhe Zhang · Cheng-Ching Tseng · Shanghang Zhang · Tiejun Huang
Abstract:
Advances in Parameter-efficient Fine-tuning (PEFT) bridged the performance gap with Full Fine-Tuning (FFT) through sophisticated analysis of pre-trained parameter spaces. Drawing insights from Neural Engrams (NE) in Biological Neural Networks (BNNs), we establish a connection between the low-rank property observed during PEFT's parameter space shifting and neurobiological mechanisms. This observation leads to our proposed method, Synapse and Neuron (SAN), which decomposes and propagates the scaling component from anterior feature adjustment vectors towards posterior weight matrices. Our approach is theoretically grounded in Long-Term Potentiation/Depression (LTP/D) phenomena, which govern synapse development through neurotransmitter release modulation. Extensive experiments demonstrate its effectiveness: on vision tasks across VTAB, FGVC, and GIC (25 datasets) using ViT, Swin-T and ConvNeXt architectures, SAN outperforms FFT by up to 8.7% and LoRA by 3.2%; on language tasks using Commonsense Reasoning (8 datasets) with LLaMA models (all generations), it surpasses ChatGPT by up to 8.5% and LoRA by 4.7%; on vision-language tasks using Visual Instruction Tuning (7 datasets) with LLaVA models, it exceeds FFT by up to 2.4% and LoRA by 1.9%. Our code and W&B logs will be released.
Paperid:2702
Authors:Zhichen Dong · Zhanhui Zhou · Zhixuan Liu · Chao Yang · Chaochao Lu
Abstract:
In this work, we argue that large language models (LLMs), though trained to predict only the next token, exhibit emergent planning behaviors: $\textbf{their hidden representations encode future outputs beyond the next token}$. Through simple probing, we demonstrate that LLM prompt representations encode global attributes of their entire responses, including $\textit{structural attributes}$ (response length, reasoning steps), $\textit{content attributes}$ (character choices in story writing, multiple-choice answers at the end of a response), and $\textit{behavioral attributes}$ (answer confidence, factual consistency). In addition to identifying response planning, we explore how it scales with model size across tasks and how it evolves during generation. The finding that LLMs plan ahead in their hidden representations suggests potential applications for improving transparency and generation control.
Paperid:2703
Authors:Hayden McTavish · Zachery Boner · Jon Donnelly · Margo Seltzer · Cynthia Rudin
Abstract:
Decision trees are widely used for interpretable machine learning due to their clearly structured reasoning process. However, this structure belies a challenge we refer to as predictive equivalence: a given tree's decision boundary can be represented by many different decision trees. The presence of models with identical decision boundaries but different evaluation processes makes model selection challenging. The models will have different variable importance and behave differently in the presence of missing values, but most optimization procedures will arbitrarily choose one such model to return. We present a Boolean logical representation of decision trees that does not exhibit predictive equivalence and is faithful to the underlying decision boundary. We apply our representation to several downstream machine learning tasks. Using our representation, we show that decision trees are surprisingly robust to test-time missingness of feature values; we address predictive equivalence's impact on quantifying variable importance; and we present an algorithm to optimize the cost of reaching predictions.
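Predictive equivalence is easy to exhibit with a hypothetical toy example (not from the paper): the two trees below evaluate features in different orders yet realize the same Boolean decision boundary, x1 OR x2.

```python
def tree_a(x1, x2):
    # splits on x1 first
    if x1:
        return 1
    return 1 if x2 else 0

def tree_b(x1, x2):
    # splits on x2 first; different structure, same decision boundary
    if x2:
        return 1
    return 1 if x1 else 0

def boolean_form(x1, x2):
    # a single Boolean representation, x1 OR x2, shared by both trees
    return 1 if (x1 or x2) else 0
```

A split-based variable importance would credit x1 more in `tree_a` and x2 more in `tree_b`, and if x1 is missing, `tree_a` stalls at its root while `tree_b` may still decide when x2 = 1: precisely the ambiguities a canonical Boolean representation removes.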
Paperid:2704
Authors:Jiaxing Cui · Wei-Lin Chiang · Ion Stoica · Cho-Jui Hsieh
Abstract:
Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. While significant research focuses on mitigating harmful content generation, the enhanced safety often comes with the side effect of over-refusal, where LLMs may reject innocuous prompts and become less helpful. Although the issue of over-refusal has been empirically observed, systematic measurement is challenging due to the difficulty of crafting prompts that can elicit the over-refusal behaviors of LLMs. This study proposes a novel method for automatically generating large-scale over-refusal datasets. Leveraging this technique, we introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 over-refusal prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses. We then conduct a comprehensive study measuring the over-refusal of 32 popular LLMs across 8 model families. We hope this benchmark can help the community develop better safety-aligned models.
Paperid:2705
Authors:David Clancy · Hanbaek Lyu · Sebastien Roch
Abstract:
We consider the branch-length estimation problem on a bifurcating tree: a character evolves along the edges of a binary tree according to a two-state symmetric Markov process, and we seek to recover the edge transition probabilities from repeated observations of the leaves. This problem arises in phylogenetics and is related to latent tree graphical model inference. In general, the log-likelihood function is non-concave and admits exponentially many critical points in the size of the tree. However, simple coordinate maximization has been known to perform very well in practice, defying the complexity of the likelihood landscape. In this work, we provide the first theoretical guarantee as to why this can be the case. We show that, provided polynomially many samples $m$, there exists a universal parameter regime (independent of the size and the topology of the tree) where the log-likelihood function is strongly concave and smooth with high probability. On this high-probability likelihood landscape event, we show that the standard coordinate maximization algorithm converges exponentially fast to the maximum likelihood estimator, which is within $O(1/\sqrt{m})$ of the true parameter.
Paperid:2706
Authors:Ruofeng Yang · Bo Jiang · Cheng Chen · Shuai Li
Abstract:
Consistency models, a new class of one-step generative models, have shown competitive performance with multi-step diffusion models. The most challenging part of consistency models is the training process, which discretizes the continuous diffusion process into $K$ steps and trains a one-step mapping function on these discretized timepoints. Despite the empirical success, only a few works focus on the discretization complexity $K$, and their setting is far from that of empirical works. More specifically, current theoretical works analyze the variance preserving (VP) diffusion process with a uniform stepsize, while empirical works adopt a variance exploding (VE) process with a decaying discretization stepsize. As a result, these works suffer from large discretization complexity and fail to explain the empirical success of consistency models. To close the gap between theory and application, we analyze consistency models with (1) a VE process and (2) a decaying stepsize, and prove a state-of-the-art discretization complexity for consistency models. This result is competitive with the results for diffusion models and shows the potential of consistency models. To balance computation and performance, \citet{song2023consistency} further propose a $2$-step consistency algorithm, which is widely adopted by empirical works. In this work, we also analyze the role of $2$-step sampling and show that it improves the discretization complexity compared with one-step generation.
Paperid:2707
Authors:Haofei Yu · Zhaochen Hong · Zirui Cheng · Kunlun Zhu · Keyang Xuan · Jinwei Yao · Tao Feng · Jiaxuan You
Abstract:
Large Language Models (LLMs) have demonstrated remarkable potential in scientific domains, yet a fundamental question remains unanswered: Can we simulate human research communities with LLMs? Addressing this question can deepen our understanding of the processes behind idea brainstorming and inspire the automatic discovery of novel scientific insights. In this work, we propose ResearchTown, a multi-agent framework for research community simulation. Within this framework, the human research community is simplified as an agent-data graph, where researchers and papers are represented as agent-type and data-type nodes, respectively, and connected based on their collaboration relationships. We also introduce TextGNN, a text-based inference framework that models various research activities (e.g., paper reading, paper writing, and review writing) as special forms of a unified message-passing process on the agent-data graph. To evaluate the quality of the research simulation, we present ResearchBench, a benchmark that uses a node-masking prediction task for scalable and objective assessment based on similarity. Our experiments reveal three key findings: (1) ResearchTown can provide a realistic simulation of collaborative research activities, including paper writing and review writing; (2) ResearchTown can maintain robust simulation with multiple researchers and diverse papers; (3) ResearchTown can generate interdisciplinary research ideas that potentially inspire pioneering research directions.
Paperid:2708
Authors:Shuhai Zhang · Zeng You · Yaofo Chen · Zhiquan Wen · Qianyue Wang · Zhijie Qiu · Yuanqing Li · Mingkui Tan
Abstract:
Transformer-based large language models (LLMs) excel in natural language processing tasks by capturing long-range dependencies through self-attention mechanisms. However, long-context modeling faces significant computational inefficiencies due to redundant attention computations: while attention weights are often sparse, all tokens consume equal computational resources. In this paper, we reformulate traditional probabilistic sequence modeling as a supervised learning task, enabling the separation of relevant and irrelevant tokens and providing a clearer understanding of redundancy. Based on this reformulation, we theoretically analyze attention sparsity, revealing that only a few tokens significantly contribute to predictions. Building on this, we formulate attention optimization as a linear coding problem and propose a group coding strategy, theoretically showing its ability to improve robustness against random noise and enhance learning efficiency. Motivated by this, we propose Dynamic Group Attention (DGA), which leverages the group coding to explicitly reduce redundancy by aggregating less important tokens during attention computation. Empirical results show that our DGA significantly reduces computational costs while maintaining competitive performance.
Paperid:2709
Authors:Freddie Bickford Smith · Jannik Kossen · Eleanor Trollope · Mark van der Wilk · Adam Foster · Tom Rainforth
Abstract:
The ideas of aleatoric and epistemic uncertainty are widely used to reason about the probabilistic predictions of machine-learning models. We identify incoherence in existing discussions of these ideas and suggest this stems from the aleatoric-epistemic view being insufficiently expressive to capture all of the distinct quantities that researchers are interested in. To address this, we present a decision-theoretic view on prediction, from which we derive rigorous notions of model-based uncertainty and statistical dispersion in data. Using this in place of the aleatoric-epistemic view could produce clearer discourse as the field moves forward. On top of this, we provide insights into popular information-theoretic quantities, showing they can be poor estimators of what they are often purported to measure, while also explaining how they can still be useful in guiding data acquisition.
Paperid:2710
Authors:Peer Nagy · Sascha Frey · Kang Li · Bidipta Sarkar · Svitlana Vyetrenko · Stefan Zohren · Anisoara Calinescu · Jakob Foerster
Abstract:
While financial data presents one of the most challenging and interesting sequence modelling tasks due to high noise, heavy tails, and strategic interactions, progress in this area has been hindered by the lack of consensus on quantitative evaluation paradigms. To address this, we present LOB-Bench, a benchmark implemented in Python, designed to evaluate the quality and realism of generative message-by-order data for limit order books (LOB) in the LOBSTER format. Our framework measures distributional differences in conditional and unconditional statistics between generated and real LOB data, supporting flexible multivariate statistical evaluation. The benchmark also includes commonly used LOB statistics such as spread, order book volumes, order imbalance, and message inter-arrival times, along with scores from a trained discriminator network. Lastly, LOB-Bench contains "market impact metrics", i.e., the cross-correlations and price response functions for specific events in the data. We benchmark generative autoregressive state-space models, a (C)GAN, and a parametric LOB model, and find that the autoregressive GenAI approach beats the traditional model classes.
Paperid:2711
Authors:Jongmin Lee · Mario Bravo · Roberto Cominetti
Abstract:
We study a new model-free algorithm to compute $\varepsilon$-optimal policies for average reward Markov decision processes, in the weakly communicating case. Given a generative model, our procedure combines a recursive sampling technique with Halpern's anchored iteration, and computes an $\varepsilon$-optimal policy with sample and time complexity $\widetilde{O}(|\mathcal{S}||\mathcal{A}|||h^*||^{2}/\varepsilon^{2})$ both in high probability and in expectation. To our knowledge, this is the best complexity among model-free algorithms, matching the known lower bound up to a factor of $||h^*||$. Although the complexity bound involves the span seminorm $||h^*||$ of the unknown bias vector, the algorithm requires no prior knowledge and implements a stopping rule which guarantees with probability 1 that the procedure terminates in finite time. We also analyze how these techniques can be adapted to discounted MDPs.
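The anchoring idea can be illustrated on a generic fixed-point map: Halpern's iteration averages each update with the initial point using a vanishing weight, which gives nonasymptotic convergence for nonexpansive operators. Below is a one-dimensional sketch under that generic setting; the paper applies the scheme to Bellman-type operators with sampled transitions, which this sketch does not model.

```python
def halpern(T, x0, iters=500):
    """Halpern's anchored iteration: x_{k+1} = b_k * x0 + (1 - b_k) * T(x_k),
    with anchoring weight b_k = 1/(k+1) shrinking over time."""
    x = x0
    for k in range(1, iters + 1):
        beta = 1.0 / (k + 1)        # weight on the anchor x0
        x = beta * x0 + (1.0 - beta) * T(x)
    return x
```

For the contraction T(x) = 0.5x + 1, with fixed point 2, the iterates approach 2 from the anchor at an O(1/k) rate, the same worst-case rate that holds for merely nonexpansive maps.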
Paperid:2712
Authors:Łukasz Struski · Michal B Bednarczyk · Igor Podolak · Jacek Tabor
Abstract:
We present a novel technique for constructing differentiable order-type operations, including soft ranking, soft top-k selection, and soft permutations. Our approach leverages an efficient closed-form formula for the inverse of the function LapSum, defined as the sum of Laplace distributions. This formulation ensures low computational and memory complexity in selecting the highest activations, enabling losses and gradients to be computed in $O(n \log n)$ time. Through extensive experiments, we demonstrate that our method outperforms state-of-the-art techniques for high-dimensional vectors and large $k$ values. Furthermore, we provide efficient implementations for both CPU and CUDA environments, underscoring the practicality and scalability of our method for large-scale ranking and differentiable ordering problems.
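To give a flavor of the construction, a soft rank can be built by summing Laplace CDFs centered at the inputs; LapSum's contribution is a closed-form inverse of such a sum, which the illustrative sketch below does not implement.

```python
import math

def laplace_cdf(t, mu, b):
    """CDF of a Laplace distribution with location mu and scale b."""
    z = (t - mu) / b
    return 0.5 * math.exp(z) if z < 0 else 1.0 - 0.5 * math.exp(-z)

def soft_ranks(xs, b=0.1):
    """Soft rank of each entry: a smoothed count of entries lying below it.
    As b -> 0 this recovers the hard ranks (up to a 0.5 offset)."""
    return [sum(laplace_cdf(x, y, b) for y in xs) for x in xs]
```

The soft ranks preserve the ordering of the inputs while remaining differentiable in them, which is what makes ranking-based losses trainable by gradient descent.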
Paperid:2713
Authors:Xu Zhang · Haoye Qiu · Weixuan Liang · Hui LIU · Junhui Hou · Yuheng Jia
Abstract:
Ensemble clustering has demonstrated significant success in practical applications; however, its theoretical foundations remain underexplored. This paper examines the generalization performance of ensemble clustering, focusing on generalization error, excess risk and consistency. We derive convergence rates for both the generalization error bound and the excess risk bound of $\mathcal{O}(\sqrt{\frac{\log n}{m}}+\frac{1}{\sqrt{n}})$, with $n$ and $m$ being the numbers of samples and base clusterings. Based on this, we prove that when $m$ and $n$ approach infinity and $m$ is significantly larger than $\log n$, i.e., $m,n\to \infty, m\gg \log n$, ensemble clustering is consistent. Furthermore, recognizing that $n$ and $m$ are finite in practice, the generalization error cannot be reduced to zero. Thus, by assigning varying weights to these finite clusterings, we minimize the error between the empirical average of clusterings and the expectation of clustering. From this, we theoretically demonstrate that to achieve better clustering performance, one should minimize the deviation (bias) of each base clustering from its expectation and maximize the differences (diversity) among the base clusterings. Additionally, we show that maximizing diversity is nearly equivalent to a robust (min-max) optimization model. Finally, we instantiate our theory to develop a new ensemble clustering algorithm. Compared with SOTA methods, our approach achieves average improvements of 6.1%, 7.3%, and 6.0% on 10 datasets w.r.t. the NMI, ARI, and Purity metrics. **The code is included in the Supplementary Material**.
Paperid:2714
Authors:Bishoy Galoaa · Somaieh Amraee · Sarah Ostadabbas
Abstract:
This paper introduces MOTE (MOre Than meets the Eye), a novel multi-object tracking (MOT) algorithm designed to address the challenges of tracking occluded objects. By integrating deformable detection transformers with a custom disocclusion matrix, MOTE significantly enhances the ability to track objects even when they are temporarily hidden from view. The algorithm leverages optical flow to generate features that are processed through a softmax splatting layer, which aids in the creation of the disocclusion matrix. This matrix plays a crucial role in maintaining track consistency by estimating the motion of occluded objects. MOTE's architecture includes modifications to the enhanced track embedding module (ETEM), which allows it to incorporate these advanced features into the track query layer embeddings. This integration ensures that the model not only tracks visible objects but also accurately predicts the trajectories of occluded ones, much like the human visual system. The proposed method is evaluated on multiple datasets, including MOT17, MOT20, and DanceTrack, where it achieves impressive tracking metrics: 82.0 MOTA and 66.3 HOTA on MOT17, 81.7 MOTA and 65.8 HOTA on MOT20, and 93.2 MOTA and 74.2 HOTA on DanceTrack. Notably, MOTE excels in reducing identity switches and maintaining consistent tracking in complex real-world scenarios with frequent occlusions, outperforming existing state-of-the-art methods across all tested benchmarks.
Paperid:2715
Authors:Han Li · Fei Liu · Zhi Zheng · Yu Zhang · Zhenkun Wang
Abstract:
Vehicle routing problems (VRPs) are significant combinatorial optimization (CO) problems of substantial practical importance. Recently, neural combinatorial optimization (NCO), which involves training deep learning models on extensive data to learn vehicle routing heuristics, has emerged as a promising approach due to its efficiency and the reduced need for manual algorithm design. Current cross-problem NCO methods for VRPs typically employ a constraint-unaware model, limiting their cross-problem performance. Furthermore, they rely solely on global connectivity, which fails to focus on key nodes and leads to inefficient representation learning. This paper introduces a Constraint-Aware Dual-Attention Model (CaDA), designed to address these limitations. CaDA incorporates a constraint prompt that efficiently represents different problem variants. Additionally, it features a dual-attention mechanism with a global branch for capturing broader graph-wide information and a sparse branch that selectively focuses on key node connections. We comprehensively evaluate our model on 16 different VRPs and compare its performance against existing cross-problem VRP solvers. CaDA achieves state-of-the-art results across all tested VRPs. Our ablation study confirms that each component contributes to its cross-problem learning performance.
Paperid:2716
Authors:Ziyu Ye · Rishabh Agarwal · Tianqi Liu · Rishabh Joshi · Sarmishta Velury · Quoc Le · Qijun Tan · Yuan Liu
Abstract:
Existing RL post-training for large language models (LLMs) relies on a pre-curated, static prompt distribution, which bottlenecks scalability. Prior works have explored prompt evolving, but are often limited to the SFT stage, where prompts are sampled and evolved uniformly, without effective signals. This empirical work presents a paradigm shift: Evolving Alignment via Asymmetric Self-Play (eva), which casts post-training as an infinite game with reward-based regret objectives for 2 players: (i) a creator, who strategically samples and creates new informative prompts, and (ii) a solver, who learns to produce preferred responses. eva is the first technique that integrates seamlessly in both online and offline RL post-training, simply built with a prioritized generative buffer. The design is easy to use yet remarkably effective: eva sets a new SOTA on challenging benchmarks without any extra human prompts, e.g., it boosts the win rate of gemma-2-9b-it on Arena-Hard from 51.6% to 60.1% for DPO and from 52.6% to 62.4% for RLOO, surpassing claude-3-opus and catching up to gemini-1.5-pro, both of which are orders of magnitude larger. Extensive experiments show eva can create effective RL curricula and is robust across ablations. We believe adaptively evolving prompts are key to designing next-generation RL post-training at scale.
Paperid:2717
Authors:Chao Ma · Wenbo Gong · Meyer Scetbon · Edward Meeds
Abstract:
Adaptive optimizers such as Adam (Kingma & Ba, 2015) have been central to the success of large language models. However, they often require maintaining optimizer states throughout training, which can result in memory requirements several times greater than the model footprint. This overhead imposes constraints on scalability and computational efficiency. Stochastic Gradient Descent (SGD), in contrast, is a stateless optimizer, as it does not track state variables during training. Consequently, it achieves optimal memory efficiency. However, its capability in LLM training is limited (Zhao et al., 2024b). In this work, we show that preprocessing SGD in a stateless manner can achieve the same performance as the Adam optimizer for LLM training, while drastically reducing the memory cost. Specifically, we propose to pre-process the instantaneous stochastic gradients using normalization and whitening. We show that normalization stabilizes gradient distributions, and whitening counteracts the local curvature of the loss landscape. This results in SWAN (SGD with Whitening And Normalization), a stochastic optimizer that eliminates the need to store any optimizer states. Empirically, SWAN has the same memory footprint as SGD, achieving a ≈50% reduction in total end-to-end memory compared to Adam. In language modeling tasks, SWAN demonstrates comparable or even better performance than Adam: when pre-training LLaMA models with 350M and 1.3B parameters, SWAN reaches the same evaluation perplexity using half as many tokens (2x speedup).
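A minimal sketch of the two preprocessing steps, stated generically: row normalization followed by whitening via an eigendecomposition of the row Gram matrix. The paper's exact procedure may differ, for instance in how the inverse square root is computed.

```python
import numpy as np

def preprocess_gradient(grad, eps=1e-8):
    """Stateless gradient preprocessing in the spirit of SWAN:
    (1) normalize rows to stabilize the gradient scale,
    (2) whiten so the result has identity row covariance."""
    g = grad / (np.linalg.norm(grad, axis=1, keepdims=True) + eps)
    gram = g @ g.T                       # row Gram matrix G G^T
    w, v = np.linalg.eigh(gram)          # symmetric eigendecomposition
    inv_sqrt = v @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ v.T
    return inv_sqrt @ g                  # (G G^T)^{-1/2} G
```

Because both steps are recomputed from the current gradient alone, nothing persists between iterations: unlike Adam's first and second moment buffers, no optimizer state is stored.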
Paperid:2718
Authors:Joe Suk · Jung-hun Kim
Abstract:
We study an infinite-armed bandit problem where actions' mean rewards are initially sampled from a reservoir distribution. Most prior works in this setting focused on stationary rewards (Berry et al., 1997; Wang et al., 2008; Bonald and Proutiere, 2013; Carpentier and Valko, 2015), with the more challenging adversarial/non-stationary variant only recently studied in the context of rotting/decreasing rewards (Kim et al., 2022; 2024). Furthermore, optimal regret upper bounds were only achieved using parameter knowledge of the non-stationarity and were only known for certain regimes of regularity of the reservoir. This work shows the first parameter-free optimal regret bounds for all regimes while also relaxing distributional assumptions on the reservoir. We first introduce a black-box scheme to convert a finite-armed MAB algorithm designed for near-stationary environments into a parameter-free algorithm for the infinite-armed non-stationary problem with optimal regret guarantees. We next study a natural notion of significant shift for this problem, inspired by recent developments in finite-armed MAB (Suk & Kpotufe, 2022). We show that tighter regret bounds in terms of significant shifts can be adaptively attained by employing a randomized variant of elimination within our black-box scheme. Our enhanced rates only depend on the rotting non-stationarity and thus exhibit an interesting phenomenon for this problem where rising rewards do not factor into the difficulty of non-stationarity.
Paperid:2719
Authors:Alina Shutova · Vladimir Malinovskii · Ivan Ermakov · Surkov Nikita · Denis Mazur · Denis Kuznedelev · Vage Egiazarian · Dan Alistarh
Abstract:
Efficient real-world deployments of large language models (LLMs) rely on Key-Value (KV) caching for processing and generating long outputs, reducing the need for repetitive computation. For large contexts, Key-Value caches can take up tens of gigabytes of device memory, as they store vector representations for each token and layer. Recent work has shown that the cached vectors can be compressed through quantization, pruning or merging, but these techniques often compromise quality at higher compression rates. In this work, we aim to improve Key and Value compression by exploiting two observations: 1) the inherent dependencies between keys and values across different layers, and 2) high-compression mechanisms for internal network states. We propose AQUA-KV, an adaptive quantization scheme for Key-Value caches that relies on compact adapters to exploit existing dependencies between Keys and Values, and aims to ``optimally'' compress the information that cannot be predicted. AQUA-KV significantly improves compression rates while maintaining high accuracy on state-of-the-art LLM families. On Llama 3.2 LLMs, we achieve near-lossless inference at 2-2.5 bits per value with under $1\%$ relative error in perplexity and LongBench scores. AQUA-KV is one-shot, simple, and efficient: it can be calibrated on a single GPU within 1-6 hours, even for 70B models.
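For context, the generic per-channel quantize/dequantize baseline that such methods build on looks roughly as follows; this is a sketch of uniform affine quantization, not AQUA-KV's adapter-based scheme.

```python
import numpy as np

def quantize_dequantize(x, bits):
    """Uniform affine quantization per row (e.g., per channel of a cached
    Key or Value matrix), followed by dequantization."""
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    levels = 2 ** bits - 1
    scale = np.maximum((hi - lo) / levels, 1e-12)
    q = np.round((x - lo) / scale)       # integer codes in [0, levels]
    return q * scale + lo                # reconstructed values
```

At 2 bits per value, the reconstruction error of this naive baseline is substantial, which is why methods in this space additionally exploit structure, such as the inter-layer Key/Value dependencies described above, rather than quantizing each vector independently.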
Paperid:2720
Authors:Gianluigi Silvestri · Luca Ambrogioni · Chieh-Hsin Lai · Yuhta Takida · Yuki Mitsufuji
Abstract:
Consistency Training (CT) has recently emerged as a promising alternative to diffusion models, achieving competitive performance in image generation tasks. However, non-distillation consistency training often suffers from high variance and instability, and analyzing and improving its training dynamics is an active area of research. In this work, we propose a novel CT training approach based on the Flow Matching framework. Our main contribution is a trained noise-coupling scheme inspired by the architecture of Variational Autoencoders (VAEs). By training a data-dependent noise emission model, implemented as an encoder architecture, our method can indirectly learn the geometry of the noise-to-data mapping, which is instead fixed by the choice of the forward process in classical CT. Empirical results across diverse image datasets show significant generative improvements, with our model outperforming baselines and achieving the state-of-the-art (SoTA) non-distillation CT FID on CIFAR-10, and attaining an FID on par with SoTA on ImageNet at 64x64 resolution in 2-step generation.
Paperid:2721
Authors:Kuan Li · Liwen Zhang · Yong Jiang · Pengjun Xie · Fei Huang · Shuai Wang · Minhao Cheng
Abstract:
Effectively incorporating external knowledge into Large Language Models (LLMs) is crucial for enhancing their capabilities and addressing real-world needs. Retrieval-Augmented Generation (RAG) offers an effective method for achieving this by retrieving the most relevant fragments into LLMs. However, the advancements in context window size for LLMs offer an alternative approach, raising the question of whether RAG remains necessary for effectively handling external knowledge. Several existing studies provide inconclusive comparisons between RAG and long-context (LC) LLMs, largely due to limitations in benchmark designs. In this paper, we present LaRA, a novel benchmark specifically designed to rigorously compare RAG and LC LLMs. LaRA encompasses 2,326 test cases across four practical QA task categories and three types of naturally occurring long texts. Through systematic evaluation of seven open-source and four proprietary LLMs, we find that the optimal choice between RAG and LC depends on a complex interplay of factors, including the model's parameter size, long-text capabilities, context length, task type, and the characteristics of the retrieved chunks. Our findings provide actionable guidelines for practitioners to effectively leverage both RAG and LC approaches in developing and deploying LLM applications.
Paperid:2722
Authors:Zhongnian Li · Jinghao Xu · Peng Ying · Meng Wei · Xinzheng Xu
Abstract:
Pretrained Vision-Language Models (VLMs) exhibit strong zero-shot classification abilities, demonstrating great potential for generating weakly supervised labels. Unfortunately, existing weakly supervised learning methods lack the ability to generate reliable labels via VLMs. In this paper, we propose a novel weakly supervised labeling setting, namely True-False Labels (TFLs), which can achieve high accuracy when generated by VLMs. The TFL indicates whether an instance belongs to the label, which is randomly and uniformly sampled from the candidate label set. Specifically, we theoretically derive a risk-consistent estimator to explore and utilize the conditional probability distribution information of TFLs. Besides, we propose a convolution-based Multi-modal Prompt Retrieving (MRP) method to bridge the gap between the knowledge of VLMs and target learning tasks. Experimental results demonstrate the effectiveness of the proposed TFL setting and MRP learning method. Our code will be available soon.
Paperid:2723
Authors:Keyue Qiu · Yuxuan Song · Jie Yu · Hongbo Ma · Ziyao Cao · Zhilong Zhang · Yushuai Wu · Mingyue Zheng · Hao Zhou · Wei-Ying Ma
Abstract:
Structure-based molecule optimization (SBMO) aims to optimize molecules with both continuous coordinates and discrete types against protein targets. A promising direction is to exert gradient guidance on generative models given its remarkable success in images, but it is challenging to guide discrete data and risks inconsistencies between modalities. To this end, we leverage a continuous and differentiable space derived through Bayesian inference, presenting Molecule Joint Optimization (MolJO), a gradient-based SBMO framework that facilitates joint guidance signals across different modalities while preserving SE(3)-equivariance. We introduce a novel backward correction strategy that optimizes within a sliding window of past histories, allowing for a seamless trade-off between exploration and exploitation during optimization. MolJO achieves state-of-the-art performance on the CrossDocked2020 benchmark (Success Rate 51.3\%, Vina Dock -9.05 and SA 0.78), more than a 4x improvement in Success Rate compared to its gradient-based counterpart, and a 2x ``Me-Better'' Ratio relative to 3D baselines. Furthermore, we extend MolJO to a wide range of optimization settings, including multi-objective optimization and challenging tasks in drug design such as R-group optimization and scaffold hopping, further underscoring its versatility.
Paperid:2724
Authors:Xingpei Ma · Jiaran Cai · Yuansheng Guan · Shenneng Huang · Qiang Zhang · Shunsi Zhang
Abstract:
Recent diffusion-based talking face generation models have demonstrated impressive potential in synthesizing videos that accurately match a speech audio clip with a given reference identity. However, existing approaches still encounter significant challenges due to uncontrollable factors, such as inaccurate lip-sync, inappropriate head posture and the lack of fine-grained control over facial expressions. In order to introduce more face-guided conditions beyond speech audio clips, a novel two-stage training framework Playmate is proposed to generate more lifelike facial expressions and talking faces. In the first stage, we introduce a decoupled implicit 3D representation along with a meticulously designed motion-decoupled module to facilitate more accurate attribute disentanglement and generate expressive talking videos directly from audio cues. Then, in the second stage, we introduce an emotion-control module to encode emotion control information into the latent space, enabling fine-grained control over emotions and thereby achieving the ability to generate talking videos with desired emotion. Extensive experiments demonstrate that Playmate outperforms existing state-of-the-art methods in terms of video quality and lip-synchronization, and improves flexibility in controlling emotion and head pose. The code will be available at https://playmate111.github.io.
Paperid:2725
Authors:Xinyang Li · Siqi Liu · Bochao Zou · Jiansheng Chen · Huimin Ma
Abstract:
As large language models evolve, there is growing anticipation that they will emulate human-like Theory of Mind (ToM) to assist with routine tasks. However, existing methods for evaluating machine ToM focus primarily on unimodal models and largely treat these models as black boxes, lacking an interpretative exploration of their internal mechanisms. In response, this study adopts an approach based on internal mechanisms to provide an interpretability-driven assessment of ToM in multimodal large language models (MLLMs). Specifically, we first construct a multimodal ToM test dataset, GridToM, which incorporates diverse belief testing tasks and perceptual information from multiple perspectives. Next, our analysis shows that attention heads in multimodal large models can distinguish cognitive information across perspectives, providing evidence of ToM capabilities. Furthermore, we present a lightweight, training-free approach that significantly enhances the model's exhibited ToM by adjusting model activations along the directions identified by these attention heads.
Paperid:2726
Authors:Taehyun Cho · Seungyub Han · Seokhun Ju · Dohyeong Kim · Kyungjae Lee · Jungwoo Lee
Abstract:
Distributional reinforcement learning improves performance by capturing environmental stochasticity, but a comprehensive theoretical understanding of its effectiveness remains elusive. In addition, the intractable element of the infinite dimensionality of distributions has been overlooked. In this paper, we present a regret analysis of distributional reinforcement learning with general value function approximation in a finite episodic Markov decision process setting. We first introduce a key notion of Bellman unbiasedness which is essential for exactly learnable and provably efficient distributional updates in an online manner. Among all types of statistical functionals for representing infinite-dimensional return distributions, our theoretical results demonstrate that only moment functionals can exactly capture the statistical information. Secondly, we propose a provably efficient algorithm, SF-LSVI, that achieves a tight regret bound of $\tilde{O}(d_E H^{\frac{3}{2}}\sqrt{K})$ where $H$ is the horizon, $K$ is the number of episodes, and $d_E$ is the eluder dimension of a function class.
Paperid:2727
Authors:Zihao Wang · Yibo Jiang · Jiahao Yu · Heqing Huang
Abstract:
Large language models (LLMs) that integrate multiple input roles (e.g., system instructions, user queries, external tool outputs) are increasingly prevalent in practice. Ensuring that the model accurately distinguishes messages from each role—a concept we call role separation—is crucial for consistent multi-role behavior. Although recent work often targets state-of-the-art prompt injection defenses, it remains unclear whether such methods truly teach LLMs to differentiate roles or merely memorize known triggers. In this paper, we examine role-separation learning: the process of teaching LLMs to robustly distinguish system and user tokens. Through a simple, controlled experimental framework, we find that fine-tuned models often rely on two proxies for role identification: (1) task type exploitation, and (2) proximity to begin-of-text. Although data augmentation can partially mitigate these shortcuts, it generally leads to iterative patching rather than a deeper fix. To address this, we propose reinforcing invariant signals that mark role boundaries by adjusting token-wise cues in the model's input encoding. In particular, modifying position IDs helps the model learn clearer distinctions and reduces reliance on superficial proxies. By focusing on this mechanism-centered perspective, our work illuminates how LLMs can more reliably maintain consistent multi-role behavior without merely memorizing known prompts or triggers.
Paperid:2728
Authors:Omayma Mahjoub · Sasha Abramowitz · Ruan de Kock · Wiem Khlifi · Simon Du Toit · Jemma Daniel · Louay Nessir · Juan Formanek · Louise Beyers · Liam Clark · Arnu Pretorius
Abstract:
As multi-agent reinforcement learning (MARL) progresses towards solving larger and more complex problems, it becomes increasingly important that algorithms exhibit the key properties of (1) strong performance, (2) memory efficiency and (3) scalability. In this work, we introduce Sable, a performant, memory efficient and scalable sequence modeling approach to MARL. Sable works by adapting the retention mechanism in Retentive Networks (Sun et al., 2023) to achieve computationally efficient processing of multi-agent observations with long context memory for temporal reasoning. Through extensive evaluations across six diverse environments, we demonstrate how Sable is able to significantly outperform existing state-of-the-art methods in a large number of diverse tasks (34 out of 45 tested). Furthermore, Sable maintains performance as we scale the number of agents, handling environments with more than a thousand agents while exhibiting a linear increase in memory usage. Finally, we conduct ablation studies to isolate the source of Sable's performance gains and confirm its efficient computational memory usage. All experimental data and code is available at: https://sites.google.com/view/sable-marl.
Paperid:2729
Authors:Mohamad Al Ahdab · john leth · Zheng-Hua Tan
Abstract:
We study the Continuous-Discrete Kalman Filter (CD-KF) in Bayesian State-Space Models (SSMs) where continuous-time dynamics are observed via multiple sensors with discrete, irregularly timed measurements. Our focus extends to scenarios in which the measurement process is coupled with the states of an auxiliary state-space model. For instance, higher measurement rates may increase energy consumption or heat generation, while a sensor's accuracy can depend on its own spatial trajectory or that of the measured target. Each sensor thus carries distinct costs and constraints associated with its measurement rate, as well as additional constraints and costs on the auxiliary state. We model measurement occurrences as independent Poisson processes with sensor-specific rates and derive an upper bound on the mean posterior covariance matrix of the CD-KF, which is continuously differentiable with respect to these rates. This differentiability enables efficient gradient-based optimization. Exploiting this bound, we propose a finite-horizon optimal control framework to jointly optimize measurement rates, auxiliary-state dynamics, and covariance constraints. We further introduce a deterministic method for scheduling actual measurement times from the optimized rates. Empirical results in state-space filtering and dynamic temporal Gaussian process regression demonstrate that our approach achieves improved trade-offs between resource usage and estimation accuracy, underscoring its practical effectiveness.
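The abstract does not specify its deterministic scheduling rule, but one common heuristic for turning an optimized rate into concrete measurement times is to fire a measurement each time the integrated intensity reaches one, matching the expected count of a Poisson process with the same rate. A minimal sketch under that assumption (`schedule_times` and `rate_fn` are illustrative names, not the paper's API):

```python
def schedule_times(rate_fn, horizon, dt=1e-3):
    """Place measurement times deterministically so that the integrated
    intensity between consecutive measurements equals one."""
    times, acc, t = [], 0.0, 0.0
    while t < horizon:
        acc += rate_fn(t) * dt  # forward-Euler integral of the rate
        if acc >= 1.0:
            times.append(t)
            acc -= 1.0
        t += dt
    return times

# A constant rate of 2 Hz over 5 seconds yields 10 roughly even measurements
print(len(schedule_times(lambda t: 2.0, 5.0)))  # 10
```

A time-varying `rate_fn` concentrates measurements where the optimized rate is high, which is the behavior a rate-to-schedule conversion needs.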
Paperid:2730
Authors:Zhihan Zhu · Yanhao Zhang · Yong Xia
Abstract:
This paper introduces two novel criteria: one for feature selection and another for feature elimination in the context of best subset selection, which is a benchmark problem in statistics and machine learning. From the perspective of optimization, we revisit the classical selection and elimination criteria in traditional best subset selection algorithms, revealing that these classical criteria capture only partial variations of the objective function after the entry or exit of features. By formulating and solving optimization subproblems for feature entry and exit exactly, new selection and elimination criteria are proposed, proven to be the optimal decisions for the current entry-and-exit process compared to classical criteria. Replacing the classical selection and elimination criteria with the proposed ones generates a series of enhanced best subset selection algorithms. These generated algorithms not only preserve the theoretical properties of the original algorithms but also achieve significant meta-gains without increasing computational cost across various scenarios and evaluation metrics on multiple tasks such as compressed sensing and sparse regression.
Paperid:2731
Authors:Rafał Karczewski · Markus Heinonen · Vikas Garg
Abstract:
Diffusion models have emerged as a powerful class of generative models, capable of producing high-quality images by mapping noise to a data distribution. However, recent findings suggest that image likelihood does not align with perceptual quality: high-likelihood samples tend to be smooth, while lower-likelihood ones are more detailed. Controlling sample density is thus crucial for balancing realism and detail. In this paper, we analyze an existing technique, Prior Guidance, which scales the latent code to influence image detail. We introduce score alignment, a condition that explains why this method works and show that it can be tractably checked for any continuous normalizing flow model. We then propose Density Guidance, a principled modification of the generative ODE that enables exact log-density control during sampling. Finally, we extend Density Guidance to stochastic sampling, ensuring precise log-density control while allowing controlled variation in structure or fine details. Our experiments demonstrate that these techniques provide fine-grained control over image detail without compromising sample quality.
Paperid:2732
Authors:Yuecheng Li · Lele Fu · Tong Wang · Jian Lou · Bin Chen · Lei Yang · Jian Shen · Zibin Zheng · Chuan Chen
Abstract:
To defend against privacy leakage of user data, differential privacy is widely used in federated learning, but it is not free. The addition of noise randomly disrupts the semantic integrity of the model and this disturbance accumulates with increased communication rounds. In this paper, we introduce a novel federated learning framework with rigorous privacy guarantees, named **FedCEO**, designed to strike a trade-off between model utility and user privacy by letting clients "***C**ollaborate with **E**ach **O**ther*". Specifically, we perform efficient tensor low-rank proximal optimization on stacked local model parameters at the server, demonstrating its capability to flexibly truncate high-frequency components in spectral space. This capability implies that our FedCEO can effectively recover the disrupted semantic information by smoothing the global semantic space for different privacy settings and continuous training processes. Moreover, we improve the SOTA utility-privacy trade-off bound by an order of $\sqrt{d}$, where $d$ is the input dimension. We illustrate our theoretical results with experiments on representative image datasets and observe significant performance improvements and strict privacy guarantees under different privacy settings. Our **code** is available in the supplementary material.
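The low-rank proximal step described above can be illustrated, in simplified matrix (rather than tensor) form, by the standard proximal operator of the nuclear norm: stack the noisy client parameters, soft-threshold the singular values, and the small spectral components that DP noise tends to populate are truncated away. The names and the matrix simplification below are assumptions; the paper's exact tensor operator differs.

```python
import numpy as np

def low_rank_smooth(stacked_params, tau):
    """Proximal step of the nuclear norm: soft-threshold singular values,
    truncating the small (high-frequency) spectral components that
    additive noise tends to populate."""
    U, s, Vt = np.linalg.svd(stacked_params, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(0)
# Clients share one signal direction; per-client noise perturbs each copy
clean = np.outer(np.ones(8), rng.normal(size=32))
noisy = clean + 0.1 * rng.normal(size=clean.shape)
smoothed = low_rank_smooth(noisy, tau=1.0)
print(np.linalg.matrix_rank(smoothed) < np.linalg.matrix_rank(noisy))  # True
```

Larger `tau` (here a hypothetical smoothing strength) truncates more aggressively, mirroring the trade-off between noise removal and preserved client heterogeneity.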
Paperid:2733
Authors:Kayo Yin · Jacob Steinhardt
Abstract:
Large language models (LLMs) exhibit impressive in-context learning (ICL) capability, enabling them to generate relevant responses from a handful of task demonstrations in the prompt. Prior studies have suggested two different explanations for the mechanisms behind ICL: induction heads that find and copy relevant tokens, and function vector (FV) heads whose activations compute a latent encoding of the ICL task. To better understand which of the two distinct mechanisms drives ICL, we study and compare induction heads and FV heads in 12 language models. Through detailed ablations, we find that few-shot ICL is driven primarily by FV heads, especially in larger models. We also find that FV and induction heads are connected: many FV heads start as induction heads during training before transitioning to the FV mechanism. This leads us to speculate that induction facilitates learning the more complex FV mechanism for ICL.
Paperid:2734
Authors:Nicholas Goldowsky-Dill · Bilal Chughtai · Stefan Heimersheim · Marius Hobbhahn
Abstract:
AI models might use deceptive strategies as part of scheming or misaligned behaviour. Monitoring outputs alone is insufficient, since the AI might produce seemingly benign outputs while their internal reasoning is misaligned. We thus evaluate if linear probes can robustly detect deception by monitoring model activations. We test two probe-training datasets, one with contrasting instructions to be honest or deceptive (following Zou et al., 2023) and one of responses to simple roleplaying scenarios. We test whether these probes generalize to realistic settings where Llama-3.3-70B-Instruct behaves deceptively, such as concealing insider trading (Scheurer, 2023) and purposefully underperforming on safety evaluations (Benton, 2024). We find that our probe distinguishes honest and deceptive responses with AUROCs between 0.96 and 0.999 on our evaluation datasets. If we set the decision threshold to have a 1\% false positive rate on chat data not related to deception, our probe catches 95-99\% of the deceptive responses. Overall we think white-box probes are promising for future monitoring systems, but current performance is insufficient as a robust defence against deception. Our probes' outputs can be viewed at REDACTED and our code at REDACTED.
Paperid:2735
Authors:Qi Zhou · Tianlin Li · Qing Guo · Dongxia Wang · Yun Lin · Yang Liu · Jin Song Dong
Abstract:
Recent studies have raised significant concerns regarding the vulnerability of Large Vision Language Models (LVLMs) to maliciously injected or perturbed input images, which can mislead their responses. Existing defense methods show that such vision attacks are sensitive to image modifications especially cropping, using majority voting across responses of modified images as corrected responses. However, these modifications often result in partial images and distort the semantics, which reduces response quality on clean images after voting. Instead of directly using responses from partial images for voting, we investigate using them to supervise the LVLM's responses to the original images. We propose a black-box, training-free method called DPS (Defense through Partial-Perception Supervision). In this approach, the model is prompted using the responses generated by a model that perceives only a partial image. With DPS, the model can adjust its response based on partial image understanding when under attack, while confidently maintaining its original response for clean input. Our findings show that the weak model can supervise the strong model: when faced with an attacked input, the strong model becomes less confident and adjusts its response based on the weak model's partial understanding, effectively defending against the attack. With clean input, it confidently maintains its original response. Empirical experiments show our method outperforms the baseline, cutting the average attack success rate by 76.3% across six datasets on three popular models.
Paperid:2736
Authors:YUANYUAN YAO · Yuan Dong · Lu Chen · Kun Kuang · Ziquan Fang · Cheng Long · Yunjun Gao · TIANYI LI
Abstract:
Current causal discovery methods for time series data can effectively address a variety of scenarios; however, they remain constrained by inefficiencies. The significant inefficiencies arise primarily from the high computational costs associated with binning, the uncertainty in selecting appropriate time lags, and the extensive sets of candidate variables. To achieve both high efficiency and effectiveness of causal discovery, we introduce an accelerator termed Arrow. It incorporates an innovative concept termed "Time Weaving" that efficiently encodes time series data to well capture the dynamic trends, thereby mitigating computational complexity. We also propose a novel time lag discovery strategy utilizing XOR operations, deriving a theorem to obtain the optimal time lag and significantly enhancing efficiency. To optimize the search space for causal relationships, we design an efficient pruning strategy that intelligently identifies the most relevant candidate variables, enhancing the efficiency and accuracy of causal discovery. We applied Arrow to four different types of time series causal discovery algorithms and evaluated it on 25 synthetic and real-world datasets. The results demonstrate that, compared to the original algorithms, Arrow achieves up to 153x speedup while achieving higher accuracy in most cases. The code is available at https://anonymous.4open.science/r/arrow.
Paperid:2737
Authors:Weiyu Guo · Ziyue Qiao · Ying Sun · Yijie Xu · Hui Xiong
Abstract:
Gesture recognition based on surface electromyography (sEMG) has been gaining importance in many 3D Interactive Scenes. However, sEMG is easily influenced by various forms of noise in real-world environments, leading to challenges in providing long-term stable interactions through sEMG. Existing methods often struggle to enhance model noise resilience through various predefined data augmentation techniques. In this work, we revisit the problem from a short-term enhancement perspective to improve precision and robustness against various common noisy scenarios, with learnable denoising that uses sEMG's intrinsic pattern information and sliding-window attention. We propose a Short Term Enhancement Module (STEM), which can be easily integrated with various models. STEM offers several benefits: 1) Noise resistance, enhanced robustness against noise without manual data augmentation; 2) Adaptability, adaptable to various models; and 3) Inference efficiency, achieving short-term enhancement through minimal weight-sharing in an efficient attention mechanism. In particular, we incorporate STEM into a transformer, creating the Short-Term Enhanced Transformer (STET). Compared with the best-competing approaches, the impact of noise on STET is reduced by more than 20\%. We report promising results on classification and regression tasks and demonstrate that STEM generalizes across different gesture recognition tasks. The code is available at https://anonymous.4open.science/r/shorttermsemg.
Paperid:2738
Authors:Yu Zhang · Shanshan Zhao · Bokui Wan · Jinjuan Wang · Xiaodong Yan
Abstract:
Detecting a minor average treatment effect is a major challenge in large-scale applications, where even minimal improvements can have a significant economic impact. Traditional methods, reliant on normal distribution-based or expanded statistics, often fail to identify such minor effects because of their inability to handle small discrepancies with sufficient sensitivity. This work leverages a counterfactual outcome framework and proposes a maximum probability-driven two-armed bandit process by weighting the mean volatility statistic, which controls Type I error. The implementation of permutation methods further enhances the robustness and efficacy. The established strategic central limit theorem (SCLT) demonstrates that our approach yields a more concentrated distribution under the null hypothesis and a less concentrated one under the alternative hypothesis, greatly improving statistical power. The experimental results indicate a significant improvement in A/B testing, highlighting the potential to reduce experimental costs while maintaining high statistical power.
Paperid:2739
Authors:Adrià López Escoriza · Nicklas Hansen · Stone Tao · Tongzhou Mu · Hao Su
Abstract:
Long-horizon tasks in robotic manipulation present significant challenges in reinforcement learning (RL) due to the difficulty of designing dense reward functions and effectively exploring the expansive state-action space. However, despite a lack of dense rewards, these tasks often have a multi-stage structure, which can be leveraged to decompose the overall objective into manageable sub-goals. In this work, we propose DEMO³, a framework that exploits this structure for efficient learning from visual inputs. Specifically, our approach incorporates multi-stage dense reward learning, a bi-phasic training scheme, and world model learning into a carefully designed demonstration-augmented RL framework that strongly mitigates the challenge of exploration in long-horizon tasks. Our evaluations demonstrate that our method improves data-efficiency by an average of 40% and by 70% on particularly difficult tasks compared to state-of-the-art approaches. We validate this across 16 sparse-reward tasks spanning four domains, including challenging humanoid visual control tasks using as few as five demonstrations.
Paperid:2740
Authors:Muyang Cao · Jiajun Yu · Xin Du · Gang Pan · Wei Wang
Abstract:
The Metric Nearness Problem involves restoring a non-metric matrix to its closest metric-compliant form, addressing issues such as noise, missing values, and data inconsistencies. Ensuring metric properties, particularly the $O(N^3)$ triangle inequality constraints, presents significant computational challenges, especially in large-scale scenarios where traditional methods suffer from high time and space complexity. We propose a novel solution based on the tropical inner product (max-plus operation), which we prove satisfies the triangle inequality for non-negative real matrices. By transforming the problem into a continuous optimization task, our method directly minimizes the distance to the target matrix using the RMSProp algorithm. This approach not only restores metric properties but also generates metric-preserving embeddings, enabling real-time updates and reducing computational and storage overhead for downstream tasks. Experimental results demonstrate that our method achieves up to 1000× speed improvements over state-of-the-art approaches, and efficiently scales from $1e4 \times 1e4$ to $1e5 \times 1e5$ matrices with significantly lower memory usage.
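The core claim, that the tropical (max-plus) inner product of a non-negative matrix with itself satisfies the triangle inequality, can be checked numerically in a few lines of numpy. The sketch below is illustrative only (matrix size and names are arbitrary): with $D_{ij} = \max_k (A_{ik} + A_{jk})$ and $A \ge 0$, taking the maximizing index $m^*$ for $D_{ij}$ gives $D_{ik} + D_{kj} \ge A_{im^*} + A_{jm^*} = D_{ij}$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(0, 1, size=(6, 4))  # arbitrary non-negative matrix

# Tropical (max-plus) inner product: D[i, j] = max_k (A[i, k] + A[j, k])
D = (A[:, None, :] + A[None, :, :]).max(axis=2)

# Verify D[i, j] <= D[i, k] + D[k, j] for every triple (i, j, k)
n = D.shape[0]
ok = all(
    D[i, j] <= D[i, k] + D[k, j] + 1e-12
    for i in range(n) for j in range(n) for k in range(n)
)
print(ok)  # True
```

Because the check never fails for non-negative `A`, any matrix produced this way is automatically metric-compliant, which is what lets the method trade constrained projection for unconstrained continuous optimization.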
Paperid:2741
Authors:Ignavier Ng · Yan Li · Zijian Li · Yujia Zheng · Guangyi Chen · Kun Zhang
Abstract:
A central problem in unsupervised domain adaptation is determining what to transfer from labeled source domains to an unlabeled target domain. To handle high-dimensional observations (e.g., images), a line of approaches use deep learning to learn latent representations of the observations, which facilitate knowledge transfer in the latent space. However, existing approaches often rely on restrictive assumptions to establish identifiability of the joint distribution in the target domain, such as independent latent variables or invariant label distributions, limiting their real-world applicability. In this work, we propose a general domain adaptation framework that learns compact latent representations to capture distribution shifts relative to the prediction task and address the fundamental question of what representations should be learned and transferred. Notably, we first demonstrate that learning representations based on all the predictive information, i.e., the label's Markov blanket in terms of the learned representations, is often underspecified in general settings. Instead, we show that, interestingly, general domain adaptation can be achieved by partitioning the representations of the Markov blanket into those of the label's parents, children, and spouses. Moreover, its identifiability guarantee can be established. Building on these theoretical insights, we develop a practical, nonparametric approach for domain adaptation in a general setting, which can handle different types of distribution shifts.
Paperid:2742
Authors:Zhengrui Guo · Qichen Sun · Jiabo MA · Lishuang Feng · Jinzhuo Wang · Hao Chen
Abstract:
Whole slide image (WSI) analysis presents significant computational challenges due to the massive number of patches in gigapixel images. While transformer architectures excel at modeling long-range correlations through self-attention, their quadratic computational complexity makes them impractical for computational pathology applications. Existing solutions like local-global or linear self-attention reduce computational costs but compromise the strong modeling capabilities of full self-attention. In this work, we propose Querent, i.e., the query-aware long contextual dynamic modeling framework, which maintains the expressive power of full self-attention while achieving practical efficiency. Our method adaptively predicts which surrounding regions are most relevant for each patch, enabling focused yet unrestricted attention computation only with potentially important contexts. By using efficient region-wise metadata computation and importance estimation, our approach dramatically reduces computational overhead while preserving global perception to model fine-grained patch correlations. Through comprehensive experiments on biomarker prediction, gene mutation prediction, cancer subtyping, and survival analysis across over 10 WSI datasets, our method demonstrates superior performance compared to the state-of-the-art approaches.
Paperid:2743
Authors:Fanmeng Wang · Wentao Guo · Qi Ou · Hongshuai Wang · Haitao Lin · Hongteng Xu · Zhifeng Gao
Abstract:
Polymer conformation generation is a critical task that enables atomic-level studies of diverse polymer materials. While significant advances have been made in conformation generation methods for small molecules and proteins, these methods struggle to generate polymer conformations due to polymers' unique structural characteristics. The scarcity of polymer conformation datasets further limits the progress of this unexplored research area. In this work, we propose PolyConf, a pioneering tailored polymer conformation generation method that leverages hierarchical generative models to unlock new possibilities for this task. Specifically, we decompose the polymer conformation into a series of local conformations (i.e., the conformations of its repeating units), generating these local conformations through an autoregressive model. We then generate corresponding orientation transformations via a diffusion model to assemble these local conformations into the complete polymer conformation. Moreover, we develop the first benchmark with a high-quality polymer conformation dataset derived from molecular dynamics simulations to boost related research in this area. Comprehensive evaluation demonstrates that PolyConf consistently generates high-quality, physically reliable polymer conformations, facilitating advancements in polymer modeling and simulation.
Paperid:2744
Authors:Hongyi Zhou · Josiah Hanna · Jin Zhu · Ying Yang · Chengchun Shi
Abstract:
This paper studies off-policy evaluation (OPE) in reinforcement learning with a focus on behavior policy estimation for importance sampling. Prior work has shown empirically that estimating a history-dependent behavior policy can lead to lower mean squared error (MSE) even when the true behavior policy is Markovian. However, the question of why the use of history should lower MSE remains open. In this paper, we theoretically demystify this paradox by deriving a bias-variance decomposition of the MSE of ordinary importance sampling (IS) estimators, demonstrating that history-dependent behavior policy estimation decreases their asymptotic variances while increasing their finite-sample biases. Additionally, as the estimated behavior policy conditions on a longer history, we show a consistent decrease in variance. We extend these findings to a range of other OPE estimators, including the sequential IS estimator, the doubly robust estimator and the marginalized IS estimator, with the behavior policy estimated either parametrically or non-parametrically.
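For readers unfamiliar with the estimator being decomposed, the ordinary IS estimator weights each trajectory's return by the product of per-step target/behavior probability ratios; replacing the denominator with an estimated (possibly history-dependent) policy is what the paper analyzes. A minimal sketch with illustrative names (`pi_target`, `pi_behavior` are assumed to return action probabilities):

```python
import numpy as np

def ordinary_is(trajectories, pi_target, pi_behavior):
    """Ordinary importance sampling: weight each trajectory's total return
    by the product of per-step target/behavior probability ratios."""
    estimates = []
    for traj in trajectories:  # traj = [(state, action, reward), ...]
        weight, ret = 1.0, 0.0
        for state, action, reward in traj:
            weight *= pi_target(state, action) / pi_behavior(state, action)
            ret += reward
        estimates.append(weight * ret)
    return float(np.mean(estimates))

# Sanity check: with identical policies every weight is 1,
# so the estimate reduces to the mean observed return
trajs = [[(0, 0, 1.0), (1, 1, 2.0)], [(0, 1, 0.0)]]
uniform = lambda s, a: 0.5
print(ordinary_is(trajs, uniform, uniform))  # 1.5
```

The paper's finding concerns what to plug in for `pi_behavior`: an estimate conditioning on more history lowers the asymptotic variance of these weighted returns at the cost of finite-sample bias.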
Paperid:2745
Authors:Leif Döring · Benedikt Wille · Maximilian Birr · Mihail Bîrsan · Martin Slowik
Abstract:
Bias problems in the estimation of Q-values are a well-known obstacle that slows down convergence of Q-learning and actor-critic methods. Part of the success of modern RL algorithms stems from direct or indirect overestimation-reduction mechanisms. We introduce an easy-to-implement method built on top of distributional reinforcement learning (DRL) algorithms to deal with overestimation in a locally adaptive way. Our framework ADDQ is simple to implement; existing DRL implementations can be improved with a few lines of code. We provide theoretical backup and experimental results in tabular, Atari, and MuJoCo environments, comparisons with state-of-the-art methods, and a proof of convergence in the tabular case.
Paperid:2746
Authors:Mohammed Adnan · Rohan Jain · Ekansh Sharma · Rahul G. Krishnan · Yani Ioannou
Abstract:
The Lottery Ticket Hypothesis (LTH) suggests there exists a sparse LTH mask and weights that achieve the same generalization performance as the dense model while using significantly fewer parameters. However, finding an LTH solution is computationally expensive, and an LTH's sparsity mask does not generalize to other random weight initializations. Recent work has suggested that neural networks trained from random initialization find solutions within the same basin modulo permutation, and proposed a method to align trained models within the same loss basin. We hypothesize that misalignment of basins is the reason why LTH masks do not generalize to new random initializations, and propose permuting the LTH mask to align with the new optimization basin when performing sparse training from a different random initialization. We empirically show a significant increase in generalization when sparse training from random initialization with the permuted mask as compared to using the non-permuted LTH mask, on multiple datasets (CIFAR-10/100 & ImageNet) and models (VGG11 & ResNet20/50).
Paperid:2747
Authors:Xinsong Ma · Xin Zou · Weiwei Liu
Abstract:
Out-of-distribution (OOD) detection is significant in reliable and safety-critical applications. Existing approaches primarily focus on developing powerful score functions, but overlook the design of decision-making rules based on these score functions. In contrast to prior studies, we rethink the OOD detection task from the perspective of online multiple hypothesis testing. We then propose a novel generalized LOND (g-LOND) algorithm to solve the above problem. Theoretically, the g-LOND algorithm controls the false discovery rate (FDR) at a pre-specified level regardless of the dependence between the p-values. Furthermore, we prove that the false positive rate (FPR) of the g-LOND algorithm converges to zero in probability for the generalized Gaussian-like distribution family. Finally, extensive experimental results verify the effectiveness of the g-LOND algorithm for OOD detection.
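For context, the standard LOND procedure that g-LOND generalizes tests each incoming p-value at a level that grows with the number of discoveries made so far. A minimal sketch of plain LOND (not the paper's g-LOND variant), using a simple summable gamma sequence:

```python
def lond(p_values, alpha=0.05):
    """Plain LOND online multiple-testing procedure.

    At time t, test at level gamma_t * alpha * (discoveries so far + 1),
    where gamma_t = 1/(t*(t+1)) sums to 1 over t = 1, 2, ...
    Sketch for illustration; the paper's g-LOND generalizes this rule.
    """
    decisions = []
    d = 0  # number of discoveries (rejections) so far
    for t, p in enumerate(p_values, start=1):
        gamma_t = 1.0 / (t * (t + 1))
        level = gamma_t * alpha * (d + 1)
        reject = p <= level
        decisions.append(reject)
        d += reject
    return decisions

decs = lond([0.001, 0.9, 0.004, 0.5])
```

Each rejection raises the test levels for all later hypotheses, which is what lets the procedure control FDR while remaining fully online.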
Paperid:2748
Authors:Gaozheng Pei · Ke Ma · Yingfei Sun · Qianqian Xu · Qingming Huang
Abstract:
Diffusion-based adversarial purification methods attempt to drown adversarial perturbations in isotropic noise through the forward process, and then recover the clean images through the reverse process. Due to the lack of distribution information about adversarial perturbations in the pixel domain, it is often unavoidable to damage normal semantics. We turn to the frequency-domain perspective, decomposing the image into its amplitude spectrum and phase spectrum. We find that for both spectra, the damage caused by adversarial perturbations tends to increase monotonically with frequency. This means that we can extract the content and structural information of the original clean sample from the frequency components that are less damaged. Meanwhile, theoretical analysis indicates that existing purification methods indiscriminately damage all frequency components, leading to excessive damage to the image. Therefore, we propose a purification method that can eliminate adversarial perturbations while maximizing the preservation of the content and structure of the original image. Specifically, at each time step during the reverse process, for the amplitude spectrum, we replace the low-frequency components of the estimated image's amplitude spectrum with the corresponding parts of the adversarial image. For the phase spectrum, we project the phase of the estimated image into a designated range of the adversarial image's phase spectrum, focusing on the low frequencies. Empirical evidence from extensive experiments demonstrates that our method significantly outperforms most current defense methods. Code: \url{https://anonymous.4open.science/r/FreqPure}.
Paperid:2749
Authors:Ali Ebrahimpour-Boroojeny · Hari Sundaram · Varun Chandrasekaran
Abstract:
Machine unlearning, where users can request the deletion of a forget dataset, is becoming increasingly important because of numerous privacy regulations. Initial works on "exact'' unlearning (e.g., retraining) incur large computational overheads. However, while computationally inexpensive, "approximate'' methods have fallen short of reaching the effectiveness of exact unlearning: models produced fail to obtain comparable accuracy and prediction confidence on both the forget and test (i.e., unseen) datasets. Exploiting this observation, we propose a new unlearning method, Adversarial Machine UNlearning (AMUN), that outperforms prior state-of-the-art (SOTA) methods for image classification. AMUN lowers the confidence of the model on the forget samples by fine-tuning the model on their corresponding adversarial examples. Adversarial examples naturally belong to the distribution imposed by the model on the input space; fine-tuning the model on the adversarial examples closest to the corresponding forget samples (a) localizes the changes to the decision boundary of the model around each forget sample and (b) avoids drastic changes to the global behavior of the model, thereby preserving the model's accuracy on test samples. Using AMUN for unlearning a random 10% of CIFAR-10 samples, we observe that even SOTA membership inference attacks cannot do better than random guessing.
Paperid:2750
Authors:Wenjun Zhang · Liangxiao Jiang · Chaoqun Li
Abstract:
Label completion serves as a preprocessing approach to handling the sparse crowdsourced label matrix problem, significantly boosting the effectiveness of the downstream label aggregation. In recent advances, worker modeling has been proved to be a powerful method to further improve the performance of label completion. However, in real-world scenarios, workers typically annotate only a few instances, leading to insufficient worker modeling and thus limiting the improvement of label completion. To address this issue, we propose a novel transfer learning-based label completion (TLLC) algorithm. Specifically, we first identify all high-confidence instances from the whole crowdsourced data as a source domain and use it to pretrain a Siamese network. The abundant annotated instances in the source domain provide essential knowledge for worker modeling. Then, we transfer the pretrained network to the target domain with the instances annotated by each worker separately, ensuring worker modeling captures unique characteristics of each worker. Finally, we leverage the new embeddings learned by the transferred network to complete each worker's missing labels. Extensive experiments on different real-world datasets demonstrate the effectiveness of TLLC.
Paperid:2751
Authors:Cong Hua · Qianqian Xu · Zhiyong Yang · Zitai Wang · Shilong Bao · Qingming Huang
Abstract:
Prompt tuning adapts Vision-Language Models like CLIP to open-world tasks with minimal training costs. In this direction, one typical paradigm evaluates model performance **separately** on known classes (*i.e.*, base domain) and unseen classes (*i.e.*, new domain). However, real-world scenarios require models to handle inputs **without prior domain knowledge**. This practical challenge has spurred the development of **open-world prompt tuning**, which demands a unified evaluation of two stages: 1) detecting whether an input belongs to the base or new domain (**P1**), and 2) classifying the sample into its correct class (**P2**). What's more, as domain distributions are generally unknown, a proper metric should be insensitive to varying base/new sample ratios (**P3**). However, we find that current metrics, including HM, overall accuracy, and AUROC, fail to satisfy these three properties simultaneously. To bridge this gap, we propose $\mathsf{OpenworldAUC}$, a unified metric that jointly assesses detection and classification through pairwise instance comparisons. To optimize $\mathsf{OpenworldAUC}$ effectively, we introduce **Gated Mixture-of-Prompts (GMoP)**, which employs domain-specific prompts and a gating mechanism to dynamically balance detection and classification. Theoretical guarantees ensure generalization of GMoP under practical conditions. Experiments on 15 benchmarks in open-world scenarios show GMoP achieves SOTA performance on $\mathsf{OpenworldAUC}$ and other metrics.
Paperid:2752
Authors:Charlotte Peale · Vinod Raman · Omer Reingold
Abstract:
We introduce "representative generation," extending the theoretical framework for generation proposed by Kleinberg et al. (2024) and formalized by Li et al. (2024), to additionally address diversity and bias concerns in generative models. Our notion requires outputs of a generative model to proportionally represent groups of interest from the training data. We characterize representative uniform and non-uniform generation, introducing the ``group closure dimension'' as a key combinatorial quantity. For representative generation in the limit, we analyze both information-theoretic and computational aspects, demonstrating feasibility for countably infinite hypothesis classes and collections of groups under certain conditions, but proving a negative result for computability using only membership queries. This contrasts with Kleinberg et al.'s (2024) positive results for standard generation in the limit. Our findings provide a rigorous foundation for developing more diverse and representative generative models.
Paperid:2753
Authors:Jing Xiong · Jianghan Shen · Chuanyang Zheng · Zhongwei Wan · Chenyang Zhao · Chiwun Yang · Fanghua Ye · Hongxia Yang · Lingpeng Kong · Ngai Wong
Abstract:
Efficiently handling long contexts is crucial for large language models (LLMs). While rotary position embeddings (RoPEs) enhance length generalization, effective length extrapolation remains challenging and often requires costly fine-tuning. In contrast, recent training-free approaches suffer from the \textit{attention sink} phenomenon, leading to severe performance degradation. In this paper, we introduce ParallelComp, a novel training-free method for long-context extrapolation that extends LLMs' context length from 4K to 128K while maintaining high throughput and preserving perplexity, and integrates seamlessly with Flash Attention. Our analysis offers new insights into attention biases in parallel attention mechanisms and provides practical solutions to tackle these challenges. To mitigate the attention sink issue, we propose an attention calibration strategy that reduces biases, ensuring more stable long-range attention. Additionally, we introduce a chunk eviction strategy to efficiently manage ultra-long contexts on a single A100 80GB GPU. To further enhance efficiency, we propose a parallel KV cache eviction technique, which improves chunk throughput by 1.76×, thereby achieving a 23.50× acceleration in the prefilling stage with negligible performance loss due to attention calibration. Furthermore, ParallelComp achieves 91.17\% of GPT-4's performance on long-context tasks using an 8B model trained on 8K-length context, outperforming closed-source models such as Claude-2 and Kimi-Chat.
Paperid:2754
Authors:Meng-Hao Guo · Jiajun Xu · Yi Zhang · Jiaxi Song · Haoyang Peng · Yi-Xuan Deng · Xinzhi Dong · Kiyohiro Nakayama · Zhengyang Geng · Chen Wang · Bolin Ni · Guo-Wei Yang · Yongming Rao · Houwen Peng · Han Hu · Gordon Wetzstein · Shi-min Hu
Abstract:
Reasoning stands as a cornerstone of intelligence, enabling the synthesis of existing knowledge to solve complex problems. Despite remarkable progress, existing reasoning benchmarks often fail to rigorously evaluate the nuanced reasoning capabilities required for complex, real-world problem-solving, particularly in multi-disciplinary and multimodal contexts. In this paper, we introduce a graduate-level, multi-disciplinary, English-Chinese benchmark, dubbed Reasoning Bench (R-Bench), for assessing the reasoning capability of both language and multimodal models. R-Bench spans 1,094 questions across 108 subjects for language model evaluation and 665 questions across 83 subjects for multimodal model testing in both English and Chinese. These questions are meticulously curated to ensure rigorous difficulty calibration, subject balance, and cross-linguistic alignment, making R-Bench an olympiad-level, multi-disciplinary benchmark. We evaluate widely used models, including OpenAI o1, GPT-4o, DeepSeek-R1, etc. Experimental results indicate that advanced models perform poorly on complex reasoning, especially multimodal reasoning. Even the top-performing model OpenAI o1 achieves only 53.2% accuracy on our multimodal evaluation. Data and code will be made publicly available.
Paperid:2755
Authors:Zijian Liu · Zhengyuan Zhou
Abstract:
We study the convergence of the shuffling gradient method, a popular algorithm employed to minimize the finite-sum function with regularization, in which functions are passed to apply (Proximal) Gradient Descent (GD) one by one in an order determined by a permutation on the indices of the functions. Despite its easy implementation and effective performance in practice, its theoretical understanding remains limited. A recent advance by (Liu & Zhou, 2024b) establishes the first last-iterate convergence results under various settings, especially proving the optimal rates for smooth (strongly) convex optimization. However, their bounds for nonsmooth (strongly) convex functions are only as fast as Proximal GD. In this work, we provide the first improved last-iterate analysis for the nonsmooth case, demonstrating that the widely used Random Reshuffle ($\textsf{RR}$) and Single Shuffle ($\textsf{SS}$) strategies are both provably faster than Proximal GD, reflecting the benefit of randomness. As an important implication, we give the first (nearly) optimal convergence result for the suffix average under the $\textsf{RR}$ sampling scheme in the general convex case, matching the lower bound shown by (Koren et al., 2022).
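The Random Reshuffle scheme analyzed here is simple to state: each epoch draws a fresh permutation of the component functions and takes one gradient step per component in that order (Single Shuffle would reuse one fixed permutation). A minimal sketch on a toy one-dimensional finite sum, with illustrative names:

```python
import random

def random_reshuffle_gd(grads, x0, lr, epochs, seed=0):
    """Random Reshuffle (RR) shuffling gradient method, smooth toy case.

    grads is a list of per-component gradient functions. Each epoch shuffles
    the component order anew, then applies one step per component.
    Sketch only; the paper's setting also covers nonsmooth regularizers
    via proximal steps.
    """
    rng = random.Random(seed)
    x = x0
    n = len(grads)
    for _ in range(epochs):
        order = list(range(n))
        rng.shuffle(order)  # fresh permutation each epoch (RR, not SS)
        for i in order:
            x -= lr * grads[i](x)
    return x

# f_i(x) = (x - c_i)^2 / 2, so the finite-sum minimizer is the mean of c_i.
centers = [1.0, 2.0, 6.0]
grads = [lambda x, c=c: x - c for c in centers]
x_final = random_reshuffle_gd(grads, x0=0.0, lr=0.05, epochs=200)
```

With a constant stepsize the last iterate settles in a small neighborhood of the minimizer (here, the mean 3.0); the paper's contribution is quantifying how fast such last iterates converge in the nonsmooth case.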
Paperid:2756
Authors:Katharina Prasse · Patrick Knab · Sascha Marton · Christian Bartelt · Margret Keuper
Abstract:
Concept Bottleneck Models (CBMs) enhance the interpretability of neural networks by basing predictions on human-understandable concepts. However, current CBMs typically rely on concept sets extracted from large language models or extensive image corpora, limiting their effectiveness in data-sparse scenarios. We propose Data-efficient CBMs (DCBMs), which reduce the need for large sample sizes during concept generation while preserving interpretability. DCBMs define concepts as image regions detected by segmentation or detection foundation models, allowing each image to generate multiple concepts across different granularities. This removes reliance on textual descriptions and large-scale pre-training, making DCBMs applicable for fine-grained classification and out-of-distribution tasks. Attribution analysis using Grad-CAM demonstrates that DCBMs deliver visual concepts that can be localized in test images. By leveraging dataset-specific concepts instead of predefined ones, DCBMs enhance adaptability to new domains.
Paperid:2757
Authors:Zitong Shi · Guancheng Wan · Guibin Zhang · Wenke Huang · He Li · Carl Yang · Mang Ye
Abstract:
Federated Graph Learning (FGL) has gained significant attention as a privacy-preserving approach to collaborative learning, but the computational demands increase substantially as datasets grow and Graph Neural Network (GNN) layers deepen. To address these challenges, we propose $\textbf{EAGLES}$, a unified sparsification framework. EAGLES applies client-consensus parameter sparsification to generate multiple unbiased subnetworks at varying sparsity levels, reducing the need for iterative adjustments and mitigating performance degradation. In the graph structure domain, we introduce a dual-expert approach: a $\textit{graph sparsification expert}$ uses multi-criteria node-level sparsification, and a $\textit{graph synergy expert}$ integrates contextual node information to produce optimal sparse subgraphs. Furthermore, the framework introduces a novel distance metric that leverages node contextual information to measure structural similarity among clients, fostering effective knowledge sharing. We also introduce the $\textbf{Harmony Sparsification Principle}$, through which EAGLES balances model performance with lightweight graph and model structures. Extensive experiments demonstrate its superiority, achieving competitive performance on various datasets, such as reducing training FLOPS by 82\% $\downarrow$ and communication costs by 80\% $\downarrow$ on the ogbn-proteins dataset, while maintaining high performance.
Paperid:2758
Authors:Rahul Sharma · Sumantrak Mukherjee · Andrea Šipka · Eyke Hüllermeier · Sebastian Vollmer · Sergey Redyuk · David A Selby
Abstract:
Explainable AI (XAI) and interpretable machine learning methods help to build trust in model predictions and derived insights, yet also present a perverse incentive for analysts to manipulate XAI metrics to support pre-specified conclusions. This paper introduces the concept of X-hacking, a form of p-hacking applied to XAI metrics such as SHAP values. We show how easily an automated machine learning pipeline can be adapted to exploit model multiplicity at scale: searching a set of 'defensible' models with similar predictive performance to find a desired explanation. We formulate the trade-off between explanation and accuracy as a multi-objective optimisation problem, and illustrate empirically on familiar real-world datasets that, on average, Bayesian optimisation accelerates X-hacking 3-fold for features susceptible to it, versus random sampling. We show that the vulnerability of a dataset to X-hacking can be determined by information redundancy among features. Finally, we suggest possible methods for detection and prevention, and discuss ethical implications for the credibility and reproducibility of XAI.
Paperid:2759
Authors:Nicolas Lell · Ansgar Scherp
Abstract:
Shallow node embeddings like node2vec (N2V) can be used for nodes without features or to supplement existing features with structure-based information. Embedding methods like N2V are limited in their application on new nodes or graphs because their approach to embed nodes is based on random walks. This limits them to the transductive setting where the entire graph, including the test nodes, needs to be available during training. We propose inductive node2vec (iN2V), which combines a post-hoc procedure to compute embeddings for nodes unseen during training and modifications to the original N2V training procedure to prepare the embeddings for this post-hoc procedure. We conduct experiments on several benchmark datasets. Our results demonstrate that iN2V is an effective approach to bringing otherwise transductive embeddings to an inductive setting. Using these embeddings, we improve node classification by up to 5 points, depending on the dataset and the number of unseen nodes. Our iN2V is a plug-in approach to creating new or enriching existing embeddings. It can also be combined with other embedding methods, making it a versatile approach for inductive node representation learning. Code to reproduce the results is available on GitHub (supplementary material during review).
Paperid:2760
Authors:Chao Xu · Xijia Tang · Chenping Hou
Abstract:
In open-environment applications, data are often collected from heterogeneous modalities with distinct encodings, resulting in feature space heterogeneity. This heterogeneity inherently induces label shift, making cross-modal knowledge transfer particularly challenging when the source and target data exhibit simultaneous heterogeneous feature spaces and shifted label distributions. Existing studies address only partial aspects of this issue, leaving the broader problem unresolved. To bridge this gap, we introduce a new concept of Heterogeneous Label Shift (HLS), targeting this critical but underexplored challenge. We first analyze the impact of heterogeneous feature spaces and label distribution shifts on model generalization and introduce a novel error decomposition theorem. Based on these insights, we propose a bound minimization HLS framework that decouples and tackles feature heterogeneity and label shift accordingly. Extensive experiments on various benchmarks for cross-modal classification validate the effectiveness and practical relevance of the proposed approach.
Paperid:2761
Authors:Minh Vu · Geigh Zollicoffer · Huy Mai · Ben Nebgen · Boian S Alexandrov · Manish Bhattarai
Abstract:
Multimodal Machine Learning systems, particularly those aligning text and image data like CLIP/BLIP models, have become increasingly prevalent, yet remain susceptible to adversarial attacks. While substantial research has addressed adversarial robustness in unimodal contexts, defense strategies for multimodal systems are underexplored. This work investigates the topological signatures that arise between image and text embeddings and shows how adversarial attacks disrupt their alignment, introducing distinctive signatures. We specifically leverage persistent homology and introduce two novel Topological-Contrastive losses based on Total Persistence and Multi-scale kernel methods to analyze the topological signatures introduced by adversarial perturbations. We observe a pattern of monotonic changes in the proposed topological losses emerging in a wide range of attacks on image-text alignments, as more adversarial samples are introduced in the data. By designing an algorithm to back-propagate these signatures to input samples, we are able to integrate these signatures into Maximum Mean Discrepancy tests, creating a novel class of tests that leverage topological signatures for better adversarial detection.
Paperid:2762
Authors:Yujin Oh · Pengfei Jin · Sangjoon Park · Sekeun Kim · Siyeop yoon · Jin Kim · Kyungsang Kim · Xiang Li · Quanzheng Li
Abstract:
Ensuring fairness in medical image segmentation is critical due to biases in imbalanced clinical data acquisition caused by demographic attributes (e.g., age, sex, race) and clinical factors (e.g., disease severity). To address these challenges, we introduce Distribution-aware Mixture of Experts (dMoE), inspired by optimal control theory. We provide a comprehensive analysis of its underlying mechanisms and clarify dMoE's role in adapting to heterogeneous distributions in medical image segmentation. Furthermore, we integrate dMoE into multiple network architectures, demonstrating its broad applicability across diverse medical image analysis tasks. By incorporating demographic and clinical factors, dMoE achieves state-of-the-art performance on two 2D benchmark datasets and a 3D in-house dataset. Our results highlight the effectiveness of dMoE in mitigating biases from imbalanced distributions, offering a promising approach to bridging control theory and medical image segmentation within fairness learning paradigms. The source code will be made available.
Paperid:2763
Authors:Valentin De Bortoli · Alexandre Galashov · Arthur Gretton · Arnaud Doucet
Abstract:
Speculative sampling is a popular technique for accelerating inference in Large Language Models by generating candidate tokens using a fast draft model and accepting or rejecting them based on the target model's distribution. While speculative sampling was previously limited to discrete sequences, we extend it to diffusion models, which generate samples via continuous, vector-valued Markov chains. In this context, the target model is a high-quality but computationally expensive diffusion model. We propose various drafting strategies, including a simple and effective approach that does not require training a draft model and is applicable out of the box to any diffusion model. Our experiments demonstrate significant generation speedup on various diffusion models, halving the number of function evaluations, while generating exact samples from the target model.
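The discrete accept/reject rule that this work extends to diffusion chains is short: accept a draft token with probability min(1, p_target/p_draft), and on rejection resample from the normalized positive residual of the two distributions, which makes the output distribution exactly the target's. A toy sketch over a three-token vocabulary:

```python
import random

def speculative_accept(draft_probs, target_probs, token, rng):
    """One step of discrete speculative sampling (accept/reject + residual).

    Toy sketch of the classical discrete rule; the paper's contribution is
    extending this idea to continuous, vector-valued diffusion chains.
    """
    q, p = draft_probs[token], target_probs[token]
    if rng.random() < min(1.0, p / q):
        return token  # accept the draft token
    # Rejected: resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pt - qt) for pt, qt in zip(target_probs, draft_probs)]
    r = rng.random() * sum(residual)
    acc = 0.0
    for i, w in enumerate(residual):
        acc += w
        if r <= acc:
            return i
    return len(residual) - 1

# Exactness check: outputs should be distributed as the *target*, not the draft.
rng = random.Random(0)
draft, target = [0.7, 0.2, 0.1], [0.3, 0.3, 0.4]
n = 50000
counts = [0, 0, 0]
for _ in range(n):
    tok = rng.choices(range(3), weights=draft)[0]  # draft proposes a token
    counts[speculative_accept(draft, target, tok, rng)] += 1
freqs = [c / n for c in counts]
```

The empirical frequencies match the target distribution even though every proposal came from the draft, which is the "exact samples from the target model" guarantee the abstract refers to.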
Paperid:2764
Authors:M Ganesh Kumar · Blake Bordelon · Jacob A Zavatone-Veth · Cengiz Pehlevan
Abstract:
When rodents learn to navigate in a novel environment, a high density of place fields emerges at reward locations, fields elongate against the trajectory, and individual fields change spatial selectivity while demonstrating stable behavior. Why place fields demonstrate these characteristic phenomena during learning remains elusive. We develop a normative framework using a reward maximization objective, whereby the temporal difference (TD) error drives place field reorganization to improve policy learning. Place fields are modeled using Gaussian radial basis functions to represent states in an environment, and directly synapse to an actor-critic for policy learning. Each field's amplitude, center, and width, as well as downstream weights, are updated online at each time step to maximize cumulative reward. We demonstrate that this framework unifies three disparate phenomena observed in navigation experiments. Furthermore, we show that these place field phenomena improve policy convergence when learning to navigate to a single target and relearning multiple new targets. To conclude, we develop a normative model that recapitulates several aspects of hippocampal place field learning dynamics and unifies mechanisms to offer testable predictions for future experiments.
Paperid:2765
Authors:Zehong Wang · Zheyuan Zhang · Tianyi MA · Nitesh Chawla · Chuxu Zhang · Yanfang Ye
Abstract:
Foundation models aim to create general, cross-task, and cross-domain machine learning models by pretraining on large-scale datasets to capture shared patterns or concepts, such as contours, colors, textures, and edges in images, or tokens, words, and sentences in text. However, identifying generalities across graph-structured data remains a significant challenge, as different graph-based tasks necessitate distinct inductive biases, thereby impeding the development of graph foundation models. To address this challenge, we introduce a novel approach for learning cross-task generalities in graphs. Specifically, we propose task-trees as basic learning instances to align task spaces (node, link, graph) on graphs. Then, we conduct a theoretical analysis to examine their stability, transferability, and generalization. Our findings indicate that when a graph neural network (GNN) is pretrained on diverse task-trees using a reconstruction objective, it acquires transferable knowledge, enabling effective adaptation to downstream tasks with an appropriate set of fine-tuning samples. To empirically validate this approach, we develop a pretrained graph model based on task-trees, termed Graph Generality Identifier on Task-Trees (GIT). Extensive experiments demonstrate that a single pretrained GIT model can be effectively adapted to over 30 different graphs across five domains via fine-tuning, in-context learning, or zero-shot learning.
Paperid:2766
Authors:Jingruo Sun · Wenzhi Gao · Ellen Vitercik · Yinyu Ye
Abstract:
Online linear programming (OLP) has found broad applications in revenue management and resource allocation. State-of-the-art OLP algorithms achieve low regret by repeatedly solving linear programming (LP) subproblems that incorporate updated resource information. However, LP-based methods are computationally expensive and often inefficient for large-scale applications. By contrast, recent first-order OLP algorithms are more computationally efficient but typically suffer from worse regret guarantees. To address these shortcomings, we propose a new algorithm that combines the strengths of LP-based and first-order OLP methods. The algorithm re-solves the LP subproblems periodically at a predefined frequency $f$ and uses the latest dual prices to guide online decision-making. In parallel, a first-order method runs during each interval between LP re-solves and smooths resource consumption. Our algorithm achieves $\mathcal{O}(\log (T/f) + \sqrt{f})$ regret and delivers a "wait-less" online decision-making process that balances computational efficiency and superior regret. Extensive experiments demonstrate at least $20$-fold improvements in regret over pure first-order methods and $100$-fold improvements in runtime over pure LP-based methods.
Paperid:2767
Authors:Xintao Wang · Heng Wang · Yifei Zhang · Rui Xu · Jen-Tse Huang · Xinfeng Yuan · Haoran Guo · Siyu Yuan · Jiangjie Chen · Wei Wang · Yanghua Xiao · Shuchang Zhou
Abstract:
Role-playing language agents (RPLAs) have emerged as promising applications of large language models (LLMs). However, simulating established characters presents a challenging task for RPLAs, due to the lack of authentic character datasets and nuanced evaluation methods using such data. In this paper, we present CoSER, a collection of a high-quality dataset, open models, and an evaluation protocol towards effective RPLAs of established characters. The CoSER dataset covers 17,966 characters from 771 renowned books. It provides authentic dialogues with real-world intricacies, as well as diverse data types such as character experiences and internal thoughts. Drawing from acting methodology, we introduce given-circumstance acting for training and evaluating role-playing LLMs, where LLMs sequentially portray multiple characters in book scenes. Using our dataset, we develop CoSER 8B and CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models. Extensive experiments demonstrate the value of the CoSER dataset for RPLA training, evaluation and retrieval. Moreover, CoSER 70B exhibits state-of-the-art performance surpassing or matching GPT-4o on our evaluation and three existing benchmarks, i.e., achieving 75.80% and 93.47% accuracy on the InCharacter and LifeChoice benchmarks respectively. We release our code, dataset and models for future studies.
Paperid:2768
Authors:Yuchen Zhu · Daniel Augusto de Souza · Zhengyan Shi · Mengyue Yang · Pasquale Minervini · Alexander D'Amour · Matt Kusner
Abstract:
We address the problem of reward hacking, where maximising a proxy reward does not necessarily increase the true reward. This is a key concern for Large Language Models (LLMs), as they are often fine-tuned on human preferences that may not accurately reflect a true objective. Existing work uses various tricks such as regularisation, tweaks to the reward model, and reward hacking detectors, to limit the influence that such proxy preferences have on a model. Luckily, in many contexts such as medicine, education, and law, a small amount of expert data is often available. In these cases, it is often unclear whether the addition of proxy data can improve policy learning. We outline a set of sufficient conditions on proxy feedback that, if satisfied, indicate that proxy data can provably improve the sample complexity of learning the ground truth policy. These conditions can inform the data collection process for specific tasks. The result implies a parameterisation for LLMs that achieves this improved sample complexity. We detail how one can adapt existing architectures to yield this improved sample complexity.
Paperid:2769
Authors:Zhenni Bi · Kai Han · Chuanjian Liu · Yehui Tang · Yunhe Wang
Abstract:
Large Language Models (LLMs) have demonstrated remarkable abilities across various language tasks, but solving complex reasoning problems remains a significant challenge. While existing methods, such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT), enhance reasoning by decomposing problems or structuring prompts, they typically perform a single pass of reasoning and may fail to revisit flawed paths, compromising accuracy. To address this limitation, we propose a novel reasoning framework called Forest-of-Thought (FoT), which integrates multiple reasoning trees to leverage collective decision-making for solving complex logical problems. FoT employs sparse activation strategies to select the most relevant reasoning paths, improving both efficiency and accuracy. Additionally, we introduce a dynamic self-correction strategy that enables real-time error correction, along with consensus-guided decision-making strategies to optimize both correctness and computational resources. Experimental results demonstrate that the FoT framework, combined with these strategies, significantly enhances the reasoning capabilities of LLMs, enabling them to solve complex tasks with greater precision and efficiency.
Paperid:2770
Authors:Hanmo Liu · Shimin Di · Haoyang LI · Xun Jian · Yue Wang · Lei Chen
Abstract:
Node classification is a key task in temporal graph learning (TGL). Real-life temporal graphs often introduce new node classes over time, but existing TGL methods assume a fixed set of classes. This assumption brings limitations, as updating models with full data is costly, while focusing only on new classes results in forgetting old ones. Graph continual learning (GCL) methods mitigate forgetting using old-class subsets but fail to account for their evolution. We define this novel problem as temporal graph continual learning (TGCL), which focuses on efficiently maintaining up-to-date knowledge of old classes. To tackle TGCL, we propose Learning Towards the Future (LTF), a selective learning framework that substitutes the old-class data with its subsets. We derive an upper bound on the error caused by such replacement and transform it into objectives for selecting and learning subsets that minimize classification error while preserving the distribution of the full old-class data. Experiments on three real-world datasets show that LTF effectively addresses the TGCL challenge.
Paperid:2771
Authors:Kento Nishi · Rahul Ramesh · Maya Okawa · Mikail Khona · Hidenori Tanaka · Ekdeep Singh Lubana
Abstract:
Knowledge Editing (KE) algorithms alter models' weights to perform targeted updates to incorrect, outdated, or otherwise unwanted factual associations. However, recent work has shown that applying KE can adversely affect models' broader factual recall accuracy and diminish their reasoning abilities. Although these studies give insights into the potential harms of KE algorithms, e.g., performance evaluations on benchmarks, little is understood about why such destructive failures occur. Motivated by this, we define a novel synthetic task in which a Transformer is trained from scratch to internalize a "structured" knowledge graph. The structure enforces relationships between entities of the graph, such that editing a factual association has "trickling effects" on other entities (e.g., altering X's parent is Y to Z affects who X's siblings' parent is). Through evaluations of edited models on this task, we show that KE inadvertently affects representations of entities beyond the targeted one, distorting relevant structures that allow a model to infer unseen knowledge about an entity. We call this phenomenon representation shattering and demonstrate that it degrades models' factual recall and reasoning performance. We further corroborate our findings in naturalistic settings with pretrained Llama and Mamba models as well. Overall, our work yields a precise mechanistic hypothesis to explain why KE has adverse effects on model abilities.
Paperid:2772
Authors:David Hayden · Mao Ye · Timur Garipov · Gregory Meyer · Carl Vondrick · Zhao Chen · Yuning Chai · Eric M. Wolff · Siddhartha Srinivasa
Abstract:
It is difficult to anticipate the myriad challenges that a predictive model will encounter once deployed. Common practice entails a \emph{reactive}, cyclical approach: model deployment, data mining, and retraining. We instead develop a proactive long-tail discovery process by imagining additional data during training. In particular, we develop general model-based long-tail signals, including a differentiable, single forward pass formulation of epistemic uncertainty that does not impact model parameters or predictive performance but can flag rare or hard inputs. We leverage these signals as guidance to generate additional training data from a latent diffusion model in a process we call Longtail Guidance (LTG). Crucially, we can perform LTG without retraining the diffusion model or the predictive model, and we do not need to expose the predictive model to intermediate diffusion states. Data generated by LTG exhibit semantically meaningful variation, yield significant generalization improvements on image classification benchmarks, and can be analyzed to proactively discover, explain, and address conceptual gaps in a predictive model.
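One common single-forward-pass uncertainty signal of the kind described, though not necessarily the paper's exact formulation, is the negative log-sum-exp "energy" of the logits: it is differentiable, adds no parameters, and flags low-confidence (potentially long-tail) inputs.

```python
import numpy as np

def energy_uncertainty(logits, T=1.0):
    """Negative log-sum-exp 'energy' of the logits: an illustrative
    differentiable, single-forward-pass rarity signal. Higher energy
    suggests a rarer or harder input (an assumption of this sketch,
    not a claim about the paper's exact signal)."""
    logits = np.asarray(logits, dtype=float)
    return -T * np.log(np.sum(np.exp(logits / T), axis=-1))

confident = energy_uncertainty([10.0, 0.0, 0.0])   # peaked logits -> low energy
uncertain = energy_uncertainty([1.0, 1.0, 1.0])    # flat logits -> high energy
print(confident, uncertain)
```

Inputs whose energy is high could then be used to guide a generator toward producing more data like them, in the spirit of LTG.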
Paperid:2773
Authors:Haoyu Wang · Zeyu Qin · Li Shen · Xueqian Wang · Minhao Cheng · Dacheng Tao
Abstract:
Training safe LLMs is one of the most critical research challenges. However, the commonly used method, Refusal Training (RT), struggles to generalize against various OOD jailbreaking attacks. Many safety training methods have been proposed to address this issue. While they offer valuable insights, we aim to complement this line of research by investigating whether OOD attacks truly exceed the capability of the RT model. Evaluating with Best-of-N (BoN) sampling, we observe significant improvements in generalization as N increases. This underscores that the model possesses sufficient safety-related latent knowledge, but RT fails to consistently elicit this knowledge when addressing OOD attacks. Further analysis based on domain adaptation reveals that training with direct refusal causes the model to rely on superficial shortcuts, resulting in the learning of non-robust representation mappings. Based on our findings, we propose training the model to perform safety reasoning for each query. Reasoning supervision encourages the model to perform more computation, explicitly eliciting and using latent knowledge through reasoning. To achieve this, we synthesize reasoning supervision based on pre-guidelines, training the model to reason in alignment with them, thereby effectively eliciting and utilizing latent knowledge from diverse perspectives. Extensive experiments show that our method significantly improves generalization performance against OOD attacks.
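Best-of-N (BoN) evaluation, as used above to probe latent safety knowledge, can be sketched as follows; the sampler and safety scorer are hypothetical stand-ins for a base model and a safety reward model.

```python
from itertools import cycle

def best_of_n(prompt, sample, score, n=8):
    """Draw n candidate responses and keep the one the scorer ranks highest.
    As n grows, rarely elicited (latent) safe behavior surfaces more often."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Hypothetical stand-ins: the sampler sometimes refuses, and the
# safety scorer prefers the refusal.
pool = cycle(["comply (unsafe)", "refuse politely (safe)"])
sample = lambda prompt: next(pool)
score = lambda response: 1.0 if response.endswith("(safe)") else 0.0

print(best_of_n("adversarial prompt", sample, score, n=8))  # the safe refusal
```

The point of the diagnostic: if BoN with a modest N already recovers safe behavior, the knowledge is present and the failure is one of elicitation, not capability.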
Paperid:2774
Authors:Jonghyun Shin · Namjun Kim · Geonho Hwang · Sejun Park
Abstract:
The exact minimum width that allows for universal approximation of unbounded-depth networks is known only for ReLU and its variants. In this work, we study the minimum width of networks using general activation functions. Specifically, we focus on squashable functions that can approximate the identity function and binary step function by alternatively composing with affine transformations. We show that for networks using a squashable activation function to universally approximate $L^p$ functions from $[0,1]^{d_x}$ to $\mathbb R^{d_y}$, the minimum width is $\max\{d_x,d_y,2\}$ unless $d_x=d_y=1$; the same bound holds for $d_x=d_y=1$ if the activation function is monotone. We then provide sufficient conditions for squashability and show that all non-affine analytic functions and a class of piecewise functions are squashable, i.e., our minimum width result holds for those general classes of activation functions.
Paperid:2775
Authors:Xin Wang · Hanxiao Tao · Re-Bing Wu
Abstract:
Quantum machine learning models incorporating data re-uploading circuits have garnered significant attention due to their exceptional expressivity and trainability. However, their ability to generate accurate predictions on unseen data, referred to as the predictive performance, remains insufficiently investigated. This study reveals a fundamental limitation in predictive performance when deep encoding layers are employed within the data re-uploading model. Concretely, we theoretically demonstrate that when processing high-dimensional data with limited-qubit data re-uploading models, their predictive performance progressively degenerates to near random-guessing levels as the number of encoding layers increases. In this context, the repeated data uploading cannot mitigate the performance degradation. These findings are validated through experiments on both synthetic linearly separable datasets and real-world datasets. Our results demonstrate that when processing high-dimensional data, the quantum data re-uploading models should be designed with wider circuit architectures rather than deeper and narrower ones.
Paperid:2776
Authors:Changshuo Wang · Xiang Fang · Prayag Tiwari
Abstract:
Few-shot point cloud semantic segmentation effectively addresses data scarcity by identifying unlabeled query samples through semantic prototypes generated from a small set of labeled support samples. However, pre-training-based methods suffer from domain shifts and increased training time. Additionally, existing methods using DGCNN as the backbone have limited geometric structure modeling capabilities and struggle to bridge the categorical information gap between query and support sets. To address these challenges, we propose DyPolySeg, a pre-training-free Dynamic Polynomial fitting network for few-shot point cloud semantic segmentation. Specifically, we design a unified Dynamic Polynomial Convolution (DyPolyConv) that extracts flat and detailed features of local geometry through Low-order Convolution (LoConv) and Dynamic High-order Convolution (DyHoConv), complemented by a Mamba Block for capturing global context information. Furthermore, we propose a lightweight Prototype Completion Module (PCM) that reduces structural differences through self-enhancement and interactive enhancement between query and support sets. Experiments demonstrate that DyPolySeg achieves state-of-the-art performance on S3DIS and ScanNet datasets.
Paperid:2777
Authors:Idan Attias · Avrim Blum · Keziah Naggita · Donya Saless · Dravyansh Sharma · Matthew Walter
Abstract:
One of the most basic lower bounds in machine learning is that in nearly any nontrivial setting, it takes at least $1/\epsilon$ samples to learn to error $\epsilon$ (and more, if the classifier being learned is complex). However, suppose that data points are agents who have the ability to improve by a small amount if doing so will allow them to receive a (desired) positive classification. In that case, we may actually be able to achieve zero error by just being "close enough". For example, imagine a hiring test used to measure an agent's skill at some job such that for some threshold $\theta$, agents who score above $\theta$ will be successful and those who score below $\theta$ will not (i.e., learning a threshold on the line). Suppose also that by putting in effort, agents can improve their skill level by some small amount $r$. In that case, if we learn an approximation $\hat{\theta}$ of $\theta$ such that $\theta \leq \hat{\theta} \leq \theta + r$ and use it for hiring, we can actually achieve error zero, in the sense that (a) any agent classified as positive is truly qualified, and (b) any agent who truly is qualified can be classified as positive by putting in effort. Thus, the ability for agents to improve has the potential to allow for a goal one could not hope to achieve in standard models, namely zero error. In this paper, we explore this phenomenon more broadly, giving general results and examining under what conditions the ability of agents to improve can allow for a reduction in the sample complexity of learning, or alternatively, can make learning harder. We also examine both theoretically and empirically what kinds of improvement-aware algorithms can take into account agents who have the ability to improve to a limited extent when it is in their interest to do so.
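The threshold example admits a tiny worked check: any estimate in [theta, theta + r] gives zero error in both senses (a) and (b). The numbers below are made up for illustration.

```python
def classified_positive(score, theta_hat, r):
    """An agent improves by up to r exactly when doing so earns a positive label."""
    return score + r >= theta_hat

theta, r = 0.50, 0.10          # hypothetical true threshold and effort budget
theta_hat = 0.55               # any estimate in [theta, theta + r] works
scores = [i / 100 for i in range(101)]

# (a) Everyone classified positive is truly qualified after effort:
#     their improved score is at least max(score, theta_hat) >= theta.
assert all(max(s, theta_hat) >= theta
           for s in scores if classified_positive(s, theta_hat, r))

# (b) Everyone already at or above theta can earn a positive label,
#     since theta_hat <= theta + r <= score + r.
assert all(classified_positive(s, theta_hat, r) for s in scores if s >= theta)
print("zero error for every estimate in [theta, theta + r]")
```

Note the asymmetry with the classical setting: the learner never needs to locate theta exactly, only to land inside a window of width r, which is what breaks the $1/\epsilon$ barrier.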
Paperid:2778
Authors:Bhargav Ganguly · Yang Xu · Vaneet Aggarwal
Abstract:
This paper investigates the potential of quantum acceleration in addressing infinite horizon Markov Decision Processes (MDPs) to enhance average reward outcomes. We introduce an innovative quantum framework for the agent's engagement with an unknown MDP, extending the conventional interaction paradigm. Our approach involves the design of an optimism-driven tabular Reinforcement Learning algorithm that harnesses quantum signals acquired by the agent through efficient quantum mean estimation techniques. Through thorough theoretical analysis, we demonstrate that the quantum advantage in mean estimation leads to exponential advancements in regret guarantees for infinite horizon Reinforcement Learning. Specifically, the proposed Quantum algorithm achieves a regret bound of $\tilde{\mathcal{O}}(1)$\footnote{$\tilde{\mathcal{O}}(\cdot)$ conceals logarithmic terms of $T$.}, a significant improvement over the $\tilde{\mathcal{O}}(\sqrt{T})$ bound exhibited by classical counterparts, where $T$ is the length of the time horizon.
Paperid:2779
Authors:Lily Zhang · Hamid Dadkhahi · Mara Finkelstein · Firas Trabelsi · Jiaming Luo · Markus Freitag
Abstract:
Despite growing interest in incorporating feedback to improve language models, most efforts focus only on sequence-level annotations. In this work, we explore the potential of utilizing fine-grained span-level annotations from offline datasets to improve model quality. We develop a simple fine-tuning algorithm, called Training with Annotations (TWA), to directly train machine translation models on such annotated data. TWA utilizes targeted span-level error information while also flexibly learning what to penalize within a span. Moreover, TWA considers the overall trajectory of a sequence when deciding which non-error spans to utilize as positive signals. Experiments on English-German and Chinese-English machine translation show that TWA outperforms baselines such as supervised fine-tuning on sequences filtered for quality and Direct Preference Optimization on pairs constructed from the same data.
Paperid:2780
Authors:Qingyu Yin · Chak Tou Leong · Hongbo Zhang · Minjun Zhu · Hanqi Yan · Qiang Zhang · Yulan He · Wenjie Li · Jun Wang · Yue Zhang · Linyi Yang
Abstract:
The alignment of large language models (LLMs) with human preferences remains a key challenge. While post-training techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have achieved notable success, they often experience computational inefficiencies and training instability. In this paper, we propose \textbf{F}eature-level constrained \textbf{P}reference \textbf{O}ptimization (FPO), a novel method designed to simplify the alignment process while ensuring stability. FPO leverages pre-trained Sparse Autoencoders (SAEs) and introduces feature-level constraints, allowing for efficient, sparsity-enforced alignment. Our approach gains efficiency by using sparse features activated in a well-trained sparse autoencoder, and gains the quality of sequential KL divergence by using the feature-level offline reference. Experimental results on benchmark datasets demonstrate that FPO achieves an above 5\% absolute improvement in win rate with much lower computational cost compared to state-of-the-art baselines, making it a promising solution for efficient and controllable LLM alignment.
Paperid:2781
Authors:Zahra Babaiee · Peyman Kiasari · Daniela Rus · Radu Grosu
Abstract:
Recent advancements in multimodal large language models have driven breakthroughs in visual question answering. Yet, a critical gap persists: `conceptualization', the ability to recognize and reason about the same concept despite variations in visual form, a basic ability of human reasoning. To address this challenge, we introduce the Visual Graph Arena (VGA), a dataset featuring six graph-based tasks designed to evaluate and improve AI systems’ capacity for visual abstraction. VGA uses graphs rendered in diverse layouts (e.g., Kamada-Kawai vs. planar) to test reasoning independent of visual form. Experiments with state-of-the-art vision models (ViT, Swin Transformers, ConvNeXt) and multimodal LLMs (GPT-o1, Claude 3.5 Sonnet) reveal a striking divide: human participants achieved near-perfect accuracy (88–100\%) across tasks, while models failed outright on isomorphism detection and showed limited success in path/cycle tasks. We further identify behavioral anomalies suggesting pseudo-intelligent pattern matching rather than genuine conceptual understanding. These findings underscore fundamental limitations in current AI models for visual understanding. By isolating the challenge of representation-invariant reasoning, the Visual Graph Arena provides a framework to drive progress toward human-like conceptualization in AI visual models.
Paperid:2782
Authors:Zhen Liu · Yicheng Luo · Boyuan Li · Emadeldeen Eldele · Min Wu · Qianli Ma
Abstract:
Shapelets are discriminative subsequences (or shapes) with high interpretability in time series classification. Due to the time-intensive nature of shapelet discovery, existing shapelet-based methods mainly focus on selecting discriminative shapes while discarding others to achieve candidate subsequence sparsification. However, this approach may exclude beneficial shapes and overlook the varying contributions of shapelets to classification performance. To this end, we propose a Soft sparse Shapes (SoftShape) model for efficient time series classification. Our approach mainly introduces soft shape sparsification and soft shape learning blocks. The former transforms shapes into soft representations based on classification contribution scores, merging lower-scored ones into a single shape to retain and differentiate all subsequence information. The latter facilitates intra- and inter-shape temporal pattern learning, improving model efficiency by using sparsified soft shapes as inputs. Specifically, we employ a learnable router to activate a subset of class-specific expert networks for intra-shape pattern learning. Meanwhile, a shared expert network learns inter-shape patterns by converting sparsified soft shapes into sequences. Extensive experiments demonstrate that SoftShape outperforms the state-of-the-art methods and trains efficiently.
Paperid:2783
Authors:Luca Masserano · Abdul Fatir Ansari · Boran Han · Xiyuan Zhang · Christos Faloutsos · Michael Mahoney · Andrew Wilson · Youngsuk Park · Syama Sundar Yadav Rangapuram · Danielle Maddix · Yuyang Wang
Abstract:
How to best develop foundational models for time series forecasting remains an important open question. Tokenization is a crucial consideration in this effort: what is an effective discrete vocabulary for a real-valued sequential input? To address this question, we develop WaveToken, a wavelet-based tokenizer that allows models to learn complex representations directly in the space of time-localized frequencies. Our method first scales and decomposes the input time series, then thresholds and quantizes the wavelet coefficients, and finally pre-trains an autoregressive model to forecast coefficients for the forecast horizon. By decomposing coarse and fine structures in the inputs, wavelets provide an eloquent and compact language for time series forecasting that simplifies learning. Empirical results on a comprehensive benchmark, including 42 datasets for both in-domain and zero-shot settings, show that WaveToken: i) performs on par or better than recently proposed foundation models for forecasting while using a much smaller vocabulary (1024 tokens), and is competitive with modern deep learning models trained specifically on each dataset; ii) exhibits superior generalization capabilities, achieving the best average rank across all datasets for three complementary metrics; and iii) easily captures complex temporal patterns of practical relevance that are challenging for other recent pre-trained models, including trends, sparse spikes, and non-stationary time series with varying frequencies evolving over time.
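The scale/decompose/threshold/quantize pipeline can be sketched with a one-level Haar transform; the threshold, quantization grid, and vocabulary size below are illustrative stand-ins, not the paper's choices.

```python
import numpy as np

def haar_dwt(x):
    """One level of the Haar wavelet transform: coarse averages + details."""
    x = np.asarray(x, dtype=float)
    return (x[0::2] + x[1::2]) / np.sqrt(2), (x[0::2] - x[1::2]) / np.sqrt(2)

def wavelet_tokenize(series, threshold=0.1, n_tokens=1024):
    """Sketch in the spirit of WaveToken: scale the input, take a wavelet
    decomposition, zero out small coefficients, then quantize the rest
    to a finite vocabulary of token ids. Constants are illustrative."""
    x = np.asarray(series, dtype=float)
    x = x / (np.abs(x).max() + 1e-8)            # scale
    coarse, detail = haar_dwt(x)                # decompose
    coeffs = np.concatenate([coarse, detail])
    coeffs[np.abs(coeffs) < threshold] = 0.0    # threshold small coefficients
    bins = np.linspace(-2.0, 2.0, n_tokens - 1) # quantize to token ids
    return np.digitize(coeffs, bins)

tokens = wavelet_tokenize([1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0])
print(tokens)
```

On this toy series of constant pairs, all detail coefficients vanish after thresholding and map to a single repeated token, illustrating how the wavelet view compresses smooth structure into a small vocabulary.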
Paperid:2784
Authors:Zachary Roch · George Atia · Yue Wang
Abstract:
Robust reinforcement learning (RL) under the average reward criterion, which seeks to optimize long-term system performance in uncertain environments, remains a largely unexplored area. To address this challenge, we propose a reduction-based framework that transforms robust average reward optimization into the more extensively studied robust discounted reward optimization by employing a specific discount factor. Our framework provides two key advantages. Data Efficiency: we design a model-based reduction algorithm that achieves near-optimal sample complexity, enabling efficient identification of optimal robust policies. Scalability: by bypassing the inherent challenges of scaling up average reward optimization, our framework facilitates the design of scalable, convergent algorithms for robust average reward optimization leveraging function approximation. Our algorithmic design, supported by theoretical and empirical analyses, provides a concrete solution to robust average reward RL with the first data efficiency and scalability guarantees, highlighting the framework’s potential to optimize long-term performance under model uncertainty in practical problems.
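The reduction rests on the classical fact that for a discount factor gamma near 1, the discounted value satisfies (1 - gamma) V_gamma ≈ g, the long-run average reward. A small numerical illustration on a made-up two-state Markov reward process (not the paper's algorithm or its robust setting):

```python
import numpy as np

# Illustrative two-state Markov reward process.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])

# Discounted evaluation with gamma near 1: solve V = r + gamma P V.
gamma = 0.999
V = np.linalg.solve(np.eye(2) - gamma * P, r)
g_approx = (1 - gamma) * V            # reduction: (1 - gamma) V_gamma ~ g

# Ground-truth average reward via the stationary distribution of P.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()
g_true = float(pi @ r)                # equals 2/3 for this chain

print(g_true, g_approx)
```

The residual gap is of order (1 - gamma) times the bias function, which is why the framework's choice of a specific discount factor controls the approximation error of the reduction.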
Paperid:2785
Authors:Waïss Azizian · Franck Iutzeler · Jérôme Malick · Panayotis Mertikopoulos
Abstract:
In this paper, we examine the time it takes for stochastic gradient descent (SGD) to reach the global minimum of a general, nonconvex loss function. We approach this question through the lens of large deviations theory and randomly perturbed dynamical systems, and we provide a tight characterization of the associated hitting times of SGD with matching upper and lower bounds. Our analysis reveals that the global convergence time of SGD is dominated by the most "costly" set of obstacles that the algorithm may need to overcome in order to reach a global minimizer, coupling in this way the geometry of the underlying loss landscape with the statistics of the noise entering the process. Finally, motivated by applications to the training of deep neural networks, we provide a series of refinements and extensions of our analysis to, among others, loss functions with no spurious local minima or ones with bounded depths.
Paperid:2786
Authors:Filipp Zmushko · Aleksandr Beznosikov · Martin Takac · Samuel Horváth
Abstract:
With the increase in the number of parameters in large language models, the training process increasingly demands larger volumes of GPU memory. A significant portion of this memory is typically consumed by the optimizer state. To overcome this challenge, recent approaches such as low-rank adaptation (LoRA), low-rank gradient projection (GaLore), and blockwise optimization (BAdam) have been proposed. However, in all these algorithms, the effective rank of the weight updates remains low, which can lead to a substantial loss of information from the gradient. This loss can be critically important, especially during the pre-training stage. In this paper, we introduce FRUGAL (Full-Rank Updates with GrAdient spLitting), a new memory-efficient optimization framework. FRUGAL leverages gradient splitting to perform low-dimensional updates using advanced algorithms (such as Adam), while updates along the remaining directions are executed via state-free methods like SGD or signSGD. Our framework can be integrated with various low-rank update selection techniques, including GaLore and BAdam. We provide theoretical convergence guarantees for our framework when using SGDM for low-dimensional updates and SGD for state-free updates. Additionally, our method consistently outperforms concurrent approaches, achieving state-of-the-art results in pre-training and fine-tuning tasks while balancing memory efficiency and performance metrics.
Paperid:2787
Authors:Mingyu Xing · Lechao Cheng · Shengeng Tang · Yaxiong Wang · Zhun Zhong · Meng Wang
Abstract:
We introduce \textbf{Knowledge Swapping}, a novel task designed to selectively regulate knowledge of a pretrained model by enabling the forgetting of user-specified information, retaining essential knowledge, and acquiring new knowledge simultaneously. By delving into the analysis of the knock-on feature hierarchy, we find that incremental learning typically progresses from low-level representations to higher-level semantics, whereas forgetting tends to occur in the opposite direction, starting from high-level semantics and moving down to low-level features. Building upon this, we propose to benchmark the knowledge swapping task with the strategy of \textit{Learning Before Forgetting}. Comprehensive experiments on various tasks like image classification, object detection, and semantic segmentation validate the effectiveness of the proposed strategy.
Paperid:2788
Authors:William de Vazelhes · Xiaotong Yuan · Bin Gu
Abstract:
In sparse optimization, enforcing hard constraints using the $\ell_0$ pseudo-norm offers advantages like controlled sparsity compared to convex relaxations. However, many real-world applications demand not only sparsity constraints but also some extra constraints. While prior algorithms have been developed to address this complex scenario with mixed combinatorial and convex constraints, they typically require a closed-form projection onto the mixed constraints, which might not exist, and/or only provide local guarantees of convergence, which is different from the global guarantees commonly sought in sparse optimization. To fill this gap, in this paper, we study the problem of sparse optimization with extra *support-preserving* constraints commonly encountered in the literature. We present a new variant of the iterative hard-thresholding algorithm equipped with a two-step consecutive projection operator customized for these mixed constraints, serving as a simple alternative to the Euclidean projection onto the mixed constraint. By introducing a novel trade-off between sparsity relaxation and sub-optimality, we provide global guarantees in objective value for the output of our algorithm, in the deterministic, stochastic, and zeroth-order settings, under the conventional restricted strong-convexity/smoothness assumptions. As a fundamental contribution in proof techniques, we develop a novel extension of the classic three-point lemma to the considered two-step non-convex projection operator, which allows us to analyze the convergence in objective value in an elegant way that has not been possible with existing techniques. In the zeroth-order case, such a technique also improves upon the state-of-the-art result from (de Vazelhes et al., 2022), even in the case without additional constraints, by allowing us to remove a non-vanishing system error present in their work.
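A toy sketch of the two-step consecutive projection idea: a gradient step, then the hard-thresholding projection onto the sparsity constraint, then projection onto an extra support-preserving constraint. Nonnegativity is used here as an illustrative support-preserving constraint, and all problem sizes and step sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sparse nonnegative ground truth and a random sensing matrix (toy sizes).
n, d, k = 80, 30, 3
x_true = np.zeros(d)
x_true[[2, 7, 11]] = [1.5, 2.0, 0.5]
A = rng.standard_normal((n, d)) / np.sqrt(n)
y = A @ x_true

def iht_two_step(A, y, k, steps=500, lr=0.5):
    """Iterative hard thresholding with a two-step consecutive projection:
    (1) keep the k largest-magnitude entries (the l0 constraint),
    (2) project onto the extra support-preserving constraint
        (here: clip to the nonnegative orthant, which never grows the support)."""
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        x = x - lr * A.T @ (A @ x - y)          # gradient step on 0.5||Ax - y||^2
        keep = np.argsort(np.abs(x))[-k:]       # step 1: hard threshold (top-k)
        mask = np.zeros_like(x)
        mask[keep] = 1.0
        x *= mask
        x = np.maximum(x, 0.0)                  # step 2: extra projection
    return x

x_hat = iht_two_step(A, y, k)
print(np.linalg.norm(x_hat - x_true))
```

The key property being illustrated is that the second projection does not enlarge the support chosen by the first, so the composed operator stays a valid (if non-Euclidean) projection onto the mixed constraint set.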
Paperid:2789
Authors:Yuhao Mao · Stefan Balauca · Martin Vechev
Abstract:
Training certifiably robust neural networks is an important but challenging task. While many algorithms for (deterministic) certified training have been proposed, they are often evaluated on different training schedules, certification methods, and systematically undertuned hyperparameters, making it difficult to compare their performance. To address this challenge, we introduce CTBench, a unified library and a high-quality benchmark for certified training that evaluates all algorithms under fair settings and systematically tuned hyperparameters. We show that (1) almost all algorithms in CTBench surpass their performance reported in the literature, by margins on the order of typical algorithmic improvements, thus establishing a new state-of-the-art, and (2) the claimed advantage of recent algorithms drops significantly when we enhance the outdated baselines with a fair training schedule, a fair certification method and well-tuned hyperparameters. Based on CTBench, we provide new insights into the current state of certified training, including (1) certified models have a less fragmented loss surface, (2) certified models share many mistakes, (3) certified models have more sparse activations, (4) reducing regularization cleverly is crucial for certified training, especially for large radii, and (5) certified training has the potential to improve out-of-distribution generalization. We are confident that CTBench will serve as a benchmark and testbed for future research in certified training.
Paperid:2790
Authors:Ruida Wang · Rui Pan · Yuxin Li · Jipeng Zhang · Yizhen Jia · shizhe diao · Renjie Pi · Junjie Hu · Tong Zhang
Abstract:
Solving mathematical problems using computer-verifiable languages like Lean has significantly impacted the mathematical and computer science communities. State-of-the-art methods utilize single Large Language Models (LLMs) as agents or provers to either generate complete proofs or perform tree searches. However, single-agent methods inherently lack a structured way to combine high-level reasoning in Natural Language (NL) with Formal Language (FL) verification feedback. To solve these issues, we propose MA-LoT: a Multi-Agent Lean-based Long Chain-of-Thought framework, to the best of our knowledge the first multi-agent framework for Lean4 theorem proving that balances high-level NL reasoning and FL verification in Long CoT. Using this structured interaction, our approach enables deeper insights and long-term coherence in proof generation, where past methods struggle. We achieve this by leveraging the emergent formal reasoning ability in Long CoT using our novel LoT-Transfer Learning training-inference pipeline. Extensive experiments show that our framework achieves a \textbf{54.92%} accuracy rate on the Lean4 version of the MiniF2F-Test dataset, largely outperforming GPT-4 (22.95%), single-agent tree search (InternLM-Step-Prover, 49.18%), and whole-proof generation (DeepSeek-Prover-v1.5, 48.30%) baselines. Furthermore, our findings highlight the potential of combining Long CoT with formal verification for more insightful generation in a broader perspective.
Paperid:2791
Authors:Weiguo Wang · Andy Nie · Wenrui Zhou · Yi Kai · Chengchen Hu
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities in text and multimodal processing, yet they fundamentally lack physical awareness: understanding of real-world physical phenomena. We explore the feasibility of teaching LLMs physical awareness through sound, focusing on fundamental physical phenomena like the Doppler effect, multipath effect, and spatial relationships. To overcome data scarcity, we introduce a physics-based simulator combining real-world sound sources with controlled physical channels to generate diverse training data. Using this simulator, we build AQA-PHY, a comprehensive Audio Question-Answer dataset, and propose an audio encoder that processes both magnitude and phase information. By connecting our audio encoder to state-of-the-art LLMs, we demonstrate reasonable results in both simulated and real-world tasks, such as line-of-sight detection, Doppler effect estimation, and Direction-of-Arrival estimation, paving the way for enabling LLMs to understand physical phenomena.
Paperid:2792
Authors:Taejin Park · Ivan Medennikov · Kunal Dhawan · Weiqing Wang · He Huang · Nithin Koluguri · Krishna Puvvada · Jagadeesh Balam · Boris Ginsburg
Abstract:
Sortformer is an encoder-based speaker diarization model designed to supervise speaker tagging in speech-to-text models. Instead of relying solely on permutation invariant loss (PIL), Sortformer introduces Sort Loss to resolve the permutation problem, either independently or in tandem with PIL. In addition, we propose a streamlined multi-speaker speech-to-text architecture that leverages Sortformer for speaker supervision, embedding speaker labels into the encoder using sinusoidal kernel functions. This design addresses the speaker permutation problem through sorted objectives, effectively bridging timestamps and tokens to supervise speaker labels in the output transcriptions. Experiments demonstrate that Sort Loss can boost speaker diarization performance, and incorporating the speaker supervision from Sortformer improves multi-speaker transcription accuracy. We anticipate that the proposed Sortformer and multi-speaker architecture will enable the seamless integration of speaker tagging capabilities into foundational speech-to-text systems and multimodal LLMs, offering an easily adoptable and user-friendly mechanism to enhance their versatility and performance in speaker-aware tasks. The code and trained models will be made publicly available through an open-source framework.
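A sketch of the sorted-objective idea: ordering speaker-activity targets by each speaker's first active frame removes the permutation ambiguity that PIL otherwise has to search over, so an ordinary loss can be applied row by row. The toy targets below are illustrative, not from the paper.

```python
import numpy as np

def sort_by_arrival(targets):
    """Order speaker-activity rows by each speaker's first active frame,
    so target rows have a canonical order (Sort-Loss-style supervision)."""
    onsets = [int(np.argmax(row > 0)) if row.any() else row.size
              for row in targets]
    order = np.argsort(onsets, kind="stable")
    return targets[order]

# 3 hypothetical speakers x 6 frames; speaker 2 talks first, then 0, then 1.
t = np.array([[0, 0, 1, 1, 0, 0],
              [0, 0, 0, 0, 1, 1],
              [1, 1, 0, 0, 0, 0]])
print(sort_by_arrival(t))
```

After sorting, output slot 0 always corresponds to the earliest speaker, which is what lets timestamps and tokens be bridged without a permutation search at training time.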
Paperid:2793
Authors:Deheng Yuan · Tao Guo · Zhongyi Huang
Abstract:
Consider the communication-constrained problem of nonparametric function estimation, in which each distributed terminal holds multiple i.i.d. samples. Under certain regularity assumptions, we characterize the minimax optimal rates for all regimes, and identify phase transitions of the optimal rates as the samples per terminal vary from sparse to dense. This fully solves the problem left open by previous works, whose scopes are limited to regimes with either dense samples or a single sample per terminal. To achieve the optimal rates, we design a layered estimation protocol by exploiting protocols for the parametric density estimation problem. We show the optimality of the protocol using information-theoretic methods and strong data processing inequalities, and incorporating the classic balls and bins model. The optimal rates are immediate for various special cases such as density estimation, Gaussian, binary, Poisson and heteroskedastic regression models.
Paperid:2794
Authors:Austin Nguyen · Anri Gu · Michael Wellman
Abstract:
Iterative extension of empirical game models through deep reinforcement learning (RL) has proved an effective approach for finding equilibria in complex games. When multiple equilibria exist, we may also be interested in finding solutions with particular characteristics. We address this issue of equilibrium selection in the context of Policy Space Response Oracles (PSRO), a flexible game-solving framework based on deep RL, by skewing the strategy exploration process towards higher-welfare solutions. At each iteration, we create an exploration policy that imitates high welfare-yielding behavior and train a response to the current solution, regularized to be similar to the exploration policy. With no additional simulation expense, our approach, named Ex$^2$PSRO, tends to find higher welfare equilibria than vanilla PSRO in two benchmarks: a sequential bargaining game and a social dilemma game. Further experiments demonstrate Ex$^2$PSRO's composability with other PSRO variants and illuminate the relationship between exploration policy choice and algorithmic performance.
Paperid:2795
Authors:Soumyabrata Kundu · Risi Kondor
Abstract:
In this work we introduce Steerable Transformers, an extension of the Vision Transformer mechanism that maintains equivariance to the special Euclidean group $\mathrm{SE}(d)$. We propose an equivariant attention mechanism that operates on features extracted by steerable convolutions. Operating in Fourier space, our network utilizes Fourier-space nonlinearities. Our experiments in both two and three dimensions show that adding a steerable transformer encoder layer to a steerable convolution network enhances performance.
Paperid:2796
Authors:Onur Celik · Zechu Li · Denis Blessing · Ge Li · Daniel Palenicek · Jan Peters · Georgia Chalvatzaki · Gerhard Neumann
Abstract:
Maximum entropy reinforcement learning (MaxEnt-RL) has become the standard approach to RL due to its beneficial exploration properties. Traditionally, policies are parameterized using Gaussian distributions, which significantly limits their representational capacity. Diffusion-based policies offer a more expressive alternative, yet integrating them into MaxEnt-RL poses challenges, primarily due to the intractability of computing their marginal entropy. To overcome this, we propose Diffusion-Based Maximum Entropy RL (DIME). DIME leverages recent advances in approximate inference with diffusion models to derive a lower bound on the maximum entropy objective. Additionally, we propose a policy iteration scheme that provably converges to the optimal diffusion policy. Our method enables the use of expressive diffusion-based policies while retaining the principled exploration benefits of MaxEnt-RL, significantly outperforming other diffusion-based methods on challenging high-dimensional control benchmarks. It is also competitive with state-of-the-art non-diffusion-based RL methods while requiring fewer algorithmic design choices and smaller update-to-data ratios, reducing computational complexity.
Paperid:2797
Authors:Xiang Li · Neil Chowdhury · Daniel D. Johnson · Tatsunori Hashimoto · Percy Liang · Sarah Schwettmann · Jacob Steinhardt
Abstract:
Language models exhibit complex, diverse behaviors when prompted with free-form text, making it hard to characterize the space of possible outputs. We study the problem of behavioral elicitation, where the goal is to search for prompts that induce specific target behaviors (e.g., hallucinations, harmful responses) from a target language model. To navigate the exponentially large space of possible prompts, we train amortized investigator models to emulate the posterior distribution over the prompts, conditioned on the target behavior. Specifically, we first fit a reverse model and then use reinforcement learning to optimize the likelihood of generating the target behavior. To improve the diversity of the prompt distribution, we further propose a novel iterative training objective based on the Frank-Wolfe algorithm that encourages each iteration to discover different sets of prompts not captured by previous iterations. Our investigator models produce prompts that exhibit a variety of effective and human-interpretable strategies for behavior elicitation, obtaining a 100% attack success rate on AdvBench (Harmful Behaviors) and an 85% hallucination rate.
Paperid:2798
Authors:Leello Dadi · Volkan Cevher
Abstract:
We study the generalization of iterative noisy gradient schemes on smooth non-convex losses. Formally, we establish time-independent information-theoretic generalization bounds for Stochastic Gradient Langevin Dynamics (SGLD) that do not diverge as the iteration count increases. Our bounds are obtained through a stability argument: we analyze the difference between two SGLD sequences run in parallel on two datasets sampled from the same distribution. Our result only requires an isoperimetric inequality to hold, which is merely a restriction on the tails of the loss. Our work relaxes the assumptions of prior work to establish that the iterates stay within a bounded KL divergence from each other. Under an additional dissipativity assumption, we show that the stronger Renyi divergence also stays bounded by establishing a uniform log-Sobolev constant of the iterates. Without dissipativity, we sidestep the need for local log-Sobolev inequalities and instead exploit the regularizing properties of Gaussian convolution. These techniques allow us to show that strong convexity is not necessary for finite stability bounds. Our work shows that noisy SGD can have finite, iteration-independent generalization and differential privacy bounds in unbounded non-convex settings.
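The stability argument above couples two SGLD chains run in parallel on neighboring datasets. A minimal sketch of that coupling (toy quadratic losses and all variable names are ours, not the paper's): sharing the injected Gaussian noise between the two chains makes their gap contract to a small, iteration-independent value.

```python
import numpy as np

def sgld_step(w, grad_fn, step, noise):
    # One SGLD update: a gradient step plus injected Gaussian noise
    # whose variance is 2 * step (the Langevin term).
    return w - step * grad_fn(w) + np.sqrt(2.0 * step) * noise

# Two chains on "neighboring" losses (one sample perturbed), driven by
# the SAME noise, as in stability-style couplings.
rng = np.random.default_rng(0)
grad_a = lambda w: w            # gradient of 0.5 * ||w||^2
grad_b = lambda w: w - 0.01     # gradient of the perturbed loss
wa = wb = np.zeros(3)
for _ in range(1000):
    z = rng.normal(size=3)
    wa = sgld_step(wa, grad_a, 0.1, z)
    wb = sgld_step(wb, grad_b, 0.1, z)
gap = np.abs(wa - wb).max()
print(gap)  # contracts toward 0.01, independent of iteration count
```

Because the noise cancels in the difference of the chains, the gap obeys a deterministic contraction and stays bounded no matter how long the chains run, mirroring the time-independent bounds claimed above.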
Paperid:2799
Authors:Jiarui Lu · Xiaoyin Chen · Stephen Lu · Aurelie Lozano · Vijil Chenthamarakshan · Payel Das · Jian Tang
Abstract:
Protein dynamics play a crucial role in protein biological functions and properties, and their traditional study typically relies on time-consuming molecular dynamics (MD) simulations conducted in silico. Recent advances in generative modeling, particularly denoising diffusion models, have enabled efficient and accurate protein structure prediction and conformation sampling by learning distributions over crystallographic structures. However, effectively integrating physical supervision into these data-driven approaches remains challenging, as standard energy-based objectives often lead to intractable optimization. In this paper, we introduce Energy-based Alignment (EBA), a method that aligns generative models with feedback from physical models, efficiently calibrating them to appropriately balance conformational states based on their energy differences. Experimental results on the MD ensemble benchmark demonstrate that EBA achieves state-of-the-art performance in generating high-quality protein ensembles. By improving the physical plausibility of generated structures, our approach enhances model predictions and holds promise for applications in structural biology and drug discovery.
Paperid:2800
Authors:Abdulrahman Diaa · Toluwani Aremu · Nils Lukas
Abstract:
Large Language Models (LLMs) can be misused to spread online spam and misinformation. Content watermarking deters misuse by hiding a message in generated outputs, enabling detection using a secret watermarking key. Robustness is a core security property, stating that evading detection requires (significant) degradation of the content’s quality. Many LLM watermarking methods have been proposed, but robustness is tested only against non-adaptive attackers who lack knowledge of the provider’s watermarking method and can find only suboptimal attacks. We formulate the robustness of LLM watermarking as an objective function and use preference-based optimization to tune adaptive attacks against the specific watermarking method. Our evaluation shows that (i) adaptive attacks evade detection against all surveyed watermarking methods; (ii) even in a non-adaptive setting, attacks optimized adaptively against known watermarks remain effective when tested on unseen watermarks; and (iii) optimization-based attacks are scalable, requiring limited computational resources of less than seven GPU hours. Our findings underscore the need to test robustness against adaptive attacks.
Paperid:2801
Authors:Yafu Li · Xuyang Hu · Xiaoye Qu · Linjie Li · Yu Cheng
Abstract:
Large language models (LLMs) have demonstrated impressive performance but often lack the flexibility to adapt to human preferences quickly without retraining. Inspired by recent efforts on test-time scaling, we make the first attempt to propose Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference, eliminating the need to update model parameters. Instead of relying on purely numerical rewards, TPO translates reward signals into \emph{textual} critiques and uses them as textual rewards to iteratively refine its response. Evaluations on benchmarks covering instruction following, preference alignment, safety, and mathematics reveal that TPO progressively improves alignment with human preferences. Notably, after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass the aligned counterpart, Llama-3.1-70B-Instruct. Furthermore, TPO scales efficiently with both the search width and depth of the inference process. Through case studies, we illustrate how TPO exploits the innate capacity of LLMs to interpret and act upon reward signals. Our findings establish TPO as a practical, lightweight alternative for test-time preference optimization, achieving alignment on the fly.
Paperid:2802
Authors:Tiansheng Huang · Gautam Bhattacharya · Pratik Joshi · Joshua Kimball · Ling Liu
Abstract:
Safety-aligned Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks \cite{qi2023fine} -- a few harmful data points mixed into the fine-tuning dataset can break the LLM's safety alignment. While several defenses have been proposed, our evaluation shows that existing defenses fail \textit{when some specific training hyper-parameters are chosen} -- a large learning rate or a large number of training epochs in the fine-tuning stage can easily invalidate the defense. To this end, we propose Antidote, a post-fine-tuning-stage solution, which remains \textbf{\textit{agnostic to the training hyper-parameters in the fine-tuning stage}}. Antidote relies on the philosophy that by removing the harmful parameters, the harmful model can be recovered from the harmful behaviors, regardless of how those harmful parameters are formed in the fine-tuning stage. With this philosophy, we introduce a one-shot pruning stage after harmful fine-tuning to remove the harmful weights that are responsible for the generation of harmful content. Despite its embarrassing simplicity, empirical results show that Antidote can reduce the harmful score while maintaining accuracy on downstream tasks. Our code is attached in the supplementary materials.
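As a rough illustration of the one-shot, hyper-parameter-agnostic pruning idea, the sketch below zeros out the weights that moved most during harmful fine-tuning. The selection criterion here is entirely hypothetical; the abstract does not specify how Antidote identifies harmful weights, and all names are ours.

```python
import numpy as np

def one_shot_prune(w_aligned, w_harmful, prune_frac):
    # Hypothetical criterion (not the paper's exact rule): zero out the
    # prune_frac fraction of weights with the largest movement between
    # the aligned and the harmfully fine-tuned model. The rule depends
    # only on the two checkpoints, not on the fine-tuning learning rate
    # or epoch count.
    delta = np.abs(w_harmful - w_aligned).ravel()
    k = max(1, int(prune_frac * delta.size))
    idx = np.argpartition(delta, -k)[-k:]   # indices of the k largest moves
    w_out = w_harmful.ravel().copy()
    w_out[idx] = 0.0
    return w_out.reshape(w_harmful.shape)

w0 = np.zeros(10)                          # toy "aligned" weights
w1 = np.array([0.0] * 8 + [5.0, -4.0])     # two weights drifted far
w2 = one_shot_prune(w0, w1, 0.2)           # prunes exactly those two
print(w2)
```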
Paperid:2803
Authors:Kei Sen Fong · Mehul Motani
Abstract:
In this work, we propose a novel approach that combines the strengths of FEAT (Feature Engineering Automation Tool) and TabNet (a deep learning model for tabular data) through knowledge distillation (KD), which we term FEAT-KD. FEAT is an intrinsically interpretable machine learning algorithm that constructs a weighted linear combination of concisely-represented features discovered via genetic programming. FEAT is a first-class algorithm for real-world applications, such as healthcare, that value intrinsic `white-box' explainability over post-hoc explanations. However, genetic programming in FEAT can be grossly inefficient, motivating the integration of deep learning techniques to enhance its optimization capabilities. To address this limitation, FEAT-KD leverages TabNet’s deep-learning-based optimization and feature selection mechanisms. FEAT-KD finds a weighted linear combination of concisely-represented, symbolic features that are derived from piece-wise distillation of a trained TabNet model. We analyze FEAT-KD on regression tasks from two perspectives: (i) compared to TabNet, FEAT-KD significantly reduces model complexity while retaining competitive predictive performance, effectively converting a black-box deep learning model into a more interpretable white-box representation; (ii) compared to FEAT, our method consistently outperforms in prediction accuracy, produces more compact models, and reduces the complexity of learned symbolic expressions. In addition, we demonstrate that FEAT-KD easily supports multi-target regression, in which the shared features contribute to the interpretability of the system. Our results suggest that FEAT-KD is a promising direction for interpretable machine learning, bridging the gap between deep learning’s predictive power and the intrinsic transparency of symbolic models.
Paperid:2804
Authors:Yikang Chen · Dehui du
Abstract:
This paper investigates $\sim_{\mathcal{L}_3}$-identifiability, a form of complete counterfactual identifiability within the Pearl Causal Hierarchy (PCH) framework, ensuring that all Structural Causal Models (SCMs) satisfying the given assumptions provide consistent answers to all causal questions. To simplify this problem, we introduce exogenous isomorphism and propose $\sim_{\mathrm{EI}}$-identifiability based on this concept, demonstrating that full recovery of exogenous variables in SCMs is not required to achieve $\sim_{\mathcal{L}_3}$-consistency. We explore sufficient assumptions for achieving $\sim_{\mathrm{EI}}$-identifiability in two special classes of SCMs: Bijective SCMs (BSCMs), based on counterfactual transport, and Triangular Monotonic SCMs (TM-SCMs), which extend $\sim_{\mathcal{L}_2}$-identifiability. Our results unify and generalize existing theories, providing theoretical guarantees for practical applications. Finally, we leverage neural TM-SCMs to address the consistency problem in counterfactual reasoning, with experiments validating both the effectiveness of our method and the correctness of the theory.
Paperid:2805
Authors:Wenshuai Zhao · Zhiyuan Li · Joni Pajarinen
Abstract:
The number of agents can be an effective curriculum variable for controlling the difficulty of multi-agent reinforcement learning (MARL) tasks. Existing work typically uses manually defined curricula such as linear schemes. We identify two potential flaws when applying existing reward-based automatic curriculum learning methods in MARL: (1) the expected episode return used to measure task difficulty has high variance; (2) credit assignment difficulty can be exacerbated in tasks where increasing the number of agents yields higher returns, which is common in many MARL tasks. To address these issues, we propose to control the curriculum by using a TD-error-based learning progress measure and by letting the curriculum proceed from an initial context distribution to the final task-specific one. Since our approach maintains a distribution over the number of agents and measures learning progress rather than absolute performance, which often increases with the number of agents, we alleviate problem (2). Moreover, the learning progress measure naturally alleviates problem (1) by aggregating returns. In three challenging sparse-reward MARL benchmarks, our approach outperforms state-of-the-art baselines.
Paperid:2806
Authors:Weijian Luo · colin zhang · Debing Zhang · Zhengyang Geng
Abstract:
We propose Diff-Instruct (DI), a data-efficient post-training approach for one-step text-to-image generative models that improves their alignment with human preferences without requiring image data. Our method frames alignment as online reinforcement learning from human feedback (RLHF), which optimizes a human reward function while regularizing the generator to stay close to a reference diffusion process. Unlike traditional RLHF approaches, which rely on the KL divergence for regularization, we introduce a novel score-based divergence regularization that substantially improves performance. Although such a score-based RLHF objective seems intractable to optimize, we derive a strictly equivalent and tractable loss function whose gradient can be computed efficiently. Building upon this framework, we train DI-SDXL-1step, a one-step text-to-image model based on Stable Diffusion-XL (2.6B parameters), capable of generating 1024x1024-resolution images in a single step. The 2.6B DI-SDXL-1step model outperforms the 12B FLUX-dev model in ImageReward, PickScore, and CLIP score on the Parti prompts benchmark while using only 1.88% of the inference time. This result strongly supports the claim that, with proper post-training, a small one-step model can beat huge multi-step models. We will open-source our industry-ready model to the community.
Paperid:2807
Authors:Kevin Rojas · Yuchen Zhu · Sichen Zhu · Felix Ye · Molei Tao
Abstract:
Diffusion models have shown remarkable performance in generating unimodal data in various tasks, such as image, video, and text generation. By contrast, the joint generation of multimodal data with diffusion models is still in the early stages of exploration. Existing approaches rely heavily on external preprocessing protocols, such as VAEs, to harmonize varied data representations into a unified unimodal format. This process demands high accuracy from the encoders and decoders, which can be problematic for applications with little data. To lift this restriction, we propose a novel framework for building multimodal diffusion models on arbitrary state spaces, enabling the generation of coupled data across different modalities. By introducing an innovative decoupled noise schedule for each modality, we enable both unconditional and modality-conditioned generation within a single model. We empirically validate our approach on text-image generation and mixed-type tabular data synthesis, demonstrating that it achieves competitive performance.
Paperid:2808
Authors:Jade Garcia Bourrée · Augustin Godinot · Martijn de Vos · Milos Vujasinovic · Sayan Biswas · Gilles Tredan · Erwan Le Merrer · Anne-Marie Kermarrec
Abstract:
The rapid adoption of ML decision-making systems across products and services has led to a set of regulations on how such systems should behave and be built. Among all the technical challenges to enforcing these regulations, one crucial yet under-explored problem is the risk of manipulation while these systems are being audited for fairness. This manipulation occurs when a platform deliberately alters its answers to a regulator to pass an audit without modifying its answers to other users. In this paper, we introduce a novel approach to manipulation-proof auditing by taking into account the auditor's prior knowledge of the task solved by the platform. We first demonstrate that regulators must not rely on public priors (e.g., a public dataset), as platforms could easily fool the auditor in such cases. We then formally establish the conditions under which an auditor can prevent audit manipulations using prior knowledge about the ground truth. Finally, our experiments with two standard datasets exemplify the maximum level of unfairness a platform can hide before being detected as malicious. Our formalization and generalization of manipulation-proof auditing with a prior opens up new research directions for more robust fairness audits.
Paperid:2809
Authors:Mohammadreza Daneshvaramoli · Helia Karisani · Adam Lechowicz · Bo Sun · Cameron Musco · Mohammad Hajiesmaili
Abstract:
This paper introduces a family of learning-augmented algorithms for online knapsack problems that achieve near Pareto-optimal consistency-robustness trade-offs through a simple combination of trusted learning-augmented and worst-case algorithms. Our approach relies on succinct, practical predictions: single values or intervals estimating the minimum value of any item in an offline solution. Additionally, we propose a novel fractional-to-integral conversion procedure, offering new insights for online algorithm design.
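For intuition, the worst-case component that such learning-augmented schemes fall back on is often a threshold rule. The sketch below shows the classical density threshold for online knapsack with value densities known to lie in [L, U]; the paper's actual combination with a prediction of the minimum offline item value is not reproduced here, and the function names are ours.

```python
import math

def threshold(z, L, U):
    # Classical worst-case threshold for online knapsack with value
    # densities in [L, U]: accept an item only when its density exceeds
    # phi(z), where z is the fraction of capacity already committed.
    # phi(0) = L/e and phi(1) = U, so the bar rises as capacity fills.
    return (U * math.e / L) ** z * (L / math.e)

def online_knapsack(items, L, U):
    # items: (value, weight) pairs arriving online, unit total capacity.
    # Greedy sketch of the worst-case algorithm only.
    z, total = 0.0, 0.0
    for v, w in items:
        if z + w <= 1.0 and v / w >= threshold(z, L, U):
            z += w
            total += v
    return total

# A dense item clears the low initial bar; a sparse one later does not.
print(online_knapsack([(0.5, 0.5), (0.1, 0.5)], L=1.0, U=10.0))
```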
Paperid:2810
Authors:Aniket Vashishtha · Abhinav Kumar · Atharva Pandey · Abbavaram Gowtham Reddy · Kabir Ahuja · Vineeth N Balasubramanian · Amit Sharma
Abstract:
For text-based AI systems to interact in the real world, causal reasoning is an essential skill. Since interventional data is costly to generate, we study to what extent an agent can learn causal reasoning from passive data. Specifically, we consider an axiomatic training setup where an agent learns from multiple demonstrations of a causal axiom (or rule), rather than incorporating the axiom as an inductive bias or inferring it from data values. A key question is whether the agent would learn to generalize from the axiom demonstrations to new scenarios. For example, if a transformer model is trained on demonstrations of the causal transitivity axiom over small graphs, would it generalize to applying the transitivity axiom over large graphs? Our results, based on a novel axiomatic training scheme, indicate that such generalization is possible. We consider the task of inferring whether a variable causes another variable, given a causal graph structure. We find that a 67 million parameter transformer model, when trained on linear causal chains (along with some noisy variations), can generalize well to new kinds of graphs, including longer causal chains, causal chains with reversed order, and graphs with branching, even when it is not explicitly trained for such settings. Our model performs on par with (or even better than) many larger language models such as GPT-4, Gemini Pro, and Phi-3. Overall, our axiomatic training framework provides a new paradigm of learning causal reasoning from passive data that can be used to learn arbitrary axioms, as long as sufficient demonstrations can be generated.
Paperid:2811
Authors:Yuhuan Yang · Chaofan Ma · Zhenjie Mao · Jiangchao Yao · Ya Zhang · Yanfeng Wang
Abstract:
Video understanding is a complex challenge that requires effective modeling of spatial-temporal dynamics. With the success of image foundation models (IFMs) in image understanding, recent approaches have explored parameter-efficient fine-tuning (PEFT) to adapt IFMs for video. However, most of these methods tend to process spatial and temporal information separately, which may fail to capture the full intricacy of video dynamics. In this paper, we propose MoMa, an efficient adapter framework that achieves full spatial-temporal modeling by integrating Mamba's selective state space modeling into IFMs. We propose a novel SeqMod operation to inject spatial-temporal information into pre-trained IFMs, without disrupting their original features. By incorporating SeqMod into a Divide-and-Modulate architecture, MoMa enhances video understanding while maintaining computational efficiency. Extensive experiments on multiple video benchmarks demonstrate the effectiveness of MoMa, achieving superior performance with reduced computational cost. Codes will be released upon publication.
Paperid:2812
Authors:Ruizhe Chen · Dongyu Xue · Xiangxin Zhou · Zaixiang Zheng · xiangxiang Zeng · Quanquan Gu
Abstract:
Research on single-chain protein modeling has been extensively and deeply explored, with advancements seen in models like the ESM series and AlphaFold. Despite these developments, the study and modeling of multi-chain proteins remain largely uncharted, though they are vital for understanding biological functions. Proteins typically exist in complexes, interacting with other proteins to perform their specific biological roles. Recognizing the importance of these interactions, we introduce APM (All-Atom Protein Generative Model), a model specifically designed for modeling multi-chain proteins. By integrating atom-level information and leveraging data on multi-chain proteins, APM is capable of precisely modeling inter-chain interactions and designing protein complexes with binding capabilities from scratch. It also performs folding and inverse-folding tasks for multi-chain proteins. Moreover, APM demonstrates versatility in downstream applications: it can be supervised fine-tuned (SFT) for enhanced performance while also supporting zero-shot sampling in certain tasks, achieving state-of-the-art results.
Paperid:2813
Authors:Junlong Ke · Yazhou Ren · Zichen Wen · Tianyi Wu · Yang Yang · Xiaorong Pu · Lifang He
Abstract:
Multi-view clustering has gained significant attention for integrating multi-view information in multimedia applications. With the growing complexity of graph data, multi-view graph clustering (MVGC) has become increasingly important. Existing methods primarily use Graph Neural Networks (GNNs) to encode structural and feature information, but applying GNNs within contrastive learning poses specific challenges, such as integrating graph data with node features and handling both homophilic and heterophilic graphs. To address these challenges, this paper introduces Node-Guided Contrastive Encoding (NGCE), a novel MVGC approach that leverages node features to guide embedding generation. NGCE enhances compatibility with GNN filtering, effectively integrates homophilic and heterophilic information, and strengthens contrastive learning across views. Extensive experiments demonstrate its robust performance on six homophilic and heterophilic multi-view benchmark datasets.
Paperid:2814
Authors:Yujun Huang · Bin Chen · Niu Lian · Xin Wang · Baoyi An · Tao Dai · Shutao Xia
Abstract:
Existing multi-view image compression methods often rely on 2D projection-based similarities between views to estimate disparities. While effective for small disparities, such as those in stereo images, these methods struggle with the more complex disparities encountered in wide-baseline multi-camera systems, commonly found in virtual reality and autonomous driving applications. To address this limitation, we propose 3D-LMVIC, a novel learning-based multi-view image compression framework that leverages 3D Gaussian Splatting to derive geometric priors for accurate disparity estimation. Furthermore, we introduce a depth map compression model to minimize geometric redundancy across views, along with a multi-view sequence ordering strategy based on a defined distance measure between views to enhance correlations between adjacent views. Experimental results demonstrate that 3D-LMVIC achieves superior performance compared to both traditional and learning-based methods. Additionally, it significantly improves disparity estimation accuracy over existing two-view approaches.
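The sequence-ordering idea can be sketched as a greedy nearest-neighbor pass over pairwise view distances, so that adjacent views in the coding order are maximally correlated. The actual distance measure defined in the paper is not reproduced here; this is a generic illustration with hypothetical names.

```python
def order_views(dist, start=0):
    # Greedy nearest-neighbor ordering: repeatedly append the remaining
    # view closest to the most recently added one, so consecutive views
    # in the coding order are similar and easy to predict from each other.
    n = len(dist)
    order = [start]
    remaining = set(range(n)) - {start}
    while remaining:
        last = order[-1]
        nxt = min(remaining, key=lambda j: dist[last][j])
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Toy symmetric distance matrix for four camera views.
D = [[0, 1, 4, 6],
     [1, 0, 2, 5],
     [4, 2, 0, 3],
     [6, 5, 3, 0]]
print(order_views(D))  # -> [0, 1, 2, 3]
```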
Paperid:2815
Authors:Guangzheng Hu · Feng Liu · Mingming Gong · Guanghui Wang · Liuhua Peng
Abstract:
Data imbalance is a common factor hindering classifier performance. Data-level approaches for imbalanced learning, such as resampling, often lead to information loss or generative errors. Building on theoretical studies of the imbalance ratio in binary classification, we find that adding suitable label noise can adjust biased decision boundaries and improve classifier performance. This paper proposes the Label-Noise-based Re-balancing (LNR) approach to imbalanced learning, employing a novel design of an asymmetric label noise model. In contrast to other data-level methods, LNR alleviates the issues of information loss and generative errors and can be integrated seamlessly with any classifier or algorithm-level method. We validated the superiority of LNR on synthetic and real-world datasets. Our work opens a new avenue for imbalanced learning, highlighting the potential of beneficial label noise.
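A toy sketch of the underlying idea (our simplification, not the paper's exact asymmetric noise model): flipping a fraction of majority-class labels toward the minority class shifts a classifier's decision boundary back toward the majority region, without discarding or synthesizing samples.

```python
import numpy as np

def asymmetric_label_noise(y, flip_rate, majority=0, minority=1, rng=None):
    # Illustrative asymmetric noise: flip flip_rate of the majority-class
    # labels to the minority class, leaving minority labels untouched.
    # Unlike resampling, every feature vector is kept (no information
    # loss) and nothing is generated (no generative errors).
    rng = np.random.default_rng(rng)
    y = np.asarray(y).copy()
    maj_idx = np.flatnonzero(y == majority)
    n_flip = int(flip_rate * maj_idx.size)
    flipped = rng.choice(maj_idx, size=n_flip, replace=False)
    y[flipped] = minority
    return y

y = np.array([0] * 90 + [1] * 10)            # 9:1 imbalance
y_noisy = asymmetric_label_noise(y, 0.2, rng=0)
print((y_noisy == 1).sum())                   # 10 + 18 flipped = 28
```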
Paperid:2816
Authors:Shiming Chen · Dingjie Fu · Salman Khan · Fahad Khan
Abstract:
Remarkable progress in zero-shot learning (ZSL) has been achieved using generative models. However, existing generative ZSL methods merely generate (imagine) the visual features from scratch guided by the strong class semantic vectors annotated by experts, resulting in suboptimal generative performance and limited scene generalization. To address these and advance ZSL, we propose an inductive variational autoencoder for generative zero-shot learning, dubbed GenZSL. Mimicking human-level concept learning, GenZSL operates by inducting new class samples from similar seen classes using weak class semantic vectors derived from target class names (i.e., CLIP text embedding). To ensure the generation of informative samples for training an effective ZSL classifier, our GenZSL incorporates two key strategies. Firstly, it employs class diversity promotion to enhance the diversity of class semantic vectors. Secondly, it utilizes target class-guided information boosting criteria to optimize the model. Extensive experiments conducted on three popular benchmark datasets showcase the superiority and potential of our GenZSL with significant efficacy and efficiency over f-VAEGAN, e.g., 24.7\% performance gains and more than $60\times$ faster training speed on AWA2.
Paperid:2817
Authors:Faisal Hamman · Sachindra P Dissanayake · Saumitra Mishra · Freddy Lecue · Sanghamitra Dutta
Abstract:
Fine-tuning LLMs on tabular classification tasks can lead to the phenomenon of \emph{fine-tuning multiplicity}, where equally well-performing models make conflicting predictions on the same input. Fine-tuning multiplicity can arise due to variations in the training process, e.g., seed, weight initialization, or minor changes to the training data, raising concerns about the reliability of Tabular LLMs in high-stakes applications such as finance, hiring, education, and healthcare. Our work formalizes this unique challenge of fine-tuning multiplicity in Tabular LLMs and proposes a novel measure to quantify the stability of individual predictions without expensive model retraining. Our measure quantifies a prediction's stability by analyzing (sampling) the model's local behavior around that input in the embedding space. Interestingly, we show that sampling in the local neighborhood can be leveraged to provide probabilistic guarantees on prediction robustness under a broad class of fine-tuned models, i.e., inputs with sufficiently high stability (as defined by our measure) also remain robust across several fine-tuned models with high probability. We perform experiments on multiple real-world datasets to show that our stability measure preemptively captures robustness under actual multiplicity across several fine-tuned models, outperforming competing measures.
Paperid:2818
Authors:Baohao Liao · Yuhui Xu · Hanze Dong · Junnan Li · Christof Monz · Silvio Savarese · Doyen Sahoo · Caiming Xiong
Abstract:
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD synergistically combines a lightweight draft model with a more powerful target model, incorporating a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD employs a process reward model to evaluate intermediate decoding steps and dynamically decide whether to invoke the target model, optimizing the trade-off between computational cost and output quality. We theoretically demonstrate that a threshold-based mixture strategy achieves an optimal balance between resource utilization and performance. Extensive evaluations on challenging reasoning benchmarks, including Olympiad-level tasks, show that RSD delivers significant efficiency gains over decoding with the target model only (up to 4.4X fewer FLOPs), while achieving significantly better accuracy than the parallel decoding method on average (up to +3.5). These results highlight RSD as a robust and cost-effective approach for deploying LLMs in resource-intensive scenarios.
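The threshold-based mixture described above can be sketched in a few lines: keep a draft step whenever the process reward model scores it above a threshold, and fall back to the expensive target model otherwise. All names below are ours, and the toy reward/target functions only stand in for the real models.

```python
def rsd_decode(draft_steps, reward_model, target_model, tau):
    # Threshold-based mixture sketch: accept a draft step when the
    # process reward model scores it at least tau; otherwise invoke
    # the target model for that step. Counting target calls shows the
    # compute saved relative to always decoding with the target.
    out, target_calls = [], 0
    for step in draft_steps:
        if reward_model(step) >= tau:
            out.append(step)            # cheap draft step accepted
        else:
            out.append(target_model(step))
            target_calls += 1           # expensive fallback
    return out, target_calls

steps = ["a", "bb", "c", "dddd"]
reward = len                # toy reward: longer step scores higher
target = str.upper          # toy stand-in for the target model
out, calls = rsd_decode(steps, reward, target, tau=2)
print(out, calls)           # ['A', 'bb', 'C', 'dddd'] 2
```

Raising tau trades compute for quality: at tau = 0 the target model is never called, while a very large tau recovers target-only decoding.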
Paperid:2819
Authors:Jerome Garnier-Brun · Marc Mezard · Emanuele Moscato · Luca Saglietti
Abstract:
Understanding the learning process and the embedded computation in transformers is becoming a central goal for the development of interpretable AI. In the present study, we introduce a hierarchical filtering procedure for data models of sequences on trees, allowing us to hand-tune the range of positional correlations in the data. Leveraging this controlled setting, we provide evidence that vanilla encoder-only transformers can approximate the exact inference algorithm when trained on root classification and masked language modeling tasks, and study how this computation is discovered and implemented. We find that correlations at larger distances, corresponding to increasing layers of the hierarchy, are sequentially included by the network during training. By comparing attention maps from models trained with varying degrees of filtering and by probing the different encoder levels, we find clear evidence of a reconstruction of correlations on successive length scales corresponding to the various levels of the hierarchy, which we relate to a plausible implementation of the exact inference algorithm within the same architecture.
Paperid:2820
Authors:Guiliang Liu · Yueci Deng · Runyi Zhao · Huayi Zhou · Jian Chen · Jietao Chen · Ruiyan Xu · Yunxin Tai · Kui Jia
Abstract:
A critical prerequisite for achieving generalizable robot control is the availability of a large-scale robot training dataset. Due to the expense of collecting realistic robotic data, recent studies explored simulating and recording robot skills in virtual environments. While simulated data can be generated at higher speeds, lower costs, and larger scales, the applicability of such simulated data remains questionable due to the gap between simulated and realistic environments. To advance the Sim2Real generalization, in this study, we present DexSim, a data engine designed to perform automatic skills simulation and scaling for learning deployable robot manipulation policies. Specifically, DexSim ensures the usability of simulated skills by integrating diverse forms of realistic data into the simulated environment, preserving semantic alignment with the target tasks. For each simulated skill in the environment, DexSim facilitates effective Sim2Real data scaling by automating the process of domain randomization and adaptation. Tuned by the scaled dataset, the control policy achieves zero-shot Sim2Real generalization across diverse tasks, multiple robot embodiments, and widely studied policy model architectures, highlighting its importance in advancing Sim2Real embodied intelligence.
Paperid:2821
Authors:Annie Marsden · Evan Dogariu · Naman Agarwal · Xinyi Chen · Daniel Suo · Elad Hazan
Abstract:
We consider the problem of length generalization in sequence prediction. We define a new metric of performance in this setting – the Asymmetric-Regret – which measures regret against a benchmark predictor with longer context length than available to the learner. We continue by studying this concept through the lens of the spectral filtering algorithm. We present a gradient-based learning algorithm that provably achieves length generalization for linear dynamical systems. We conclude with proof-of-concept experiments which are consistent with our theory.
Paperid:2822
Authors:Shangbin Feng · Zifeng Wang · Yike Wang · Sayna Ebrahimi · Hamid Palangi · Lesly Miculicich · Achin Kulshrestha · Nathalie Rauschmayr · Yejin Choi · Yulia Tsvetkov · Chen-Yu Lee · Tomas Pfister
Abstract:
We propose Model Swarms, a collaborative search algorithm to adapt LLMs via swarm intelligence, the collective behavior guiding individual systems. Specifically, Model Swarms starts with a pool of LLM experts and a utility function. Guided by the best-found checkpoints across models, diverse LLM experts collaboratively move in the weight space and optimize a utility function representing model adaptation objectives. Compared to existing model composition approaches, Model Swarms offers tuning-free model adaptation, works in low-data regimes with as few as 200 examples, and does not require assumptions about specific experts in the swarm or how they should be composed. Extensive experiments demonstrate that Model Swarms could flexibly adapt LLM experts to a single task, multi-task domains, reward models, as well as diverse human interests, improving over 12 model composition baselines by up to 21.0% across tasks and contexts. Further analysis reveals that LLM experts discover previously unseen capabilities in initial checkpoints and that Model Swarms enable the weak-to-strong transition of experts through the collaborative search process.
Paperid:2823
Authors:Ruoxin Yuan · Guanhua Fang
Abstract:
This work introduces Residual TPP, a novel, unified, and lightweight approach for analyzing event stream data. It leverages the strengths of both simple statistical TPPs and expressive neural TPPs to achieve superior performance. Specifically, we propose the Residual Events Decomposition (RED) technique in temporal point processes, which defines a weight function to quantify how well the intensity function captures the event characteristics. The RED serves as a flexible, plug-and-play module that can be integrated with any TPP model in a wide range of tasks. It enables the identification of events for which the intensity function provides a poor fit, referred to as residual events. By combining RED with a Hawkes process, we capture the self-exciting nature of the data and identify residual events. Then an arbitrary neural TPP is employed to take care of residual events. Extensive experimental results demonstrate that Residual TPP consistently achieves state-of-the-art goodness-of-fit and prediction performance in multiple domains and offers significant computational advantages as well.
Paperid:2824
Authors:Vincent-Pierre Berges · Barlas Oğuz · Daniel HAZIZA · Scott Yih · Luke Zettlemoyer · Gargi Ghosh
Abstract:
Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, sparsely activated memory layers complement compute-heavy dense feed-forward layers, providing dedicated capacity to store and retrieve information cheaply. This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale. On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the computation budget, as well as mixture-of-expert models when matched for both compute and parameters. We find gains are especially pronounced for factual tasks. We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, comparing to base models with up to 8B parameters.
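A minimal, non-trainable sketch of the key-value lookup such a layer performs: score all keys against the input, keep the top-k, and mix their values. The paper's layer learns the keys and values and runs fully parallelized; the shapes and interface here are illustrative assumptions:

```python
import numpy as np

def memory_lookup(x, keys, values, k=2):
    """Sparse memory lookup: only the top-k keys are activated, so the
    number of memory parameters can grow without increasing per-token FLOPs
    proportionally."""
    scores = keys @ x                       # (num_keys,) similarity scores
    top = np.argpartition(scores, -k)[-k:]  # indices of the k best keys
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                            # softmax over the selected keys
    return w @ values[top]                  # (value_dim,) retrieved content
```

Only k rows of `values` are touched per query, which is the source of the cheap storage/retrieval the abstract describes.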
Paperid:2825
Authors:Yifei Xu · Tusher Chakraborty · Emre Kiciman · Srinagesh Sharma · Bibek Aryal · Songwu Lu · Ranveer Chandra
Abstract:
Fine-tuning large language models (LLMs) to align with user preferences is challenging due to the high cost of quality human annotations in Reinforcement Learning from Human Feedback (RLHF) and the generalizability limitations of AI Feedback. To address these challenges, we propose Sargy, a human-AI hybrid framework that combines LLM-based initial alignment with selective human annotations to achieve near-human annotation quality with minimal effort. Sargy identifies hard-to-annotate samples mislabeled by LLMs using a reward model’s reward distribution and iteratively enhances data quality by integrating strategic human corrections while leveraging the LLM's correctly labeled samples. Evaluations on HH-RLHF and TL;DR datasets show that Sargy reaches oracle-level alignment with only 15–20\% of the human annotation effort. Moreover, models trained on Sargy's filtered datasets for downstream tasks achieve performance on par with oracle models.
Paperid:2826
Authors:Qiwei Di · Jiafan He · Quanquan Gu
Abstract:
Learning from human feedback plays an important role in aligning generative models, such as large language models (LLMs). However, the effectiveness of this approach can be influenced by adversaries, who may intentionally provide misleading preferences to manipulate the output in an undesirable or harmful direction. To tackle this challenge, we study a specific model within this problem domain: contextual dueling bandits with adversarial feedback, where the true preference label can be flipped by an adversary. We propose an algorithm, robust contextual dueling bandits (\algo), which is based on uncertainty-weighted maximum likelihood estimation. Our algorithm achieves an $\tilde O(d\sqrt{T}/\kappa+dC/\kappa)$ regret bound, where $T$ is the number of rounds, $d$ is the dimension of the context, $\kappa$ is the lower bound of the derivative of the link function, and $0 \le C \le T$ is the total amount of adversarial feedback. We also prove a lower bound to show that our regret bound is nearly optimal, both in scenarios with and without ($C=0$) adversarial feedback. Our work is the first to achieve nearly minimax optimal regret for dueling bandits in the presence of adversarial preference feedback. Additionally, for the sigmoid link function, we develop a novel algorithm that takes the effect of local derivatives into account in the maximum likelihood estimation (MLE) analysis through a refined method for estimating the link function's derivative. This method helps us eliminate the $\kappa$ dependence in the leading term with respect to $T$, which reduces the exponential dependence on the parameter radius $B$ to a polynomial dependence. We conduct experiments to evaluate our proposed algorithm \algo\ against various types of adversarial feedback. Experimental results demonstrate its superiority over the state-of-the-art dueling bandit algorithms in the presence of adversarial feedback.
Paperid:2827
Authors:Tian Bian · Yifan Niu · Chaohao Yuan · Chengzhi Piao · Bingzhe Wu · Long-Kai Huang · Yu Rong · Tingyang Xu · Hong Cheng · Jia Li
Abstract:
Circuit discovery has recently attracted attention as a potential research direction to explain the nontrivial behaviors of language models. It aims to find the computational subgraphs, also known as circuits, within the model that are responsible for solving specific tasks. However, most existing studies overlook the holistic nature of these circuits and require designing specific corrupted activations for different tasks, which is inaccurate and inefficient. In this work, we propose an end-to-end approach based on the principle of Information Bottleneck, called IBCircuit, to holistically identify informative circuits. In contrast to traditional causal interventions, IBCircuit is an optimization framework for holistic circuit discovery and can be applied to any given task without tediously corrupted activation design. In both the Indirect Object Identification (IOI) and Greater-Than tasks, IBCircuit identifies more faithful and minimal circuits in terms of critical node components and edge components compared to recent related work.
Paperid:2828
Authors:Haoming Jing · Yorie Nakahira
Abstract:
Many systems contain latent variables that make their dynamics partially unidentifiable or cause distribution shifts in the observed statistics between offline and online data. However, existing control techniques often assume access to complete dynamics or perfect simulators with fully observable states, which are necessary to verify whether the system remains within a safe set (forward invariance) or safe actions are consistently feasible at all times. To address this limitation, we propose a technique for designing probabilistic safety certificates for systems with latent variables. A key technical enabler is the formulation of invariance conditions in probability space, which can be constructed using observed statistics in the presence of distribution shifts due to latent variables. We use this invariance condition to construct a safety certificate that can be implemented efficiently in real-time control. The proposed safety certificate can continuously find feasible actions that control long-term risk to stay within tolerance. Stochastic safe control and (causal) reinforcement learning have been studied in isolation until now. To the best of our knowledge, the proposed work is the first to use causal reinforcement learning to quantify long-term risk for the design of safety certificates. This integration enables safety certificates to efficiently ensure long-term safety in the presence of latent variables. The effectiveness of the proposed safety certificate is demonstrated in numerical simulations.
Paperid:2829
Authors:HUAKUN LUO · Haixu Wu · Hang Zhou · Lanxiang Xing · Yichen Di · Jianmin Wang · Mingsheng Long
Abstract:
Although deep models have been widely explored in solving partial differential equations (PDEs), previous works are primarily limited to data with only up to tens of thousands of mesh points, far from the million-point scale required by industrial simulations that involve complex geometries. In the spirit of advancing neural PDE solvers to real industrial applications, we present Transolver++, a highly parallel and efficient neural solver that can accurately solve PDEs on million-scale geometries. Building upon previous advancements in solving PDEs by learning physical states via Transolver, Transolver++ is further equipped with an extremely optimized parallelism framework and a local adaptive mechanism to efficiently capture eidetic physical states from massive mesh points, successfully tackling the thorny challenges in computation and physics learning when scaling up input mesh size. Transolver++ increases the single-GPU input capacity to million-scale points for the first time and is capable of continuously scaling input size in linear complexity by increasing GPUs. Experimentally, Transolver++ yields a 13\% relative improvement across six standard PDE benchmarks and achieves over 20\% performance gain in million-scale high-fidelity industrial simulations, whose sizes are 100$\times$ larger than previous benchmarks, covering car and 3D aircraft designs.
Paperid:2830
Authors:Jusheng Zhang · Zimeng Huang · Yijia Fan · Ningyuan Liu · Mingyan Li · Zhuojie Yang · Jiawei Yao · Jian Wang · Keze Wang
Abstract:
As scaling large language models faces prohibitive costs, multi-agent systems emerge as a promising alternative, though challenged by static knowledge assumptions and coordination inefficiencies. We introduce Knowledge-Aware Bayesian Bandits (KABB), a novel framework that enhances multi-agent system coordination through semantic understanding and dynamic adaptation. The framework features three key innovations: a three-dimensional knowledge distance model for deep semantic understanding, a dual-adaptation mechanism for continuous expert optimization, and a knowledge-aware Thompson Sampling strategy for efficient expert selection. Extensive evaluation demonstrates KABB achieves an optimal cost-performance balance, maintaining high performance while keeping computational demands relatively low in multi-agent coordination.
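The expert-selection step can be illustrated with plain Beta-Bernoulli Thompson sampling; the paper's knowledge-aware version additionally biases the posteriors with a semantic knowledge-distance model, which is replaced here by a flat prior:

```python
import numpy as np

def thompson_select(success, failure, rng):
    """Sample a plausible success rate for each expert from its Beta
    posterior and pick the argmax (flat Beta(1,1) prior; KABB would
    shift these priors using knowledge distance)."""
    samples = rng.beta(success + 1, failure + 1)
    return int(np.argmax(samples))

def run_bandit(true_probs, rounds, seed=0):
    """Simulate expert selection against fixed per-expert success rates."""
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    success, failure = np.zeros(k), np.zeros(k)
    pulls = np.zeros(k, dtype=int)
    for _ in range(rounds):
        i = thompson_select(success, failure, rng)
        pulls[i] += 1
        if rng.random() < true_probs[i]:
            success[i] += 1
        else:
            failure[i] += 1
    return pulls
```

Over time the sampler concentrates its pulls on the strongest expert while still occasionally exploring the others.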
Paperid:2831
Authors:Willem Diepeveen · Georgios Batzolis · Zakhar Shumaylov · Carola-Bibiane Schönlieb
Abstract:
Data-driven Riemannian geometry has emerged as a powerful tool for interpretable representation learning, offering improved efficiency in downstream tasks. Moving forward, it is crucial to balance cheap manifold mappings with efficient training algorithms. In this work, we integrate concepts from pullback Riemannian geometry and generative models to propose a framework for data-driven Riemannian geometry that is scalable in both geometry and learning: score-based pullback Riemannian geometry. Focusing on unimodal distributions as a first step, we propose a score-based Riemannian structure with closed-form geodesics that pass through the data probability density. With this structure, we construct a Riemannian autoencoder (RAE) with error bounds for discovering the correct data manifold dimension. This framework can naturally be used with anisotropic normalizing flows by adopting isometry regularization during training. Through numerical experiments on diverse datasets, including image data, we demonstrate that the proposed framework produces high-quality geodesics passing through the data support, reliably estimates the intrinsic dimension of the data manifold, and provides a global chart of the manifold. To the best of our knowledge, this is the first scalable framework for extracting the complete geometry of the data manifold.
Paperid:2832
Authors:Shu Wei · Yanjie Li · Lina Yu · Min Wu · Linjun Sun · Weijun Li · Jingyi Liu · Hong Qin · Deng Yusong · Jufeng Han · Yan Pang
Abstract:
The pursuit of analytical solutions for differential equations has historically been limited by the need for extensive prior knowledge and mathematical prowess; however, machine learning methods like genetic algorithms have recently been applied to this end, albeit with issues of significant time consumption and complexity. This paper presents a novel machine learning-based solver, SSDE (Symbolic Solver for Differential Equations), which employs reinforcement learning to derive symbolic closed-form solutions for various differential equations. Our evaluations on a range of ordinary and partial differential equations demonstrate that SSDE provides superior performance in achieving analytical solutions compared to other machine learning approaches.
Paperid:2833
Authors:Dhruv Rohatgi · Tanya Marwah · Zachary Lipton · Jianfeng Lu · Ankur Moitra · Andrej Risteski
Abstract:
Graph neural networks (GNNs) are the dominant approach to solving machine learning problems defined over graphs. Despite much theoretical and empirical work in recent years, our understanding of finer-grained aspects of architectural design for GNNs remains impoverished. In this paper, we consider the benefits of architectures that maintain and update edge embeddings. On the theoretical front, under a suitable computational abstraction for a layer in the model, as well as memory constraints on the embeddings, we show that there are natural tasks on graphical models for which architectures leveraging edge embeddings can be much shallower. Our techniques are inspired by results on time-space tradeoffs in theoretical computer science. Empirically, we show architectures that maintain edge embeddings almost always improve on their node-based counterparts---frequently significantly so in topologies that have "hub" nodes.
Paperid:2834
Authors:Varun Babbar · Hayden McTavish · Cynthia Rudin · Margo Seltzer
Abstract:
Decision tree optimization is fundamental to interpretable machine learning. The most popular approach is to greedily search for the best feature at every decision point, which is fast but provably suboptimal. Recent approaches find the global optimum using branch and bound with dynamic programming, showing substantial improvements in accuracy and sparsity at great cost to scalability. An ideal solution would have the accuracy of an optimal method and the scalability of a greedy method. We introduce a family of algorithms called SPLIT (SParse Lookahead for Interpretable Trees) that moves us significantly forward in achieving this ideal balance. We demonstrate that not all subproblems need to be solved to optimality to find high quality trees; greediness suffices near the leaves. Since each depth adds an exponential number of possible trees, this change makes our algorithms orders of magnitude faster than existing optimal methods, with negligible loss in performance. We extend this algorithm to allow scalable computation of sets of near-optimal trees (i.e., the Rashomon set).
Paperid:2835
Authors:Jasper Dekoninck · Maximilian Baader · Martin Vechev
Abstract:
The availability of a wide range of large language models (LLMs) embedded in various agentic systems has significantly increased the potential of model selection strategies to improve the cost-performance tradeoff. Existing strategies involve either routing, where a single model is chosen per query, or cascading, which sequentially runs increasingly larger models until a satisfactory answer is found. However, current approaches face three key limitations: they (1) lack formal proofs of optimality, (2) fail to identify the conditions under which these strategies are most effective to improve the cost-performance tradeoff, and (3) are unable to combine both paradigms for further improvements. To address these issues, we first derive a novel optimal strategy for cascading and prove the optimality of an existing routing strategy. Further, we propose cascade routing, a unified framework that integrates routing and cascading into a theoretically optimal strategy. Through our analysis, we identify good quality estimators as the critical factor for the success of model selection paradigms. Finally, in our experiments, we show that cascade routing consistently outperforms the individual approaches by a large margin and we analyze quality estimators to determine when routing and/or cascading are useful paradigms for model selection.
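The plain cascading baseline the abstract contrasts against can be sketched in a few lines; `models` and `quality` are hypothetical interfaces for illustration, and cascade routing generalizes this by also choosing, per query, which models enter the cascade:

```python
def cascade(query, models, quality, threshold):
    """Run models cheapest-first and stop as soon as the estimated
    quality of the current answer clears the threshold. `models` is a
    list of (name, cost) pairs; `quality` is a per-model quality
    estimator for the query."""
    total_cost, chosen = 0.0, None
    for name, cost in models:
        total_cost += cost
        chosen = name
        if quality(name, query) >= threshold:
            break               # estimator says this answer is good enough
    return chosen, total_cost
```

This makes concrete why the quality estimator is the critical component: a miscalibrated `quality` either stops too early (cheap but wrong) or never stops (accurate but expensive).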
Paperid:2836
Authors:Shanchuan Lin · Xin Xia · Yuxi Ren · Ceyuan Yang · Xuefeng Xiao · Lu Jiang
Abstract:
Diffusion models are widely used for image and video generation, but their iterative generation process is slow and expensive. While existing distillation approaches have demonstrated the potential for one-step generation in the image domain, they still suffer from significant quality degradation. In this work, we propose Adversarial Post-Training (APT) against real data following diffusion pre-training for one-step video generation. To improve the training stability and quality, we introduce several improvements to the model architecture and training procedures, along with an approximated R1 regularization objective. Empirically, our experiments show that our adversarial post-trained model can generate two-second, 1280x720, 24fps videos in real-time using a single forward evaluation step. Additionally, our model is capable of generating 1024px images in a single step, achieving quality comparable to state-of-the-art methods.
Paperid:2837
Authors:Qinyu Zhao · Ming Xu · Kartik Gupta · Akshay Asthana · Liang Zheng · Stephen Gould
Abstract:
Evaluating large vision-language models (LVLMs) is very expensive, due to the high computational costs and the wide variety of tasks. The good news is that if we already have some observed performance scores, we may be able to infer unknown ones. In this study, we propose a new framework for predicting unknown performance scores based on observed ones from other LVLMs or tasks. We first formulate the performance prediction as a matrix completion task. Specifically, we construct a sparse performance matrix $\boldsymbol{R}$, where each entry $R_{mn}$ represents the performance score of the $m$-th model on the $n$-th dataset. By applying probabilistic matrix factorization (PMF) with Markov chain Monte Carlo (MCMC), we can complete the performance matrix, that is, predict unknown scores. Additionally, we estimate the uncertainty of performance prediction based on MCMC. Practitioners can evaluate their models on untested tasks with higher uncertainty first, quickly reducing errors in performance prediction. We further introduce several improvements to enhance PMF for scenarios with sparse observed performance scores. Our experiments demonstrate the accuracy of PMF in predicting unknown scores, the reliability of uncertainty estimates in ordering evaluations, and the effectiveness of our enhancements for handling sparse data.
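The matrix-completion step can be sketched with a MAP (gradient-descent) variant of PMF; the paper uses full MCMC, which additionally yields the uncertainty estimates. The rank, learning rate, and regularization below are illustrative assumptions:

```python
import numpy as np

def pmf_complete(R, mask, rank=2, lam=0.1, lr=0.02, iters=3000, seed=0):
    """Complete a sparse score matrix R (models x tasks) by fitting
    low-rank factors U, V with gradient descent on the observed entries
    only. `mask` is 1 where a score was observed, 0 otherwise."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = 0.1 * rng.standard_normal((m, rank))
    V = 0.1 * rng.standard_normal((n, rank))
    for _ in range(iters):
        E = mask * (U @ V.T - R)           # error on observed entries only
        U -= lr * (E @ V + lam * U)
        V -= lr * (E.T @ U + lam * V)
    return U @ V.T                          # predicted full score matrix
```

The unobserved entries of the returned matrix are the predicted scores, recovered from the low-rank structure shared across models and tasks.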
Paperid:2838
Authors:Muskan Dosi · Chiranjeev Chiranjeev · Kartik Thakral · Mayank Vatsa · Richa Singh
Abstract:
Do contemporary diffusion models preserve the class geometry of hyperspherical data? Standard diffusion models rely on isotropic Gaussian noise in the forward process, inherently favoring Euclidean spaces. However, many real-world problems involve non-Euclidean distributions, such as hyperspherical manifolds, where class-specific patterns are governed by angular geometry within hypercones. When modeled in Euclidean space, these angular subtleties are lost, leading to suboptimal generative performance. To address this limitation, we introduce directional noise aligned with hyperspherical structures, preserving class geometry and effectively capturing angular uncertainty. Both theoretically and empirically, we demonstrate that this approach aligns the generative process with the intrinsic geometry of hyperspherical data, resulting in more accurate and geometry-aware generative models. We evaluate our framework on four object datasets and two face datasets, showing that incorporating angular uncertainty better preserves the underlying hyperspherical manifold. Resources are available at: \href{https://anonymousillusion.github.io/anon-icml/}{Anonymous Link.}
Paperid:2839
Authors:Yunbei Zhang · Akshay Mehra · Shuaicheng Niu · Jihun Hamm
Abstract:
Continual Test-Time Adaptation (CTTA) seeks to adapt source pre-trained models to continually changing, unseen target domains. While existing CTTA methods assume structured domain changes with uniform durations, real-world environments often exhibit dynamic patterns where domains recur with varying frequencies and durations. Current approaches, which adapt the same parameters across different domains, struggle in such dynamic conditions—they face convergence issues with brief domain exposures, risk forgetting previously learned knowledge, or misapplying it to irrelevant domains. To remedy this, we propose DPCore, a method designed for robust performance across diverse domain change patterns while ensuring computational efficiency. DPCore integrates three key components: Visual Prompt Adaptation for efficient domain alignment, a Prompt Coreset for knowledge preservation, and a Dynamic Update mechanism that intelligently adjusts existing prompts for similar domains while creating new ones for substantially different domains. Extensive experiments on four benchmarks demonstrate that DPCore consistently outperforms various CTTA methods, achieving state-of-the-art performance in both structured and dynamic settings while reducing trainable parameters by 99\% and computation time by 64\% compared to previous approaches.
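The dynamic-update logic (reuse a stored prompt for a similar domain, create a new one otherwise) resembles online clustering and can be sketched as follows. `stats` as a batch-statistics vector, the 0.9/0.1 refresh rate, and the prompt shape are all assumed interfaces, not the paper's:

```python
import numpy as np

def dynamic_update(coreset, stats, tau):
    """Match the current domain statistics against stored domain
    centroids: if the nearest one is within tau, refresh it and reuse
    its prompt; otherwise append a fresh (centroid, prompt) entry."""
    if coreset:
        dists = [np.linalg.norm(stats - c) for c, _ in coreset]
        i = int(np.argmin(dists))
        if dists[i] < tau:
            c, prompt = coreset[i]
            coreset[i] = (0.9 * c + 0.1 * stats, prompt)  # refresh the match
            return i
    coreset.append((stats.copy(), np.zeros(4)))           # new domain prompt
    return len(coreset) - 1
```

Because recurring domains hit the distance test instead of spawning new parameters, adaptation cost stays bounded even when domains revisit with varying frequency.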
Paperid:2840
Authors:Hao Li · Hao Wan · Yuzhou Chen · Dongsheng Ye · Yulia Gel · Hao Jiang
Abstract:
Dynamic graphs evolve continuously, presenting challenges for traditional graph learning due to their changing structures and temporal dependencies. Recent advancements have shown potential in addressing these challenges by developing suitable meta-learning-based dynamic graph neural network models. However, most meta-learning approaches for dynamic graphs rely on fixed weight update parameters, neglecting the essential intrinsic complex high-order topological information of dynamically evolving graphs. We have designed Dowker Zigzag Persistence (DZP), an efficient and stable dynamic graph persistent homology representation method based on Dowker complex and zigzag persistence, to capture the high-order features of dynamic graphs. Armed with the DZP ideas, we propose TMetaNet, a new meta-learning parameter update model based on dynamic topological features. By utilizing the distances between high-order topological features, TMetaNet enables more effective adaptation across snapshots. Experiments on real-world datasets demonstrate TMetaNet's state-of-the-art performance and resilience to graph noise, illustrating its high potential for meta-learning and dynamic graph analysis.
Paperid:2841
Authors:Kuan Po Huang · Shu-wen Yang · Huy Phan · Bo-Ru Lu · Byeonggeun Kim · Sashank Macha · Qingming Tang · Shalini Ghosh · Hung-yi Lee · Chieh-Chi Kao · Chao Wang
Abstract:
Text-to-audio generation synthesizes realistic sounds or music given a natural language prompt. Diffusion-based frameworks, including the Tango and the AudioLDM series, represent the state-of-the-art in text-to-audio generation. Despite achieving high audio fidelity, they incur significant inference latency due to the slow diffusion sampling process. MAGNET, a mask-based model operating on discrete tokens, addresses slow inference through iterative mask-based parallel decoding. However, its audio quality still lags behind that of diffusion-based models. In this work, we introduce IMPACT, a text-to-audio generation framework that achieves high performance in audio quality and fidelity while ensuring fast inference. IMPACT utilizes iterative mask-based parallel decoding in a continuous latent space powered by diffusion modeling. This approach eliminates the fidelity constraints of discrete tokens while maintaining competitive inference speed. Results on AudioCaps demonstrate that IMPACT achieves state-of-the-art performance on key metrics including Fréchet Distance (FD) and Fréchet Audio Distance (FAD) while significantly reducing latency compared to prior models.
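The iterative mask-based parallel decoding that MAGNET-style models use (and that IMPACT lifts into a continuous latent space) follows a MaskGIT-like schedule: start fully masked and commit the most confident predictions each step. A discrete-token sketch, with `logits_fn` as a hypothetical stand-in for the token predictor:

```python
import numpy as np

def iterative_mask_decode(logits_fn, length, steps=4):
    """Start with all positions masked (-1); at each step, commit the
    highest-confidence predictions so that after `steps` rounds every
    position is filled. `logits_fn` maps the partial sequence to
    per-position probabilities over the vocabulary."""
    tokens = np.full(length, -1)            # -1 marks a masked position
    for s in range(steps):
        probs = logits_fn(tokens)           # (length, vocab)
        pred = probs.argmax(axis=1)
        conf = probs.max(axis=1)
        conf[tokens != -1] = -np.inf        # never revisit committed tokens
        n_commit = int(np.ceil(length * (s + 1) / steps)) - (tokens != -1).sum()
        for i in np.argsort(conf)[::-1][:n_commit]:
            tokens[i] = pred[i]
    return tokens
```

Decoding takes `steps` model calls regardless of sequence length, which is the source of the latency advantage over token-by-token or many-step diffusion sampling.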
Paperid:2842
Authors:Taeckyung Lee · Sorn Chottananurak · Junsu Kim · Jinwoo Shin · Taesik Gong · Sung-Ju Lee
Abstract:
Deep learning models perform poorly when domain shifts exist between training and test data. Test-time adaptation (TTA) is a paradigm to mitigate this issue by adapting pre-trained models using only unlabeled test samples. However, existing TTA methods can fail under severe domain shifts, while recent active TTA approaches requiring full-class labels are impractical due to high labeling costs. To address this issue, we introduce a new setting of TTA with binary feedback, which uses a few binary feedbacks from annotators to indicate whether model predictions are correct, thereby significantly reducing the labeling burden of annotators. Under the setting, we propose BiTTA, a novel dual-path optimization framework that leverages reinforcement learning to balance binary feedback-guided adaptation on uncertain samples with agreement-based self-adaptation on confident predictions. Experiments show BiTTA achieves substantial accuracy improvements over state-of-the-art baselines, demonstrating its effectiveness in handling severe distribution shifts with minimal labeling effort.
Paperid:2843
Authors:Yinghui Li · Jiayi Kuang · Haojing Huang · Zhikun Xu · Xinnian Liang · Yi Yu · Wenlian Lu · Yangning Li · Xiaoyu Tan · Chao Qu · Ying Shen · Hai-Tao Zheng · Philip Yu
Abstract:
Leveraging mathematical Large Language Models (LLMs) for proof generation is a fundamental topic in LLMs research. We argue that the ability of current LLMs to prove statements largely depends on whether they have encountered the relevant proof process during training. This reliance limits their deeper understanding of mathematical theorems and related concepts. Inspired by the pedagogical method of "proof by counterexamples" commonly used in human mathematics education, our work aims to enhance LLMs’ ability to conduct mathematical reasoning and proof through counterexamples. Specifically, we manually create a high-quality, university-level mathematical benchmark, COUNTERMATH, which requires LLMs to prove mathematical statements by providing counterexamples, thereby assessing their grasp of mathematical concepts. Additionally, we develop a data engineering framework to automatically obtain training data for further model improvement. Extensive experiments and detailed analyses demonstrate that COUNTERMATH is challenging, indicating that LLMs, such as OpenAI o1, have insufficient counterexample-driven proof capabilities. Moreover, our exploration into model training reveals that strengthening LLMs' counterexample-driven conceptual reasoning abilities is crucial for improving their overall mathematical capabilities. We believe that our work offers new perspectives on the community of mathematical LLMs.
Paperid:2844
Authors:Yuan Gao · Hao Wu · Ruiqi Shu · huanshuo dong · Fan Xu · Rui Chen · Yibo Yan · Qingsong Wen · Xuming Hu · Kun Wang · Jiahao Wu · Qing Li · Hui Xiong · Xiaomeng Huang
Abstract:
Accurate weather forecasts are important for disaster prevention, agricultural planning, and water resource management. Traditional numerical weather prediction (NWP) methods offer physically interpretable high-accuracy predictions but are computationally expensive and fail to fully leverage rapidly growing historical data. In recent years, deep learning methods have made significant progress in weather forecasting, but challenges remain, such as balancing global and regional high-resolution forecasts, excessive smoothing in extreme event predictions, and insufficient dynamic system modeling. To address these issues, this paper proposes a global-regional nested weather forecasting framework based on graph neural networks (GNNs). By combining a dynamic system perspective with multi-grid theory, we construct a multi-scale graph structure and densify the target region to capture local high-frequency features. We introduce an adaptive information propagation mechanism, using dynamic gating units to deeply integrate node and edge features for more accurate extreme event forecasting. For high-resolution regional forecasts, we propose a neural nested grid method to mitigate boundary information loss. Experimental results show that the proposed method performs excellently across global to regional scales and short-term to long-term forecasts, especially in extreme event predictions (e.g., typhoons), significantly improving forecast accuracy. Our codes are available at \url{https://anonymous.4open.science/r/Anonymous-EE6B}.
Paperid:2845
Authors:Deng · Ruidi Chang · Hanjie Chen
Abstract:
Interventions in language models (LMs) are frequently employed in interpretability research to modify model behavior during the forward pass. Learnable interventions, also known as representation fine-tuning, aim to apply pointwise control within the concept subspace and have proven effective in altering high-level behaviors. In this work, we extend this approach to the distribution level, enabling the model to learn not only pointwise transformations but also the surrounding regions of the concept subspace. We demonstrate that these methods perform optimally in early layers, with larger standard deviations correlating strongly with improved performance. Across eight commonsense reasoning and seven arithmetic reasoning benchmarks, our distribution-level interventions consistently outperform pointwise interventions in controllability and robustness. These results illustrate that distribution-level interventions provide a more comprehensive method for steering model behavior, advancing interpretability research, and enabling finer-grained control over language models.
Paperid:2846
Authors:Guiomar Pescador Barrios · Sarah Filippi · Mark van der Wilk
Abstract:
Many machine learning models require setting a parameter that controls their size before training, e.g. the number of neurons in DNNs, or inducing points in GPs. Increasing capacity typically improves performance until all the information from the dataset is captured. After this point, computational cost keeps increasing without improved performance. This leads to the question "How big is big enough?" We investigate this problem for Gaussian processes (single-layer neural networks) in continual learning. Here, data becomes available incrementally, and the final dataset size will therefore not be known before training, preventing the use of heuristics for setting a fixed model size. We develop a method to automatically adjust model size while maintaining near-optimal performance. Our experimental procedure follows the constraint that any hyperparameters must be set without seeing dataset properties, and we show that our method performs well across diverse datasets without the need to adjust its hyperparameter, showing it requires less tuning than others.
Paperid:2847
Authors:Thomas Fel · Ekdeep Singh Lubana · Jacob Prince · Matthew Kowal · Victor Boutin · Isabel Papadimitriou · Binxu Wang · Martin Wattenberg · Demba Ba · Talia Konkle
Abstract:
Sparse Autoencoders (SAEs) have emerged as a powerful framework for machine learning interpretability, enabling the unsupervised decomposition of model representations into a dictionary of abstract, human-interpretable concepts. However, we reveal a fundamental limitation: SAEs exhibit severe instability, as identical models trained on similar datasets can produce sharply different dictionaries, undermining their reliability as an interpretability tool. To address this issue, we draw inspiration from the Archetypal Analysis framework introduced by Cutler & Breiman (1994) and present Archetypal SAEs (A-SAE), wherein dictionary atoms are constrained to the data’s convex hull. This geometric anchoring significantly enhances the stability and plausibility of inferred dictionaries, and their mildly relaxed variants, RA-SAEs, further match state-of-the-art reconstruction abilities. To rigorously assess the quality of dictionaries learned by SAEs, we introduce two new benchmarks that test (i) plausibility, i.e., whether dictionaries recover “true” classification directions, and (ii) identifiability, i.e., whether dictionaries disentangle synthetic concept mixtures. Across all evaluations, RA-SAEs consistently yield more structured representations while uncovering novel, semantically meaningful concepts in large-scale vision models.
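The geometric anchoring described in this abstract can be illustrated in a few lines: if each dictionary atom is a softmax-weighted (hence convex) combination of data points, it necessarily lies inside the data's convex hull. The sketch below is only a hedged illustration of that idea, not the A-SAE training code; the function and variable names are invented for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def archetypal_atoms(logits, data):
    """Each dictionary atom is a convex combination of data points,
    so it is guaranteed to lie inside the data's convex hull."""
    atoms = []
    for row in logits:          # one row of logits per atom
        w = softmax(row)        # convex weights: non-negative, sum to 1
        atom = [sum(w[i] * data[i][d] for i in range(len(data)))
                for d in range(len(data[0]))]
        atoms.append(atom)
    return atoms

data = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # toy 2-D points
atoms = archetypal_atoms([[0.0, 0.0, 0.0], [0.0, 5.0, 0.0]], data)
# uniform weights give the centroid; the peaked second row lands near data[1]
```

Training would then optimize the logits (and an encoder) under a reconstruction loss, while the convex-hull constraint is satisfied by construction.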
Paperid:2848
Authors:Roy Uziel · Irit Chelly · Oren Freifeld · Ari Pakman
Abstract:
Diffusion models, widely recognized for their success in generative tasks, have not yet been applied to clustering. We introduce Clustering via Diffusion (CLUDI), a self-supervised framework that combines the generative power of diffusion models with pre-trained Vision Transformer features to achieve robust and accurate clustering. CLUDI is trained via a teacher–student paradigm: the teacher uses stochastic diffusion-based sampling to produce diverse cluster assignments, which the student refines into stable predictions. This stochasticity acts as a novel data augmentation strategy, enabling CLUDI to uncover intricate structures in high-dimensional data. Extensive evaluations on challenging datasets demonstrate that CLUDI achieves state-of-the-art performance in unsupervised classification, setting new benchmarks in clustering robustness and adaptability to complex data distributions. Our code will be released upon acceptance.
Paperid:2849
Authors:Yalan Qin · Nan Pu · Guorui Feng · Nicu Sebe
Abstract:
As a leading unsupervised classification algorithm in artificial intelligence, multi-view subspace clustering segments unlabeled data drawn from different subspaces. Recent anchor-based methods have been proposed to decrease the computational complexity on large-scale datasets in multi-view clustering. The major differences among these methods lie in the objective functions they define. Despite considerable success, these works pay little attention to guaranteeing the robustness of the learned consensus anchors in an effective manner for efficient multi-view clustering, or to investigating the specific local distribution of clusters in the affine subspace. Besides, the robust consensus anchors and the common cluster structure shared by different views cannot be learned simultaneously. In this paper, we propose Robust Consensus anchors learning for efficient multi-view Subspace Clustering (RCSC). We first show that if the data are sufficiently sampled from independent subspaces and the objective function meets some conditions, the achieved anchor graph has a block-diagonal structure. As a special case, we provide a model based on a Frobenius norm with non-negative and affine constraints in consensus anchors learning, which guarantees the robustness of the learned consensus anchors for efficient multi-view clustering and investigates the specific local distribution of clusters in the affine subspace. Although simple, we give a theoretical geometric analysis of the formulated RCSC. The union of these three constraints restricts how each data point is described in the affine subspace with a specific local cluster distribution, guaranteeing the robustness of the learned consensus anchors. RCSC takes full advantage of the correlation among consensus anchors, which encourages the grouping effect and groups highly correlated consensus anchors together under the guidance of the view-specific projection. The anchor graph construction, partition, and robust anchor learning are jointly integrated into a unified framework. This ensures mutual enhancement among these procedures and leads to more discriminative consensus anchors as well as the cluster indicator. We then adopt an alternating optimization strategy to solve the formulated problem. Experiments performed on eight multi-view datasets confirm the superiority of RCSC in terms of both effectiveness and efficiency.
Paperid:2850
Authors:Haotian Wu · Gongpu Chen · Pier Luigi Dragotti · Deniz Gunduz
Abstract:
We introduce and validate the lottery codec hypothesis, which states that untrained subnetworks within randomly initialized networks can serve as synthesis networks for overfitted image compression, achieving rate-distortion (RD) performance comparable to trained networks. This hypothesis leads to a new paradigm for image compression by encoding image statistics into the network substructure. Building on this hypothesis, we propose LotteryCodec, which overfits a binary mask to an individual image, leveraging an over-parameterized and randomly initialized network shared by the encoder and the decoder. To address over-parametrization challenges and streamline subnetwork search, we develop a rewind modulation mechanism that improves the RD performance. LotteryCodec outperforms VTM and sets a new state-of-the-art in single-image compression. LotteryCodec also enables adaptive decoding complexity through adjustable mask ratios, offering flexible compression solutions for diverse device constraints and application requirements.
Paperid:2851
Authors:Saumya Gaurang Shah · Abishek Sankararaman · Balakrishnan Narayanaswamy · Vikramank Singh
Abstract:
Can we efficiently choose the best Anomaly Detection (AD) algorithm for a data-stream without requiring anomaly labels? Streaming anomaly detection is hard. SOTA AD algorithms are sensitive to their hyperparameters and no single method works well on all datasets. The best algorithm/hyper-parameter combination for a given data-stream can change over time with data drift. 'What is an anomaly?' is often application, context and dataset dependent. We propose \mymethod (Streaming Ensemble of Anomaly Detectors), the first model selection algorithm for streaming, unsupervised AD. All prior AD model selection algorithms are either supervised, or only work in the offline setting when all data from the test set is available upfront. We show that \mymethod is {\em(i)} unsupervised, i.e., requires no true anomaly labels, {\em(ii)} efficiently implementable in a streaming setting, {\em (iii)} agnostic to the choice of base algorithms from which it chooses, and {\em (iv)} adaptive to non-stationarity in the data-stream. Experiments on 14 non-trivial public datasets and an internal dataset corroborate our claims.
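As a toy illustration of label-free model selection over a stream, the sketch below scores each candidate detector by its running disagreement with the ensemble's median anomaly score and picks the best one. This surrogate criterion and all names are assumptions made for illustration only; they are not the paper's actual selection rule.

```python
def select_detector(score_streams):
    """Pick the detector whose anomaly scores agree best with the
    ensemble consensus (the median), a label-free surrogate criterion."""
    n_det = len(score_streams)
    T = len(score_streams[0])
    loss = [0.0] * n_det
    for t in range(T):
        col = sorted(s[t] for s in score_streams)
        consensus = col[len(col) // 2]          # median score at time t
        for j in range(n_det):
            loss[j] += abs(score_streams[j][t] - consensus)
    return min(range(n_det), key=lambda j: loss[j])

# three detectors scoring the same 4-step stream; detector 1 tracks the consensus
streams = [[0.9, 0.1, 0.8, 0.2],
           [0.5, 0.4, 0.6, 0.5],
           [0.5, 0.5, 0.6, 0.4]]
best = select_detector(streams)
```

In a true streaming setting the per-detector losses would be updated online (possibly with exponential forgetting to track drift) rather than recomputed over the full history.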
Paperid:2852
Authors:KA HIM WONG · Jicheng Zhou · Jiantao Zhou · Yain-Whar Si
Abstract:
The rise of LLMs has increased concerns over source tracing and copyright protection for AIGC, highlighting the need for advanced detection technologies. Passive detection methods usually face high false positives, while active watermarking techniques using logits or sampling manipulation offer more effective protection. Existing LLM watermarking methods, though effective on unaltered content, suffer significant performance drops when the text is modified and could introduce biases that degrade LLM performance in downstream tasks. These methods fail to achieve an optimal tradeoff between text quality and robustness, particularly due to the lack of end-to-end optimization of the encoder and decoder. In this paper, we introduce a novel end-to-end logits perturbation method for watermarking LLM-generated text. Through joint optimization, our approach achieves a better balance between quality and robustness. To address non-differentiable operations in the end-to-end training pipeline, we introduce an online prompting technique that leverages the on-the-fly LLM as a differentiable surrogate. Our method achieves superior robustness, outperforming distortion-free methods by 37–39% under paraphrasing and 17.2% on average, while maintaining text quality on par with these distortion-free methods in terms of text perplexity and downstream tasks. Our method can be easily generalized to different LLMs. The code is provided in the supplementary material.
Paperid:2853
Authors:Yibo Xu · Dawei Zhou · Decheng Liu · Nannan Wang
Abstract:
Deep neural networks are found to be vulnerable to adversarial noise. The prompt-based defense has been increasingly studied due to its high efficiency. However, existing prompt-based defenses mainly exploited mixed prompt patterns, where critical patterns closely related to object semantics lack sufficient focus. The phase and amplitude spectra have been proven to be highly related to specific semantic patterns and crucial for robustness. To this end, in this paper, we propose a Phase and Amplitude-aware Prompting (PAP) defense. Specifically, we construct phase-level and amplitude-level prompts for each class, and adjust weights for prompting according to the model's robust performance under these prompts during training. During testing, we select prompts for each image using its predicted label to obtain the prompted image, which is inputted to the model to get the final prediction. Experimental results demonstrate the effectiveness of our method.
Paperid:2854
Authors:Yeseul Cho · Baekrok Shin · Changmin Kang · Chulhee Yun
Abstract:
Recent advances in deep learning rely heavily on massive datasets, leading to substantial storage and training costs. Dataset pruning aims to alleviate this demand by discarding redundant examples. However, many existing methods require training a model on the full dataset for a large number of epochs before being able to prune the dataset, which ironically makes the pruning process more expensive than just training the model on the entire dataset. To overcome this limitation, we introduce a Difficulty and Uncertainty-Aware Lightweight (DUAL) score, which aims to identify important samples from the early training stage by considering both example difficulty and prediction uncertainty. To address the catastrophic accuracy drop at extreme pruning ratios, we further propose pruning-ratio-adaptive sampling using a Beta distribution. Experiments on various datasets and learning scenarios, such as image classification with label noise and image corruption, and model-architecture generalization, demonstrate the superiority of our method over previous state-of-the-art (SOTA) approaches. Specifically, on ImageNet-1k, our method reduces the time cost for pruning to 66\% compared to previous methods while achieving a SOTA 60\% test accuracy at a 90\% pruning ratio. On CIFAR datasets, the time cost is reduced to just 15\% while maintaining SOTA performance.
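As a hedged illustration of scoring samples by difficulty and uncertainty from early-training signals, the sketch below combines the average shortfall of the true-label probability (difficulty) with its fluctuation across epochs (uncertainty). The exact combination used by the DUAL score may differ; the names here are illustrative.

```python
import math

def dual_style_score(p_true_history):
    """Illustrative difficulty-and-uncertainty score from a sample's
    predicted probability of its true label over early training epochs.
    difficulty  : how far predictions stay from the true label on average
    uncertainty : how much the prediction fluctuates between epochs"""
    n = len(p_true_history)
    mean_p = sum(p_true_history) / n
    difficulty = 1.0 - mean_p
    var = sum((p - mean_p) ** 2 for p in p_true_history) / n
    uncertainty = math.sqrt(var)
    return difficulty * uncertainty

easy = dual_style_score([0.95, 0.97, 0.98])   # confident and stable -> low score
hard = dual_style_score([0.2, 0.8, 0.3])      # wavering prediction  -> high score
```

Pruning would then keep the high-scoring (informative) samples, with the keep-ratio adapted to the target pruning ratio, e.g. via Beta-distribution sampling as the abstract describes.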
Paperid:2855
Authors:Tianci Bu · Le Zhou · Wenchuan Yang · Jianhong Mou · Kang Yang · Suoyi Tan · Feng Yao · Jingyuan Wang · Xin Lu
Abstract:
Trajectory data is crucial for various applications but often suffers from incompleteness due to device limitations and diverse collection scenarios. Existing imputation methods rely on sparse trajectory or travel information, such as velocity, to infer missing points. However, these approaches assume that sparse trajectories retain essential behavioral patterns, which places significant demands on data acquisition and overlooks the potential of large-scale human trajectory embeddings. To address this, we propose ProDiff, a trajectory imputation framework that uses only two endpoints as minimal information. It integrates prototype learning to embed human movement patterns and a denoising diffusion probabilistic model for robust spatiotemporal reconstruction. Joint training with a tailored loss function ensures effective imputation. ProDiff outperforms state-of-the-art methods, improving accuracy by 6.28\% on FourSquare and 2.52\% on WuXi. Further analysis shows a 0.927 correlation between generated and real trajectories, demonstrating the effectiveness of our approach.
Paperid:2856
Authors:Masatoshi Uehara · su · Yulai Zhao · Xiner Li · Aviv Regev · Shuiwang Ji · Sergey Levine · Tommaso Biancalani
Abstract:
To fully leverage the capabilities of diffusion models, we are often interested in optimizing downstream reward functions during inference. While numerous algorithms for reward-guided generation have been recently proposed due to their significance, current approaches predominantly focus on single-shot generation, transitioning from fully noised to denoised states. We propose a novel framework for inference-time reward optimization with diffusion models. Our approach employs an iterative refinement process consisting of two steps in each iteration: noising and reward-guided denoising. This sequential refinement allows for the gradual correction of errors introduced during reward optimization. Finally, we provide a theoretical guarantee for our framework and demonstrate its superior empirical performance in protein and DNA design.
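The iterative noising / reward-guided-denoising loop can be caricatured in one dimension: each iteration perturbs the sample, then nudges it toward higher reward. This is a minimal sketch under toy assumptions (a quadratic reward and a finite-difference stand-in for the guided denoiser), not the proposed algorithm itself.

```python
import random

def refine(x, reward, steps=50, noise=0.3, lr=0.5, seed=0):
    """Toy sketch of iterative noising + reward-guided denoising in 1-D.
    Each iteration perturbs the sample (noising), then moves it toward
    higher reward (a stand-in for reward-guided denoising)."""
    rng = random.Random(seed)
    eps = 1e-3
    for _ in range(steps):
        x = x + rng.gauss(0.0, noise)                 # noising step
        grad = (reward(x + eps) - reward(x - eps)) / (2 * eps)
        x = x + lr * grad                             # reward-guided denoising
    return x

reward = lambda x: -(x - 2.0) ** 2    # toy reward peaked at x = 2
x_final = refine(x=-5.0, reward=reward)
```

The repeated noise injections are what distinguish this from single-shot guidance: errors made in one denoising step can be corrected in later iterations.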
Paperid:2857
Authors:Zhaoyi Zhou · Yuda Song · Andrea Zanette
Abstract:
Accurately evaluating large language models (LLMs) is crucial for ensuring their reliability and safety. Among various approaches, human evaluation is the gold standard, especially when we need to capture nuanced qualities like coherence, readability and alignment with human expectations. However, such evaluations are often resource-intensive and costly, posing challenges for scalability. To reduce the cost, previous academic benchmarks employ other LLMs to create synthetic feedback, introducing bias into the evaluation. Towards accurate and economical LLM evaluation, in this work, we introduce a statistically-inspired estimator that requires a reduced number of human annotations via synthetic feedback. Our experiments demonstrate a reduction in human annotations by up to $12.2$\% with an off-the-shelf synthetic evaluator, and up to $24.8$\% with a fine-tuned variant. Apart from being general and scalable, our proposed method enjoys predictable saving performance.
Paperid:2858
Authors:Edoardo Botta · Yuchen Li · Aashay Mehta · Jordan Ash · Cyril Zhang · Andrej Risteski
Abstract:
Recently, a plethora of works have proposed inference-time algorithms (e.g. best-of-n), which incorporate verifiers to assist the generation process. Their quality-efficiency trade-offs have been empirically benchmarked on a variety of constrained generation tasks, but the algorithmic design landscape is still largely poorly understood. In this paper, we develop a mathematical framework for reasoning about constrained generation using a pre-trained language model generator oracle and a process verifier, which can decide whether a prefix can be extended to a string which satisfies the constraints of choice. We show that even in very simple settings, access to a verifier can render an intractable problem (information-theoretically or computationally) tractable. In fact, we show even simple algorithms, like tokenwise rejection sampling, can enjoy significant benefits from access to a verifier. Empirically, we show that a natural modification of tokenwise rejection sampling, in which the sampler is allowed to "backtrack" (i.e., erase the final few generated tokens), has robust and substantive benefits over natural baselines (e.g. (blockwise) rejection sampling, nucleus sampling), both in terms of computational efficiency, accuracy and diversity.
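The backtracking variant of tokenwise rejection sampling is simple enough to sketch directly: sample a token, ask the process verifier whether the prefix is still acceptable, and on rejection erase the last few tokens rather than only the newest one. The toy constraint, parameter names, and budget below are illustrative assumptions, not the paper's exact setup.

```python
import random

def backtracking_sampler(propose, verifier, max_len,
                         backtrack=2, seed=0, max_tries=1000):
    """Sketch of tokenwise rejection sampling with backtracking: sample a
    token, check the extended prefix with the process verifier; on
    rejection, erase the last few tokens instead of only the newest one."""
    rng = random.Random(seed)
    prefix, tries = [], 0
    while len(prefix) < max_len and tries < max_tries:
        tries += 1
        cand = prefix + [propose(prefix, rng)]
        if verifier(cand):
            prefix = cand
        else:
            prefix = prefix[:max(0, len(prefix) - backtrack)]  # backtrack
    return prefix

# toy constraint: binary strings with no two consecutive 1s
verifier = lambda s: all(not (a == 1 and b == 1) for a, b in zip(s, s[1:]))
propose = lambda prefix, rng: rng.choice([0, 1])
out = backtracking_sampler(propose, verifier, max_len=8)
```

Since the prefix only ever grows through verifier-accepted candidates, any returned string satisfies the constraint by construction; backtracking lets the sampler escape prefixes from which many continuations would be rejected.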
Paperid:2859
Authors:Shaobin Zhuang · Yiwei Guo · Yanbo Ding · Kunchang Li · Xinyuan Chen · Yaohui Wang · Fangyikang Wang · Ying Zhang · Chen Li · Yali Wang
Abstract:
Diffusion models have driven the advancement of vision generation over the past years. However, it is often difficult to apply these large models in downstream tasks, due to massive fine-tuning cost. Recently, Low-Rank Adaptation (LoRA) has been applied for efficient tuning of diffusion models. Unfortunately, the capabilities of LoRA-tuned diffusion models are limited, since the same LoRA is used for different timesteps of the diffusion process. To tackle this problem, we introduce a general and concise TimeStep Master (TSM) paradigm with two key fine-tuning stages. In the fostering stage (1-stage), we apply different LoRAs to fine-tune the diffusion model at different timestep intervals. This results in different TimeStep LoRA experts that can effectively capture different noise levels. In the assembling stage (2-stage), we design a novel asymmetrical mixture of TimeStep LoRA experts, via core-context collaboration of experts at multi-scale intervals. For each timestep, we leverage TimeStep LoRA expert within the smallest interval as the core expert without gating, and use experts within the bigger intervals as the context experts with time-dependent gating. Consequently, our TSM can effectively model the noise level via the expert in the finest interval, and adaptively integrate contexts from the experts of other scales, boosting the versatility of diffusion models. To show the effectiveness of our TSM paradigm, we conduct extensive experiments on three typical and popular LoRA-related tasks of diffusion models, including domain adaptation, post-pretraining, and model distillation. Our TSM achieves the state-of-the-art results on all these tasks, throughout various model structures (UNet, DiT and MM-DiT) and visual data modalities (Image, Video), showing its remarkable generalization capacity.
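The interval-based expert routing can be sketched as follows: each scale partitions the T diffusion timesteps into equal intervals, the finest scale supplies the core expert, and coarser scales supply context experts. The id scheme below is a hypothetical simplification (the time-dependent gating is not shown).

```python
def route_experts(t, T, scales):
    """For diffusion timestep t in [0, T), return one expert id per scale:
    the finest scale gives the core expert, coarser scales give context
    experts.  An expert id is (num_intervals, interval_index)."""
    ids = []
    for n in scales:                      # e.g. scales = (8, 4, 1)
        idx = min(t * n // T, n - 1)      # which of the n intervals holds t
        ids.append((n, idx))
    return ids

# timestep 300 of 1000, with scales of 8, 4, and 1 intervals
core, *context = route_experts(t=300, T=1000, scales=(8, 4, 1))
```

During inference the core expert's LoRA would be applied directly, while the context experts' outputs would be blended in through their learned gates.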
Paperid:2860
Authors:chang liu · Jiangrong Shen · Xuming Ran · Mingkun Xu · Qi Xu · Yi Xu · Gang Pan
Abstract:
Artificial neural networks (ANNs) have demonstrated outstanding performance in numerous tasks, but deployment in resource-constrained environments remains a challenge due to their high computational and memory requirements. Spiking neural networks (SNNs) operate through discrete spike events and offer superior energy efficiency, providing a bio-inspired alternative. However, current ANN-to-SNN conversion often results in significant accuracy loss and increased inference time due to conversion errors such as clipping, quantization, and uneven activation. This paper proposes a novel ANN-to-SNN conversion framework based on error compensation learning. We introduce a learnable threshold clipping function, dual-threshold neurons, and an optimized membrane potential initialization strategy to mitigate the conversion error. Together, these techniques address the clipping error through adaptive thresholds, dynamically reduce the quantization error through dual-threshold neurons, and minimize the non-uniformity error by effectively managing the membrane potential. Experimental results on the CIFAR-10, CIFAR-100, and ImageNet datasets show that our method achieves high precision and ultra-low latency among existing conversion methods. Using only two time steps, our method significantly reduces the inference time while maintaining a competitive accuracy of 94.75% on the CIFAR-10 dataset with a ResNet-18 structure. This research promotes the practical application of SNNs on low-power hardware, making efficient real-time processing possible.
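The relation between a spiking neuron's firing rate and a clipped ANN activation, which underlies the clipping error discussed above, can be checked with a toy integrate-and-fire simulation. This is a generic sketch of rate-based conversion, not the paper's dual-threshold neuron.

```python
def if_neuron_rate(inp, threshold, T):
    """Integrate-and-fire neuron simulated for T time steps with a constant
    input current; returns the firing rate rescaled to activation units.
    The result approximates the ANN activation clipped to [0, threshold]."""
    v, spikes = 0.0, 0
    for _ in range(T):
        v += inp
        if v >= threshold:
            spikes += 1
            v -= threshold          # soft reset preserves residual charge
    return spikes / T * threshold   # rescale rate back to activation units

rate = if_neuron_rate(inp=0.25, threshold=1.0, T=100)       # below threshold
rate_clip = if_neuron_rate(inp=1.5, threshold=1.0, T=100)   # above threshold
```

An input below the threshold is recovered exactly by the rate code, while an input above it saturates at the threshold: this saturation is exactly the clipping error that a learnable threshold aims to control.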
Paperid:2861
Authors:Thomas Rupf · Marco Bagatella · Nico Gürtler · Jonas Frey · Georg Martius
Abstract:
Zero-shot imitation learning algorithms hold the promise of reproducing unseen behavior from as little as a single demonstration at test time. Existing practical approaches view the expert demonstration as a sequence of goals, enabling imitation with a high-level goal selector, and a low-level goal-conditioned policy. However, this framework can suffer from myopic behavior: the agent's immediate actions towards achieving individual goals may undermine long-term objectives. We introduce a novel method that mitigates this issue by directly optimizing the occupancy matching objective that is intrinsic to imitation learning. We propose to lift a goal-conditioned value function to a distance between occupancies, which are in turn approximated via a learned world model. The resulting method can learn from offline, suboptimal data, and is capable of non-myopic, zero-shot imitation, as we demonstrate in complex, continuous benchmarks.
Paperid:2862
Authors:Zhicheng Zhang · Wuyou Xia · Chenxi Zhao · Zhou Yan · Xiaoqiang Liu · Yongjie Zhu · Wenyu Qin · Pengfei Wan · Di ZHANG · Jufeng Yang
Abstract:
Multimodal large language models (MLLMs) recently showed strong capacity in integrating data among multiple modalities, empowered by a generalizable attention architecture. Advanced methods predominantly focus on language-centric tuning, with less exploration of how multimodal tokens are mixed through attention, posing challenges in high-level tasks that require fine-grained cognition and emotion understanding. In this work, we identify the attention deficit disorder problem in multimodal learning, caused by inconsistent cross-modal attention and layer-by-layer decayed attention activation. To address this, we propose a novel attention mechanism, termed MOdular Duplex Attention (MODA), simultaneously conducting inner-modal refinement and inter-modal interaction. MODA employs a correct-after-align strategy to effectively decouple modality alignment from cross-layer token mixing. In the alignment phase, tokens are mapped to duplex modality spaces based on the basis vectors, enabling interaction between the visual and language modalities. Further, the correctness of attention scores is ensured through adaptive masked attention, which enhances the model's flexibility by allowing customizable masking patterns for different modalities. Extensive experiments on 21 benchmark datasets verify the effectiveness of MODA in perception, cognition, and emotion tasks.
Paperid:2863
Authors:Yichen Li · Yuying Wang · Haozhao Wang · Yining Qi · Tianzhe Xiao · Ruixuan Li
Abstract:
Continual Federated Learning (CFL) allows distributed devices to collaboratively learn novel concepts from continuously shifting training data while avoiding \textit{knowledge forgetting} of previously seen tasks. To tackle this challenge, most current CFL approaches rely on extensive rehearsal of previous data. Despite effectiveness, rehearsal comes at a cost to memory, and it may also violate data privacy. Considering these, we seek to apply regularization techniques to CFL by considering their cost-efficient properties that do not require sample caching or rehearsal. Specifically, we first apply traditional regularization techniques to CFL and observe that existing regularization techniques, especially synaptic intelligence, can achieve promising results under homogeneous data distribution but fail when the data is heterogeneous. Based on this observation, we propose a simple yet effective regularization algorithm for CFL named \textbf{FedSSI}, which tailors the synaptic intelligence for the CFL with heterogeneous data settings. FedSSI can not only reduce computational overhead without rehearsal but also address the data heterogeneity issue. Extensive experiments show that FedSSI achieves superior performance compared to state-of-the-art methods.
Paperid:2864
Authors:Shengbin Ye · Meng Li
Abstract:
Symbolic regression (SR) is a powerful technique for discovering symbolic expressions that characterize nonlinear relationships in data, gaining increasing attention for its interpretability, compactness, and robustness. However, existing SR methods do not scale to datasets with a large number of input variables (referred to as extreme-scale SR), which is common in modern scientific applications. This "large $p$'' setting, often accompanied by measurement error, leads to slow performance of SR methods and overly complex expressions that are difficult to interpret. To address this scalability challenge, we propose a method called PAN+SR, which combines a key idea of ab initio nonparametric variable selection with SR to efficiently pre-screen large input spaces and reduce search complexity while maintaining accuracy. The use of nonparametric methods eliminates model misspecification, supporting a strategy called parametric-assisted nonparametric (PAN). We also extend SRBench, an open-source benchmarking platform, by incorporating high-dimensional regression problems with various signal-to-noise ratios. Our results demonstrate that PAN+SR consistently enhances the performance of 19 contemporary SR methods, enabling several to achieve state-of-the-art performance on these challenging datasets.
Paperid:2865
Authors:Alexander Kolesov · S. Manukhov · Vladimir Palyulin · Aleksandr Korotin
Abstract:
We propose Electrostatic Field Matching (EFM), a novel method that is suitable for both generative modelling and distribution transfer tasks. Our approach is inspired by the physics of an electrical capacitor. We place source and target distributions on the capacitor plates and assign them positive and negative charges, respectively. We then learn the capacitor's electrostatic field using a neural network approximator. To map the distributions to each other, we start at one plate of the capacitor and move the samples along the learned electrostatic field lines until they reach the other plate. We theoretically justify that this approach provably yields the distribution transfer. In practice, we demonstrate the performance of our EFM in toy and image data experiments.
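The transport step of EFM amounts to integrating samples along field lines until they reach the target plate. The sketch below does this with Euler steps in a uniform parallel-plate field standing in for the learned neural field; all names and the toy geometry are illustrative assumptions.

```python
def transport(x, field, step=0.01, max_iter=10000, target=1.0):
    """Euler integration of a 2-D sample along a given electrostatic field
    until it reaches the target plate at height `target`.  In EFM the field
    would be a learned neural approximator; here it is the uniform field of
    an idealized parallel-plate capacitor."""
    for _ in range(max_iter):
        if x[1] >= target:
            break                          # reached the target plate
        e = field(x)
        x = [x[0] + step * e[0], x[1] + step * e[1]]
    return x

# plates at y = 0 (source, positive) and y = 1 (target, negative)
uniform_field = lambda x: (0.0, 1.0)
moved = transport([0.25, 0.0], uniform_field)
```

With a learned, non-uniform field the same integration loop bends samples along curved field lines, which is what maps one distribution onto the other.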
Paperid:2866
Authors:Cheng Huang · Pan Mu · Cong Bai · Peter Watson
Abstract:
Deep learning methods have made significant progress in regular rainfall forecasting, yet the more hazardous tropical cyclone (TC) rainfall has not received the same attention. While regular rainfall models can offer valuable insights for designing TC rainfall forecasting models, most existing methods suffer from cumulative errors and lack physical consistency. Additionally, these methods overlook the importance of meteorological factors in TC rainfall and their integration with the numerical weather prediction (NWP) model. To address these issues, we propose Tropical Cyclone Precipitation Diffusion (TCP-Diffusion), a multi-modal model for forecasting TC precipitation given an existing TC in any location globally. It forecasts rainfall around the TC center for the next 12 hours at 3-hourly resolution based on past rainfall observations and multi-modal environmental variables. Adjacent residual prediction (ARP) changes the training target from the absolute rainfall value to the rainfall trend and gives our model the capability of rainfall change awareness, reducing cumulative errors and ensuring physical consistency. Considering the influence of TC-related meteorological factors and the useful information from NWP model forecasts, we propose a multi-model framework with specialized encoders to extract richer information from environmental variables and results provided by NWP models. The results of extensive experiments show that our method outperforms other DL methods and the NWP method from the European Centre for Medium-Range Weather Forecasts (ECMWF).
Paperid:2867
Authors:Jingyao Wang · Siyu Zhao · Wenwen Qiang · Jiangmeng Li · Changwen Zheng · Fuchun Sun · Hui Xiong
Abstract:
Multi-Modal Learning (MML) aims to learn effective representations across modalities for accurate predictions. Existing methods typically focus on modality consistency and specificity to learn effective representations. However, from a causal perspective, they may lead to representations that contain insufficient and unnecessary information. To address this, we propose that effective MML representations should be causally sufficient and necessary. Considering practical issues like spurious correlations and modality conflicts, we relax the exogeneity and monotonicity assumptions prevalent in prior works and explore the concepts specific to MML, i.e., Causal Complete Cause ($C^3$). We begin by defining $C^3$, which quantifies the probability of representations being causally sufficient and necessary. We then discuss the identifiability of $C^3$ and introduce an instrumental variable to support identifying $C^3$ with non-exogeneity and non-monotonicity. Building on this, we conduct the $C^3$ measurement, i.e., $C^3$ risk. We propose a twin network to estimate it through (i) the real-world branch: utilizing the instrumental variable for sufficiency, and (ii) the hypothetical-world branch: applying gradient-based counterfactual modeling for necessity. Theoretical analyses confirm its reliability. Based on these results, we propose $C^3$ Regularization, a plug-and-play method that enforces the causal completeness of the learned representations by minimizing $C^3$ risk. Extensive experiments demonstrate its effectiveness.
Paperid:2868
Authors:Zhihua Liu · Amrutha Saseendran · Lei Tong · Xilin He · Fariba Yousefi · Nikolay Burlutskiy · Dino Oglic · Tom Diethe · Philip Teare · Huiyu Zhou · Chen Jin
Abstract:
Open-set image segmentation poses a significant challenge because existing methods often demand extensive training or fine-tuning and generally struggle to segment unified objects consistently across diverse text reference expressions. Motivated by this, we propose Segment Anyword, a novel training-free visual concept prompt learning approach for open-set language grounded segmentation that relies on token-level cross-attention maps from a frozen diffusion model to produce segmentation surrogates or mask prompts, which are then refined into targeted object masks. Initial prompts typically lack coherence and consistency as the complexity of the image-text increases, resulting in suboptimal mask fragments. To tackle this issue, we further introduce a novel linguistic-guided visual prompt regularization that binds and clusters visual prompts based on sentence dependency and syntactic structural information, enabling the extraction of robust, noise-tolerant mask prompts, and significant improvements in segmentation accuracy. The proposed approach is effective, generalizes across different open-set segmentation tasks, and achieves state-of-the-art results of 52.5 (+6.8 relative) mIoU on Pascal Context 59, 67.73 (+25.73 relative) cIoU on gRefCOCO, and 67.4 (+1.1 relative to fine-tuned methods) mIoU on GranDf, which is the most complex open-set grounded segmentation task in the field.
Paperid:2869
Authors:Jiashu HE · Mingyu Ma · Jinxuan Fan · Dan Roth · Wei Wang · Alejandro Ribeiro
Abstract:
Existing approaches based on context prompting or reinforcement learning (RL) to improve the reasoning capacities of large language models (LLMs) depend on the LLMs' internal knowledge to produce reliable Chain-of-Thought (CoT). However, no matter the size of LLMs, certain problems cannot be resolved in a single forward pass. Meanwhile, agent-based reasoning systems require access to a comprehensive non-parametric knowledge base, which is often costly or not feasible for use in scientific and niche domains. We present Graph Inspired Veracity Extrapolation (GIVE), a novel reasoning method that merges parametric and non-parametric memories to improve accurate reasoning with minimal external input. GIVE guides the LLM agent to select the most pertinent expert data ($\textbf{observe}$), engage in query-specific divergent thinking ($\textbf{reflect}$), and then synthesize this information to produce the final output ($\textbf{speak}$). Extensive experiments demonstrate the following benefits of our framework: (1) GIVE boosts the performance of LLMs across various sizes. (2) In some scenarios, GIVE allows smaller LLMs to surpass larger, more sophisticated ones in scientific tasks ($\textbf{GPT3.5T + GIVE > GPT4}$). (3) GIVE is effective on scientific and open-domain assessments. (4) GIVE is a training-free method that enables LLMs to tackle new problems that extend beyond their training data (up to $\textbf{43.5\%} \rightarrow \textbf{88.2\%}$ accuracy improvement). (5) GIVE allows LLM agents to reason using both restricted (very small) and noisy (very large) knowledge sources, accommodating knowledge graphs (KG) ranging from $\textbf{135}$ to more than $\textbf{840k}$ nodes. (6) The reasoning process involved in GIVE is fully interpretable.
Paperid:2870
Authors:Zhangchi Zhao · Jun Shu · Deyu Meng · Zongben Xu
Abstract:
Inspired by the Kolmogorov-Arnold representation theorem, KANs offer a novel framework for function approximation by replacing traditional neural network weights with learnable univariate functions. This design demonstrates significant potential as an efficient and interpretable alternative to traditional MLPs. However, KANs are characterized by a substantially larger number of trainable parameters, leading to challenges in memory efficiency and higher training costs compared to MLPs. To address this limitation, we propose to generate weights for KANs via a smaller meta-learner, called MetaKANs. By training KANs and MetaKANs in an end-to-end differentiable manner, MetaKANs achieve comparable or even superior performance while significantly reducing the number of trainable parameters and maintaining promising interpretability. Extensive experiments on diverse benchmark tasks, including symbolic regression, partial differential equation solving, and image classification, demonstrate the effectiveness of MetaKANs in improving parameter efficiency and memory usage. The proposed method provides an alternative technique for training KANs that allows for greater scalability and extensibility, and narrows the training cost gap with MLPs stated in the original paper of KANs.
Paperid:2871
Authors:Wayne Chi · Valerie Chen · Anastasios Angelopoulos · Wei-Lin Chiang · Aditya Mittal · Naman Jain · Tianjun Zhang · Ion Stoica · Chris Donahue · Ameet Talwalkar
Abstract:
Evaluating in-the-wild coding capabilities of large language models (LLMs) is a challenging endeavor with no existing solution. We introduce EvalX, a platform to collect user preferences through native integration into a developer's working environment. EvalX comprises a novel interface for comparing pairs of model outputs, a sampling strategy to reduce experienced latency, and a prompting scheme to enable code completion functionality. EvalX has served over 4.5 million suggestions from 10 models and collected over 11k pairwise judgements. Our results highlight the importance of model evaluations in integrated settings. We find that model rankings from EvalX differ from those of existing evaluations, which we attribute to the unique distribution of data and tasks contained in EvalX. We also identify novel insights into human preferences on code such as an observed consistency in user preference across programming languages yet significant variation in preference due to task category. We will open-source EvalX and release data to enable human-centric evaluations and improve understanding of coding assistants.
Paperid:2872
Authors:Xuanshu Luo · Martin Werner
Abstract:
Two-dimensional (2D) convolutional kernels have dominated convolutional neural networks (CNNs) in image processing. While linearly scaling 1D convolution provides parameter efficiency, its naive integration into CNNs disrupts image locality, thereby degrading performance. This paper presents path convolution (PathConv), a novel CNN design built exclusively from 1D operations, achieving ResNet-level accuracy with only 1/3 of the parameters. To obtain locality-preserving image traversal paths, we analyze Hilbert/Z-order paths and expose a fundamental trade-off: improved proximity for most pixels comes at the cost of excessive distances from the remaining, sacrificed pixels to their neighbors. We resolve this issue by proposing path shifting, a succinct method to reposition sacrificed pixels. Using the randomized rounding algorithm, we show that three shifted paths are sufficient to offer better locality preservation than trivial raster scanning. To mitigate potential convergence issues caused by multiple paths, we design a lightweight path-aware channel attention mechanism to capture local intra-path and global inter-path dependencies. Experimental results further validate the efficacy of our method, establishing the proposed 1D PathConv as a viable backbone for efficient vision models.
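As an illustrative aside, the Z-order traversal the abstract refers to can be sketched in a few lines. This is not the paper's PathConv implementation; `morton_key` is just the standard bit-interleaving construction of a Z-order (Morton) index, shown here to make the "locality-preserving 1D path" idea concrete:

```python
def morton_key(x: int, y: int) -> int:
    """Interleave the bits of x and y to form a Z-order (Morton) index."""
    key = 0
    for i in range(16):  # supports coordinates up to 2**16
        key |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return key

def z_order_path(height: int, width: int):
    """All pixel coordinates of a height x width image, sorted along the Z-order curve."""
    return sorted(((x, y) for y in range(height) for x in range(width)),
                  key=lambda p: morton_key(p[0], p[1]))

# Flattening an 8x8 image along this path keeps most 2D neighbours close
# together in the 1D sequence -- the property a 1D convolution needs.
path = z_order_path(8, 8)
```

The trade-off the abstract describes is visible here: pixels inside each 2x2 block stay adjacent, while pixels on block boundaries can end up far apart in the 1D ordering.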
Paperid:2873
Authors:Yike Yuan · Ziyu Wang · Zihao Huang · Defa Zhu · Qiyang Min · Jingyi Yu · Xun Zhou
Abstract:
Diffusion models have emerged as a mainstream framework in visual generation. Building upon this success, the integration of Mixture of Experts (MoE) methods has shown promise in enhancing model scalability and performance. In this paper, we introduce RaceDiT, a novel MoE model for diffusion transformers with a flexible routing strategy, Expert Race. By allowing tokens and experts to compete together and select the top candidates, the model learns to dynamically assign experts to critical tokens. Additionally, we propose per-layer regularization to address challenges in shallow layer learning, and router similarity loss to prevent mode collapse, ensuring better expert utilization. Extensive experiments on ImageNet validate the effectiveness of our approach, showcasing significant performance gains while promising scaling properties.
Paperid:2874
Authors:Masanori Ishikura · Masayuki Karasuyama
Abstract:
This study considers multi-objective Bayesian optimization (MOBO) through the information gain of the Pareto-frontier. To calculate the information gain, a predictive distribution conditioned on the Pareto-frontier plays a key role, which is defined as a distribution truncated by the Pareto-frontier. However, it is usually impossible to obtain the entire Pareto-frontier in a continuous domain, and therefore, the complete truncation cannot be known. We consider an approximation of the truncated distribution by using a mixture distribution consisting of two possible approximate truncations obtainable from a subset of the Pareto-frontier, which we call over- and under-truncation. Since the optimal balance of the mixture is unknown beforehand, we propose optimizing the balancing coefficient through the variational lower bound maximization framework, by which the approximation error of the information gain can be minimized. Our empirical evaluation demonstrates the effectiveness of the proposed method, particularly when the number of objective functions is large.
Paperid:2875
Authors:Yi-Fan Zhang · Xiao Zhang · Min-Ling Zhang
Abstract:
Controllability has become a critical issue in trustworthy machine learning, as a controllable learner allows for dynamic model adaptation to task requirements during testing. However, existing research lacks a comprehensive understanding of how to effectively measure and analyze the generalization performance of controllable learning methods. In an attempt to move towards this goal from a generalization perspective, we first establish a unified framework for controllable learning. Then, we develop a novel vector-contraction inequality and derive a tight generalization bound for general controllable learning classes, which is independent of the number of task targets except for logarithmic factors and represents the current best-in-class theoretical result. Furthermore, we derive generalization bounds for two typical controllable learning methods: embedding-based and hypernetwork-based methods. We also upper bound the Rademacher complexities of commonly used control and prediction functions, which serve as modular theoretical components for deriving generalization bounds for specific controllable learning methods in practical applications such as recommender systems. Our theoretical results without strong assumptions provide general theoretical guarantees for controllable learning methods and offer new insights into understanding controllability in machine learning.
Paperid:2876
Authors:Zipeng Ji · Guanghui Zhu · Chunfeng Yuan · Yihua Huang
Abstract:
LLM-to-NAS is a promising field at the intersection of Large Language Models (LLMs) and Neural Architecture Search (NAS), as recent research has explored the potential of architecture generation leveraging LLMs on multiple search spaces. However, the existing LLM-to-NAS methods face the challenges of limited search spaces, time-costly search, and uncompetitive performance across standard NAS benchmarks and multiple downstream tasks. In this work, we propose the Reflective Zero-cost NAS (RZ-NAS) method that can search NAS architectures with humanoid reflections and training-free metrics to elicit the power of LLMs. We rethink LLMs’ roles in NAS in existing work and design a structured, prompt-based approach to comprehensively understand the search tasks and architectures from both text and code levels. By integrating LLM reflection modules, we use LLM-generated feedback to provide linguistic guidance within architecture optimization. RZ-NAS enables effective search within both micro and macro search spaces without extensive time cost, achieving SOTA performance across multiple downstream tasks.
Paperid:2877
Authors:Xinrui Wang · Shao-Yuan Li · Jiaqiang Zhang · Songcan Chen
Abstract:
Multi-Label Online Continual Learning (MOCL) requires models to learn continuously from endless multi-label data streams, facing complex challenges including persistent catastrophic forgetting, potential missing labels, and uncontrollable imbalanced class distributions. While existing MOCL methods attempt to address these challenges through various techniques, \textit{they all overlook label-specific region identification and feature learning} - a fundamental solution rooted in multi-label learning but challenging to achieve in the online setting with incremental and partial supervision. To this end, we first leverage the inherent structural information of input data to evaluate and verify the innate localization capability of different pre-trained models. Then, we propose CUTER (CUT-out-and-Experience-Replay), a simple yet versatile strategy that provides fine-grained supervision signals by further identifying, strengthening and cutting out label-specific regions for efficient experience replay. It not only enables models to simultaneously address catastrophic forgetting, missing labels, and class imbalance challenges, but also serves as an orthogonal solution that seamlessly integrates with existing approaches. Extensive experiments on multiple multi-label image benchmarks demonstrate the superiority of our proposed method.
Paperid:2878
Authors:Subham Sekhar Sahoo · Justin Deschenaux · Aaron Gokaslan · Guanghan Wang · Justin Chiu · Volodymyr Kuleshov
Abstract:
Discrete diffusion models have been demonstrated to be surprisingly strong language models. In this work, we show that discrete diffusion language models can be further improved by adapting methods from continuous-state diffusion models. We establish a core property of uniform state diffusion: it stems from an underlying Gaussian diffusion process. This property allows us to improve both training, by utilizing a curriculum learning strategy that reduces training variance and leads to $\mathbf{2\times}$ faster convergence, and sampling, by adapting efficient distillation methods from continuous-state diffusion models. As a result, our models surpass an autoregressive model’s zero-shot perplexity on 3 out of 7 benchmarks, and we manage to reduce the sampling steps by **two orders of magnitude** while preserving sample quality.
Paperid:2879
Authors:Hongliang Lu · Zhonglin Xie · Yaoyu Wu · Can Ren · Yuxuan Chen · Zaiwen Wen
Abstract:
Despite the rapid development of large language models (LLMs), a fundamental challenge persists: the lack of high-quality optimization modeling datasets hampers LLMs' robust modeling of practical optimization problems from natural language descriptions (NL). This data scarcity also contributes to the generalization difficulties experienced by learning-based methods. To address these challenges, we propose a scalable framework for synthesizing a high-quality dataset, named OptMATH. Starting from curated seed data with mathematical formulations (MF), this framework automatically generates problem data (PD) with controllable complexity. Then, a back-translation step is employed to obtain NL. To verify the correspondence between the NL and the PD, a forward modeling step followed by rejection sampling is used. The accepted pairs constitute the training part of OptMATH. Then a collection of rejected pairs is identified and further filtered. This collection serves as a new benchmark for optimization modeling, containing difficult instances whose lengths are much longer than those of NL4OPT and MAMO. Through extensive experiments, we demonstrate that models of various sizes (0.5B-32B parameters) trained on OptMATH achieve superior results on multiple modeling benchmarks, thereby validating the effectiveness and scalability of our approach.
Paperid:2880
Authors:Wenlong Wan · Weiying Zheng · Tianyi Xiang · Guiqing Li · Shengfeng He
Abstract:
We introduce the task of Audible Action Temporal Localization, aimed at pinpointing the spatiotemporal coordinates of audible movements. This task diverges from conventional ones such as action recognition and temporal action localization, which broadly analyze relationships in video content but overlook the distinct kinematic dynamics associated with actions. Our work leverages the premise that key actions are rooted in inflectional movements; for instance, collisions leading to sound are characterized by sudden changes in motion. To capture this, we propose $TA^{2}Net$, a novel architecture that employs an inflectional flow estimation technique based on motion's second derivative to independently ascertain collision timings without relying on audio input. Additionally, $TA^{2}Net$ integrates a self-supervised spatial localization strategy during training, marrying contrastive learning with spatial analysis insights. This dual approach not only boosts temporal localization accuracy but also precisely locates the sound source within video frames as a secondary capability. To bridge the dataset gap in this area, we compiled a new benchmark dataset, $Audible623$, derived from Kinetics and UCF101 datasets by excluding non-essential vocalization subsets. Extensive experimental validation affirms the efficacy of our methods within our new benchmark dataset and demonstrates their strong generalizability across diverse, unseen datasets, such as those in repetitive counting and sound source localization.
Paperid:2881
Authors:Eric Frankel · Sitan Chen · Jerry Li · Pang Wei Koh · Lillian Ratliff · Sewoong Oh
Abstract:
Diffusion models (DMs) create samples from a data distribution by starting from random noise and iteratively solving a reverse-time ordinary differential equation (ODE). Because each step in the iterative solution requires an expensive neural function evaluation (NFE), there has been significant interest in approximately solving these diffusion ODEs with only a few NFEs without modifying the underlying model. However, in the few NFE regime, we observe that tracking the true ODE evolution is fundamentally impossible using traditional ODE solvers. In this work, we propose a new method that learns a good solver for the DM, which we call **S**olving **for** the **S**olver (**S4S**). S4S directly optimizes a solver to obtain good generation quality by learning to match the output of a strong teacher solver. We evaluate S4S on six different pre-trained DMs, including pixel-space and latent-space DMs for both conditional and unconditional sampling. In all settings, S4S uniformly improves the sample quality relative to traditional ODE solvers. Moreover, our method is lightweight, data-free, and can be plugged in black-box on top of any discretization schedule or architecture to improve performance. Building on top of this, we also propose **S4S-Alt**, which optimizes both the solver and the discretization schedule. By exploiting the full design space of DM solvers, with 5 NFEs, we achieve an FID of 3.73 on CIFAR10 and 13.26 on MS-COCO, representing a $1.5\times$ improvement over previous training-free ODE methods.
Paperid:2882
Authors:Yuchen Wang · Hongjue Zhao · Haohong Lin · Enze Xu · Lifang He · Huajie Shao
Abstract:
This work aims to address the problem of long-term dynamic forecasting in complex environments where data are noisy and irregularly sampled. While recent studies have introduced some methods to improve prediction performance, these approaches still face a significant challenge in handling long-term extrapolation tasks under such complex scenarios. To overcome this challenge, we propose Phy-SSM, a general-purpose framework that integrates partial physics knowledge into state space models (SSMs) for long-term dynamics forecasting in complex environments. Our motivation is that SSMs can effectively capture long-range dependencies in sequential data and model continuous dynamical systems, while the incorporation of physics knowledge improves generalization ability. The key challenge lies in how to seamlessly incorporate partially known physics into SSMs. To achieve this, we decompose partially known system dynamics into known and unknown state matrices, which are integrated into a Phy-SSM unit. To further enhance long-term prediction performance, we introduce a physics state regularization term to make the estimated latent states align with system dynamics. Besides, we theoretically analyze the uniqueness of the solutions for our method. Extensive experiments on three real-world applications, including vehicle motion prediction, drone state prediction, and COVID-19 epidemiology forecasting, demonstrate the superior performance of Phy-SSM over the baselines in both long-term interpolation and extrapolation tasks. The source code will be publicly available upon publication.
Paperid:2883
Authors:Yue Meng · Chuchu Fan
Abstract:
Learning to solve complex tasks with signal temporal logic (STL) specifications is crucial to many real-world applications. However, most previous works only consider fixed or parametrized STL specifications due to the lack of a diverse STL dataset and encoders to effectively extract temporal logic information for downstream tasks. In this paper, we propose TeLoGraF, Temporal Logic Graph-encoded Flow, which utilizes Graph Neural Networks (GNN) encoder and flow-matching to learn solutions for general STL specifications. We identify four commonly used STL templates and collect a total of 200K specifications with paired demonstrations. We conduct extensive experiments in five simulation environments ranging from simple dynamical models in the 2D space to high-dimensional 7DoF Franka Panda robot arm and Ant quadruped navigation. Results show that our method outperforms other baselines in the STL satisfaction rate. Compared to classical STL planning algorithms, our approach is 10-100X faster in inference and can work on any system dynamics. Besides, we show our graph-encoding method's capability to solve complex STLs and robustness to out-of-distribution STL specifications.
Paperid:2884
Authors:Lukas Thede · Karsten Roth · Matthias Bethge · Zeynep Akata · Thomas Hartvigsen
Abstract:
Keeping large language models factually up-to-date is crucial for deployment, yet costly retraining remains a challenge. Knowledge editing offers a promising alternative, but methods have only been tested on small-scale or synthetic edit benchmarks. In this work, we aim to bridge research into lifelong knowledge editing to real-world edits at practically relevant scale. We first introduce \texttt{WikiBigEdit}; a large-scale benchmark of real-world Wikidata edits, built to extend automatically over time for future-proof benchmarking. In its first instance, it includes over 500K question-answer pairs for knowledge editing alongside a comprehensive evaluation pipeline. Finally, we use \texttt{WikiBigEdit} to study existing knowledge editing techniques' ability to incorporate large volumes of real-world facts and contrast their capabilities to generic modification techniques such as retrieval augmentation and continual finetuning to acquire a complete picture of the practical extent of current lifelong knowledge editing.
Paperid:2885
Authors:Aleksandr Kolomeitsev · ANH-HUY PHAN
Abstract:
Topology optimization plays a crucial role in designing efficient and manufacturable structures. Traditional methods often yield freeform voids that, although providing design flexibility, introduce significant manufacturing challenges and require extensive post-processing. Conversely, feature-mapping topology optimization reduces post-processing efforts by constructing topologies using predefined geometric features. Nevertheless, existing approaches are significantly constrained by the limited set of geometric features available, the variety of parameters that each type of geometric feature can possess, and the necessity of employing differentiable signed distance functions. In this paper, we present a novel method that combines Neural Heaviside Signed Distance Functions (Heaviside SDFs) with structured latent shape representations to generate manufacturable voids directly within the optimization framework. Our architecture incorporates encoder and decoder networks to effectively approximate the Heaviside function and facilitate optimization within a unified latent space, thus addressing the feature diversity limitations of current feature-mapping techniques. Experimental results validate the effectiveness of our approach in balancing structural compliance, offering a new pathway to CAD-integrated design with minimal human intervention.
Paperid:2886
Authors:Masha Naslidnyk · Siu Lun Chau · Francois-Xavier Briol · Krikamol Muandet
Abstract:
Embedding probability distributions into reproducing kernel Hilbert spaces (RKHS) has enabled powerful nonparametric methods such as the maximum mean discrepancy (MMD), a widely used statistical distance with strong theoretical and computational properties. At its core, MMD relies on kernel mean embeddings (KMEs) to represent distributions as a mean function in the RKHS. However, it remains unclear if the mean function is the only meaningful RKHS representation. Inspired by generalised quantiles, we introduce the notion of kernel quantile embeddings (KQEs) along with a consistent estimator. We then use KQEs to construct a novel family of distances that: (i) are probability metrics under weaker kernel conditions than MMD; (ii) recover a kernelised form of the sliced Wasserstein distance, bridging the gap between the two frameworks; and (iii) can be efficiently estimated with near-linear cost. Through hypothesis testing, we show that these distances offer a competitive alternative to MMD and its fast approximations. Our findings demonstrate the value of representing distributions in Hilbert space beyond simple mean functions, paving the way for new avenues of research.
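For context on the baseline this abstract builds on: the MMD between two samples has a standard unbiased plug-in estimator. The sketch below is a minimal pure-Python version with a Gaussian kernel on 1-D samples (illustrative only; it shows the classical MMD, not the kernel quantile embeddings proposed in the paper):

```python
import math

def gauss_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel on scalars."""
    return math.exp(-((x - y) ** 2) / (2 * bandwidth ** 2))

def mmd2_unbiased(xs, ys, bandwidth=1.0):
    """Unbiased estimate of squared MMD between two 1-D samples xs, ys."""
    n, m = len(xs), len(ys)
    # within-sample terms exclude the diagonal (i == j) for unbiasedness
    k_xx = sum(gauss_kernel(a, b, bandwidth)
               for i, a in enumerate(xs) for j, b in enumerate(xs) if i != j)
    k_yy = sum(gauss_kernel(a, b, bandwidth)
               for i, a in enumerate(ys) for j, b in enumerate(ys) if i != j)
    k_xy = sum(gauss_kernel(a, b, bandwidth) for a in xs for b in ys)
    return k_xx / (n * (n - 1)) + k_yy / (m * (m - 1)) - 2 * k_xy / (n * m)
```

Samples from well-separated distributions yield a large value, while near-identical samples yield a value near zero; KQE-based distances replace the single mean embedding underlying this estimator with a family of quantile embeddings.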
Paperid:2887
Authors:Jijia Liu · Feng Gao · Qingmin Liao · Chao Yu · Yu Wang
Abstract:
Reinforcement learning (RL) for continuous control often requires large amounts of online interaction data. Value-based RL methods can mitigate this burden by offering relatively high sample efficiency. Some studies further enhance sample efficiency by incorporating offline demonstration data to “kick-start” training, achieving promising results in continuous control. However, they typically compute the Q-function independently for each action dimension, neglecting interdependencies and making it harder to identify optimal actions when learning from suboptimal data, such as non-expert demonstration and online-collected data during the training process. To address these issues, we propose Auto-Regressive Soft Q-learning (ARSQ), a value-based RL algorithm that models Q-values in a coarse-to-fine, auto-regressive manner. First, ARSQ decomposes the continuous action space into discrete spaces in a coarse-to-fine hierarchy, enhancing sample efficiency for fine-grained continuous control tasks. Next, it auto-regressively predicts dimensional action advantages within each decision step, enabling more effective decision-making in continuous control tasks. We evaluate ARSQ on two continuous control benchmarks, RLBench and D4RL, integrating demonstration data into online training. On D4RL, which includes non-expert demonstrations, ARSQ achieves an average 1.62× performance improvement over the SOTA value-based baseline. On RLBench, which incorporates expert demonstrations, ARSQ surpasses various baselines, demonstrating its effectiveness in learning from suboptimal online-collected data.
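The coarse-to-fine decomposition of a continuous action space can be made concrete with a small sketch. This is not the ARSQ algorithm itself, only an illustration of the binning idea, with hypothetical bin counts (`coarse=4`, `fine=4`) chosen for the example:

```python
def encode(action, low=-1.0, high=1.0, coarse=4, fine=4):
    """Map a continuous action to a (coarse_bin, fine_bin) index pair."""
    action = min(max(action, low), high)          # clip to the action range
    span = (high - low) / coarse                  # width of one coarse bin
    c = min(int((action - low) / span), coarse - 1)
    within = action - (low + c * span)            # offset inside the coarse bin
    f = min(int(within / (span / fine)), fine - 1)
    return c, f

def decode(c, f, low=-1.0, high=1.0, coarse=4, fine=4):
    """Return the centre of the selected (coarse, fine) cell."""
    span = (high - low) / coarse
    return low + c * span + (f + 0.5) * (span / fine)
```

With 4 coarse and 4 fine bins, an action in [-1, 1] is recovered to within half a cell width (here 0.0625); a value head over such discrete bins, predicted auto-regressively per dimension, is the kind of structure the abstract describes.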
Paperid:2888
Authors:Xuetong Li · Danyang Huang · Hansheng Wang
Abstract:
We study in this work the problem of multi-class logistic regression with one major class and multiple rare classes, which is motivated by a real application in TikTok live stream data. The model is inspired by the two-class logistic regression model of Wang (2020) but with surprising theoretical findings, which in turn motivate new estimation methods with excellent statistical and computational efficiency. Specifically, since rigorous theoretical analysis suggests that the resulting maximum likelihood estimators of different rare classes should be asymptotically independent, we consider solving multiple pairwise two-class logistic regression problems instead of optimizing the joint log-likelihood function, which poses computational challenges in the multi-class problem; the pairwise problems are computationally much easier and can be solved in a fully parallel way. To further reduce the computation cost, a subsample-based pairwise likelihood estimator is developed by down-sampling the major class. We show rigorously that the resulting estimators can be as asymptotically efficient as the global maximum likelihood estimator under appropriate regularity conditions. Extensive simulation studies are presented to support our theoretical findings and a TikTok live stream dataset is analyzed for illustration purposes.
Paperid:2889
Authors:Laurens Devos · Timo Martens · Deniz Oruc · Hendrik Blockeel · Wannes Meert · Jesse Davis
Abstract:
Tree ensembles (e.g., gradient boosting decision trees) are often used in practice because they offer excellent predictive performance while still being easy and efficient to learn. In some contexts, it is important to additionally optimize their size: this is specifically the case when models need to have verifiable properties (verification of fairness, robustness, etc. is often exponential in the ensemble's size), or when models run on battery-powered devices (smaller ensembles consume less energy, increasing battery autonomy). For this reason, compression of tree ensembles is worth studying. This paper presents LOP, a method for compressing a given tree ensemble by pruning or entirely removing trees in it, while updating leaf predictions in such a way that predictive accuracy is mostly unaffected. Empirically, LOP achieves compression factors that are often 10 to 100 times better than that of competing methods.
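A much simpler cousin of this idea can be sketched to show what "remove trees, then update predictions" means in practice. This is not LOP (which updates individual leaf values); it greedily drops whole trees and refits only a single scalar on the remaining ensemble output, with trees represented abstractly by their per-example predictions:

```python
def greedy_prune(tree_preds, y, keep):
    """Greedily remove trees until `keep` remain, then refit a scalar multiplier.

    tree_preds[t][i] is the prediction of tree t on example i (an additive
    ensemble predicts their sum); y is the regression target.
    """
    active = list(range(len(tree_preds)))
    total = [sum(tree_preds[t][i] for t in active) for i in range(len(y))]
    while len(active) > keep:
        best_t, best_err = None, None
        for t in active:  # try removing each tree; keep the least-damaging removal
            err = sum((y[i] - (total[i] - tree_preds[t][i])) ** 2
                      for i in range(len(y)))
            if best_err is None or err < best_err:
                best_t, best_err = t, err
        active.remove(best_t)
        total = [total[i] - tree_preds[best_t][i] for i in range(len(y))]
    # least-squares refit of one scalar on the pruned ensemble's output
    num = sum(y[i] * total[i] for i in range(len(y)))
    den = sum(total[i] ** 2 for i in range(len(y))) or 1.0
    return active, num / den
```

Even this crude version illustrates why post-pruning adjustment matters: without the refit step, removing trees systematically biases the ensemble's output downward.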
Paperid:2890
Authors:Andi Han · Pierre-Louis Poirion · Akiko Takeda
Abstract:
Optimization with orthogonality constraints frequently arises in various fields such as machine learning. Riemannian optimization offers a powerful framework for solving these problems by equipping the constraint set with a Riemannian manifold structure and performing optimization intrinsically on the manifold. This approach typically involves computing a search direction in the tangent space and updating variables via a retraction operation. However, as the size of the variables increases, the computational cost of the retraction can become prohibitively high, limiting the applicability of Riemannian optimization to large-scale problems. To address this challenge and enhance scalability, we propose a novel approach that restricts each update on a random submanifold, thereby significantly reducing the per-iteration complexity. We introduce two sampling strategies for selecting the random submanifolds and theoretically analyze the convergence of the proposed methods. We provide convergence results for general nonconvex functions and functions that satisfy Riemannian Polyak–Łojasiewicz condition as well as for stochastic optimization settings. Additionally, we demonstrate how our approach can be generalized to quotient manifolds derived from the orthogonal manifold. Extensive experiments verify the benefits of the proposed method, across a wide variety of problems.
Paperid:2891
Authors:Aidan Curtis · Eric Li · Michael S Noseworthy · Nishad Gothoskar · Sachin Chitta · Hui Li · Leslie Kaelbling · Nicole Carey
Abstract:
Domain randomization in reinforcement learning is an established technique for increasing the robustness of control policies learned in simulation. By randomizing properties of the environment during training, the learned policy can be robust to uncertainty along the randomized dimensions. While the environment distribution is typically specified by hand, in this paper we investigate the problem of automatically discovering this sampling distribution via entropy-regularized reward maximization of a neural sampling distribution in the form of a normalizing flow. We show that this architecture is more flexible and results in better robustness than existing approaches to learning simple parameterized sampling distributions. We demonstrate that these policies can be used to learn robust policies for contact-rich assembly tasks. Additionally, we explore how these sampling distributions, in combination with a privileged value function, can be used for out-of-distribution detection in the context of an uncertainty-aware multi-step manipulation planner.
Paperid:2892
Authors:Arik Reuter · Tim G. J. Rudner · Vincent Fortuin · David Rügamer
Abstract:
Transformers have emerged as the dominant architecture in the field of deep learning, with a broad range of applications and remarkable in-context learning (ICL) capabilities. While not yet fully understood, ICL has already proved to be an intriguing phenomenon, allowing transformers to learn in context—without requiring further training. In this paper, we further advance the understanding of ICL by demonstrating that transformers can perform full Bayesian inference for commonly used statistical models in context. More specifically, we introduce a general framework that builds on ideas from prior fitted networks and continuous normalizing flows and enables us to infer complex posterior distributions for models such as generalized linear models and latent factor models. Extensive experiments on real-world datasets demonstrate that our ICL approach yields posterior samples that are similar in quality to state-of-the-art MCMC or variational inference methods that do not operate in context. The source code for this paper is available at https://anonymous.4open.science/r/ICLforFullBayesianInference-A8D1
Paperid:2893
Authors:Uri Sherman · Tomer Koren · Yishay Mansour
Abstract:
Modern policy optimization methods roughly follow the policy mirror descent (PMD) algorithmic template, for which there are by now numerous theoretical convergence results. However, most of these either target tabular environments, or can be applied effectively only when the class of policies being optimized over satisfies strong closure conditions, which is typically not the case when working with parametric policy classes in large-scale environments. In this work, we develop a theoretical framework for PMD for general policy classes where we replace the closure conditions with a generally weaker variational gradient dominance assumption, and obtain upper bounds on the rate of convergence to the best-in-class policy. Our main result leverages a novel notion of smoothness with respect to a local norm induced by the occupancy measure of the current policy, and casts PMD as a particular instance of smooth non-convex optimization in non-Euclidean space.
Paperid:2894
Authors:Rustem Islamov · Yarden As · Ilyas Fatkhullin
Abstract:
Federated learning faces severe communication bottlenecks due to the high dimensionality of model updates. Communication compression with contractive compressors (e.g., Top-$K$) is often preferable in practice but can degrade performance without proper handling. Error feedback (EF) mitigates such issues but has been largely restricted to smooth, unconstrained problems, limiting its real-world applicability where non-smooth objectives and safety constraints are critical. We advance our understanding of EF in the canonical non-smooth convex setting by establishing new lower complexity bounds for first-order algorithms with contractive compression. Next, we propose Safe-EF, a novel algorithm that matches our lower bound (up to a constant) while enforcing safety constraints essential for practical applications. Extending our approach to the stochastic setting, we bridge the gap between theory and practical implementation. Extensive experiments in a reinforcement learning setup, simulating distributed humanoid robot training, validate the effectiveness of Safe-EF in ensuring safety and reducing communication complexity.
Paperid:2895
Authors:Jiawen Wang · Yinda Chen · Xiaoyu Liu · che liu · Dong Liu · Jianqing Gao · Zhiwei Xiong
Abstract:
Recent works have correlated Masked Image Modeling (MIM) with consistency regularization in Unsupervised Domain Adaptation (UDA). However, they merely treat masking as a special form of deformation on the input images and neglect the theoretical analysis, which leads to a superficial understanding of masked reconstruction and insufficient exploitation of its potential in enhancing feature extraction and representation learning. In this paper, we reframe masked reconstruction as a sparse signal reconstruction problem and theoretically prove that the dual form of complementary masks possesses superior capabilities in extracting domain-agnostic image features. Based on this compelling insight, we propose MaskTwins, a simple yet effective UDA framework that integrates masked reconstruction directly into the main training pipeline. MaskTwins uncovers intrinsic structural patterns that persist across disparate domains by enforcing consistency between predictions of images masked in complementary ways, enabling domain generalization in an end-to-end manner. Extensive experiments verify the superiority of MaskTwins over baseline methods in natural and biological image segmentation. These results demonstrate the significant advantages of MaskTwins in extracting domain-invariant features without the need for separate pre-training, offering a new paradigm for domain-adaptive segmentation.
Paperid:2896
Authors:Yan Liu · Mingjie Chen · Chaojie Ji · Hao Zhang · Ruxin Wang
Abstract:
Recently, the expansion of Variance Reduction (VR) to Riemannian stochastic non-convex optimization has attracted increasing interest. Inspired by recursive momentum, we first introduce the Stochastic Recursive Variance Reduced Gradient (SRVRG) algorithm and further present the Stochastic Recursive Gradient Estimator (SRGE) in Euclidean spaces, which unifies the prevailing variance reduction estimators. We then extend SRGE to Riemannian spaces, resulting in a unified Stochastic rEcursive vaRiance reducEd gradieNt frAmework (SERENA) for Riemannian non-convex optimization. This framework includes the proposed R-SRVRG, R-SVRRM, and R-Hybrid-SGD methods, as well as other existing Riemannian VR methods. Furthermore, we establish a unified theoretical analysis for Riemannian non-convex optimization under retraction and vector transport. The IFO complexity of our proposed R-SRVRG and R-SVRRM to converge to an $\varepsilon$-accurate solution is $\mathcal{O}\left(\min\{n^{1/2}\varepsilon^{-2}, \varepsilon^{-3}\}\right)$ in the finite-sum setting and $\mathcal{O}\left(\varepsilon^{-3}\right)$ for the online case, both of which align with the lower IFO complexity bound. Experimental results indicate that the proposed algorithms surpass other existing Riemannian optimization methods.
Paperid:2897
Authors:Thalaiyasingam Ajanthan · Sameera Ramasinghe · Yan Zuo · Gil Avraham · Alexander Long
Abstract:
Pipeline Parallelism (PP) enables large neural network training on small, interconnected devices by splitting the model into multiple stages. To maximize pipeline utilization, asynchronous optimization is appealing as it offers 100% pipeline utilization by construction. However, it is inherently challenging as the weights and gradients are no longer synchronized, leading to stale (or delayed) gradients. To alleviate this, we introduce a variant of Nesterov Accelerated Gradient (NAG) for asynchronous optimization in PP. Specifically, we modify the lookahead step in NAG to effectively address the staleness in gradients. We theoretically prove that our approach converges at a sublinear rate in the presence of fixed delay in gradients. Our experiments on large-scale language modelling tasks using decoder-only architectures with up to 1B parameters demonstrate that our approach significantly outperforms existing asynchronous methods, even surpassing the synchronous baseline.
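The staleness problem the abstract describes is easy to reproduce in miniature. The sketch below (a generic illustration, not the paper's modified lookahead, whose exact form is not given in the abstract) runs standard NAG on the quadratic $f(x) = \tfrac{1}{2}x^2$ while evaluating gradients at a lookahead point from a fixed number of steps ago:

```python
# NAG with a fixed gradient delay on f(x) = 0.5 * x^2.
# A toy illustration of stale gradients in asynchronous pipeline training;
# the paper's specific lookahead modification is not reproduced here.
def delayed_nag(x0, lr=0.05, momentum=0.5, delay=1, steps=300):
    x, v = x0, 0.0
    history = []                                   # lookahead points, oldest first
    for _ in range(steps):
        history.append(x + momentum * v)           # standard NAG lookahead point
        stale = history[max(0, len(history) - 1 - delay)]
        g = stale                                  # gradient of 0.5*x^2 is x itself
        v = momentum * v - lr * g
        x = x + v
    return x
```

With a small learning rate, a fixed delay leaves the iteration stable and convergent, which matches the abstract's sublinear-convergence claim in spirit; large delays or learning rates would destabilize it.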
Paperid:2898
Authors:Jing Wang · Yu-Jie Zhang · Peng Zhao · Zhi-Hua Zhou
Abstract:
We study stochastic linear bandits with heavy-tailed noise. Two principled strategies for handling heavy-tailed noise, truncation and median-of-means, have been introduced to heavy-tailed bandits. Nonetheless, these methods rely on specific noise assumptions or bandit structures, limiting their applicability to general settings. The recent work [Huang et al., 2024] develops a soft truncation method via adaptive Huber regression to address these limitations. However, their method suffers from an undesired computational cost: it requires storing all historical data and performing a full pass over these data at each round. In this paper, we propose a \emph{one-pass} algorithm based on the online mirror descent framework. Our method updates using only current data at each round, reducing the per-round computational cost from $\mathcal{O}(t \log T)$ to $\mathcal{O}(1)$ with respect to the current round $t$ and the time horizon $T$, and achieves a near-optimal and variance-aware regret of order $\widetilde{\mathcal{O}}\big(d T^{\frac{1-\varepsilon}{2(1+\varepsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2} + d T^{\frac{1-\varepsilon}{2(1+\varepsilon)}}\big)$, where $d$ is the dimension and $\nu_t^{1+\varepsilon}$ is the $(1+\varepsilon)$-th central moment of the reward at round $t$.
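The key computational point, a per-round update that touches only the current observation, can be illustrated with a generic Huber-style stochastic gradient step for robust linear regression (a simplification, not the paper's bandit algorithm; all names are hypothetical). The Huber gradient caps the influence of any single heavy-tailed observation while keeping memory and per-round cost constant:

```python
def huber_grad(residual, tau):
    """Gradient of the Huber loss w.r.t. the residual: linear inside
    [-tau, tau], clipped outside, capping the influence of heavy-tailed noise."""
    if abs(residual) <= tau:
        return residual
    return tau if residual > 0 else -tau

def one_pass_step(theta, x, y, tau=1.0, lr=0.2):
    """One O(1)-memory update using only the current observation (x, y),
    in contrast to methods that re-fit on all stored history each round."""
    pred = sum(t * xi for t, xi in zip(theta, x))
    g = huber_grad(pred - y, tau)
    return [t - lr * g * xi for t, xi in zip(theta, x)]
```

Streaming a sequence of (x, y) pairs through `one_pass_step` recovers the underlying linear parameter even when a fraction of rewards are wild outliers, since each outlier's gradient is clipped at `tau`.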
Paperid:2899
Authors:Wangkai Li · Rui Sun · Bohao Liao · Zhaoyang Li · Tianzhu Zhang
Abstract:
Unsupervised domain adaptation (UDA) for semantic segmentation aims to transfer knowledge from a labeled source domain to an unlabeled target domain. Despite the effectiveness of self-training techniques in UDA, they struggle to learn each class in a balanced manner due to inherent class imbalance and distribution shift in both data and label space between domains. To address this issue, we propose Balanced Learning for Domain Adaptation (BLDA), a novel approach to directly assess and alleviate class bias without requiring prior knowledge about the distribution shift. First, we identify over-predicted and under-predicted classes by analyzing the distribution of predicted logits. Subsequently, we introduce a post-hoc approach to align the logits distributions across different classes using shared anchor distributions. To further consider the network's need to generate unbiased pseudo-labels during self-training, we estimate logits distributions online and incorporate logits correction terms into the loss function. Moreover, we leverage the resulting cumulative density as domain-shared structural knowledge to connect the source and target domains. Extensive experiments on two standard UDA semantic segmentation benchmarks demonstrate that BLDA consistently improves performance, especially for under-predicted classes, when integrated into various existing methods.
Paperid:2900
Authors:Maximilian Graf · Victor Thuot · Nicolas Verzelen
Abstract:
We study the problem of clustering a set of items based on bandit feedback. Each of the $n$ items is characterized by a feature vector, with a possibly large dimension $d$. The items are partitioned into two unknown groups, such that items within the same group share the same feature vector. We consider a sequential and adaptive setting in which, at each round, the learner selects one item and one feature, then observes a noisy evaluation of the item's feature. The learner's objective is to recover the correct partition of the items, while keeping the number of observations as small as possible. We provide an algorithm which relies on finding a relevant feature for the clustering task, leveraging the Sequential Halving algorithm. With probability at least $1-\delta$, we obtain an accurate recovery of the partition and derive an upper bound on the required budget. Furthermore, we derive an instance-dependent lower bound, which is tight in some relevant cases.
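Sequential Halving, the subroutine the algorithm leverages, is a standard best-arm identification procedure: the sampling budget is split across rounds, every surviving candidate is sampled equally, and the worse half is eliminated each round. A minimal sketch (generic best-arm version, not the paper's feature-selection variant):

```python
import math

def sequential_halving(n_arms, budget, pull):
    """Best-arm identification with a fixed pull budget: split the budget over
    ceil(log2 n) rounds, sample each surviving arm equally, keep the better half."""
    arms = list(range(n_arms))
    rounds = max(1, math.ceil(math.log2(n_arms)))
    for _ in range(rounds):
        t = max(1, budget // (len(arms) * rounds))    # pulls per arm this round
        means = {a: sum(pull(a) for _ in range(t)) / t for a in arms}
        arms = sorted(arms, key=means.__getitem__, reverse=True)[: max(1, len(arms) // 2)]
    return arms[0]
```

Because later rounds have fewer surviving arms, each survivor receives more pulls per round, concentrating the budget on the hardest remaining comparisons.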
Paperid:2901
Authors:Susan Liang · Dejan Markovic · Israel D. Gebru · Steven Krenn · Todd Keebler · Jacob Sandakly · Frank Yu · Samuel Hassel · Chenliang Xu · Alexander Richard
Abstract:
Binaural rendering aims to synthesize binaural audio that mimics natural hearing based on a mono audio and the locations of the speaker and listener. Although many methods have been proposed to solve this problem, they struggle with rendering quality and streamable inference. Synthesizing high-quality binaural audio that is indistinguishable from real-world recordings requires precise modeling of binaural cues, room reverb, and ambient sounds. Additionally, real-world applications demand streaming inference. To address these challenges, we propose a flow matching based streaming binaural speech synthesis framework called BinauralFlow. We consider binaural rendering to be a generation problem rather than a regression problem and design a conditional flow matching model to render high-quality audio. Moreover, we design a causal U-Net architecture that estimates the current audio frame solely based on past information to tailor generative models for streaming inference. Finally, we introduce a continuous inference pipeline incorporating streaming STFT/ISTFT operations, a buffer bank, a midpoint solver, and an early skip schedule to improve rendering continuity and speed. Quantitative and qualitative evaluations demonstrate the superiority of our method over SOTA approaches. A perceptual study further reveals that our model is nearly indistinguishable from real-world recordings, with a 42% confusion rate.
Paperid:2902
Authors:Zifan LIU · Xinran Li · Jun Zhang
Abstract:
Safe offline reinforcement learning (SORL) aims to develop policies that maximize cumulative rewards while satisfying safety constraints without the need for risky online interaction. However, existing SORL methods often struggle with the out-of-distribution (OOD) problem, leading to potentially unsafe and suboptimal policies. To address this issue, we first propose Constrained Implicit Q-learning (CIQL), a novel algorithm designed to avoid the OOD problem in SORL. In particular, CIQL expands the implicit update of reward value functions to constrained settings and then estimates cost value functions under the same implicit policy. Despite its advantages, the further performance improvement of CIQL is still hindered by the inaccurate discounted approximations of constraints. Thus, we further propose Constraint-Conditioned Implicit Q-learning (C2IQL). Building upon CIQL, C2IQL employs a cost reconstruction model to derive non-discounted cumulative costs from discounted values and incorporates a flexible, constraint-conditioned mechanism to accommodate dynamic safety constraints. Experiment results on DSRL benchmarks demonstrate the superiority of C2IQL compared to baseline methods in achieving higher rewards while guaranteeing safety constraints under different threshold conditions.
Paperid:2903
Authors:Venkatraman · Mohsin Hasan · Minsu Kim · Luca Scimeca · Marcin Sendera · Yoshua Bengio · Glen Berseth · Nikolay Malkin
Abstract:
Any well-behaved generative model over a variable $\mathbf{x}$ can be expressed as a deterministic transformation of an exogenous (*`outsourced'*) Gaussian noise variable $\mathbf{z}$: $\mathbf{x}=f_\theta(\mathbf{z})$. In such a model (e.g., a VAE, GAN, or continuous-time flow-based model), sampling of the target variable $\mathbf{x} \sim p_\theta(\mathbf{x})$ is straightforward, but sampling from a posterior distribution of the form $p(\mathbf{x}\mid\mathbf{y}) \propto p_\theta(\mathbf{x})r(\mathbf{x},\mathbf{y})$, where $r$ is a constraint function depending on an auxiliary variable $\mathbf{y}$, is generally intractable. We propose to amortize the cost of sampling from such posterior distributions with diffusion models that sample a distribution in the noise space ($\mathbf{z}$). These diffusion samplers are trained by reinforcement learning algorithms to enforce that the transformed samples $f_\theta(\mathbf{z})$ are distributed according to the posterior in the data space ($\mathbf{x}$). For many models and constraints of interest, the posterior in the noise space is smoother than the posterior in the data space, making it more amenable to such amortized inference. Our method enables conditional sampling under unconditional GAN, (H)VAE, and flow-based priors, comparing favorably both with current amortized and non-amortized inference methods. We demonstrate the proposed *outsourced diffusion sampling* in several experiments with large pretrained prior models: conditional image generation, reinforcement learning with human feedback, and protein structure generation.
Paperid:2904
Authors:Pablo Barcelo · Alexander Kozachinskiy · Tomasz Steifer
Abstract:
The notion of rank of a Boolean function has been a cornerstone in PAC learning, enabling quasi-polynomial-time learning algorithms for polynomial-size decision trees. We present a novel characterization of rank, grounded in the well-known Transformer architecture. We show that the rank of a function $f$ corresponds to the minimum number of Chain of Thought (CoT) steps required by a single-layer Transformer with hard attention to compute $f$. Based on this characterization we establish tight bounds on the number of CoT steps required for specific problems, showing that $\ell$-fold function composition necessitates exactly $\ell$ CoT steps. Furthermore, we analyze the problem of identifying the position of the $k$-th occurrence of 1 in a Boolean sequence, proving that it requires $k$ CoT steps.
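The $k$-th-occurrence problem has a natural sequential structure that mirrors the CoT bound: each step advances from the current position to the next occurrence of 1, so $k$ occurrences take exactly $k$ steps. A small sketch of that decomposition (purely illustrative, not the paper's Transformer construction):

```python
def kth_one_position(bits, k):
    """0-indexed position of the k-th occurrence of 1 in a Boolean sequence,
    computed in exactly k steps, each advancing to the next occurrence
    (mirroring one CoT step per occurrence)."""
    pos = -1
    for _ in range(k):
        pos = bits.index(1, pos + 1)   # raises ValueError if fewer than k ones
    return pos
```

Each loop iteration corresponds to one CoT step in the paper's characterization: the intermediate state (the current position) is exactly what a single hard-attention layer can carry forward between steps.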
Paperid:2905
Authors:Archiki Prasad · Weizhe Yuan · Richard Yuanzhe Pang · Jing Xu · Maryam Fazel-Zarandi · Mohit Bansal · Sainbayar Sukhbaatar · JASON WESTON · Jane Dwivedi-Yu
Abstract:
Self-alignment, whereby models learn to improve themselves without human annotation, is a rapidly growing research area. However, existing techniques often fail to improve complex reasoning tasks due to the difficulty of assigning correct rewards. An orthogonal approach that is known to improve correctness is self-consistency, a method applied at inference time based on multiple sampling in order to find the most consistent answer. In this work, we extend the self-consistency concept to help train models. We thus introduce self-consistency preference optimization (ScPO), which iteratively trains consistent answers to be preferred over inconsistent ones on unsupervised new problems. We show ScPO leads to large improvements over conventional reward model training on reasoning tasks such as GSM8K and MATH, closing the gap with supervised training with gold answers or preferences, and that combining ScPO with standard supervised learning improves results even further. On ZebraLogic, ScPO finetunes Llama-3 8B to be superior to Llama-3 70B, Gemma-2 27B, and Claude-3 Haiku.
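The self-consistency signal ScPO builds on can be sketched as majority voting over sampled answers, with the majority answer preferred over each minority answer to form training pairs. A minimal sketch (function name hypothetical; the paper's actual pipeline samples full reasoning traces from the model):

```python
from collections import Counter

def self_consistency_pairs(sampled_answers):
    """Majority (most consistent) answer plus (chosen, rejected) preference
    pairs pairing the majority answer against each minority answer.
    Ties are broken arbitrarily by Counter.most_common."""
    counts = Counter(sampled_answers)
    chosen, _ = counts.most_common(1)[0]
    pairs = [(chosen, a) for a in counts if a != chosen]
    return chosen, pairs
```

For example, five samples `["12", "12", "7", "12", "9"]` yield the consistent answer `"12"` and the pairs `("12", "7")` and `("12", "9")`, which can then feed a standard preference-optimization objective without any gold labels.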
Paperid:2906
Authors:Phillip Guo · Aaquib Syed · Abhay Sheshadri · Aidan Ewart · Gintare Karolina Dziugaite
Abstract:
Methods for knowledge editing and unlearning in large language models seek to edit or remove undesirable knowledge or capabilities without compromising general language modeling performance. This work investigates how mechanistic interpretability---which, in part, aims to identify model components (circuits) associated with specific interpretable mechanisms that make up a model capability---can improve the precision and effectiveness of editing and unlearning. We find a stark difference in unlearning and edit robustness when training components localized by different methods. We highlight an important distinction between methods that localize components based primarily on preserving outputs, and those finding high-level mechanisms with predictable intermediate states. In particular, localizing edits/unlearning to components associated with the lookup-table mechanism for factual recall 1) leads to more robust edits/unlearning across different input/output formats, and 2) resists attempts to relearn the unwanted information, while also reducing unintended side effects compared to baselines, on both a sports facts dataset and the CounterFact dataset across multiple models. We also find that certain localized edits disrupt the latent knowledge in the model more than any of the baselines, making unlearning more robust to various attacks.
Paperid:2907
Authors:Hanxun Huang · Sarah Erfani · Yige Li · Xingjun Ma · James Bailey
Abstract:
As Contrastive Language-Image Pre-training (CLIP) models are increasingly adopted for diverse downstream tasks and integrated into large vision-language models (VLMs), their susceptibility to adversarial perturbations has emerged as a critical concern. In this work, we introduce X-Transfer, a novel attack method that exposes a universal adversarial vulnerability in CLIP. X-Transfer generates a Universal Adversarial Perturbation (UAP) capable of deceiving various CLIP encoders and downstream VLMs across different samples, tasks, and domains. We refer to this property as super transferability—a single perturbation achieving cross-data, cross-domain, cross-model, and cross-task adversarial transferability simultaneously. This is achieved through surrogate scaling, a key innovation of our approach. Unlike existing methods that rely on fixed surrogate models, which are computationally intensive to scale, X-Transfer employs an efficient surrogate scaling strategy that dynamically selects a small subset of suitable surrogates from a large search space. Extensive evaluations demonstrate that X-Transfer significantly outperforms previous state-of-the-art UAP methods, establishing a new benchmark for adversarial transferability across CLIP models.
Paperid:2908
Authors:Tuomas Oikarinen · Ge Yan · Lily Weng
Abstract:
Understanding the function of individual units in a neural network is an important building block for mechanistic interpretability. This is often done by generating a simple text explanation of the behavior of individual neurons or units. For these explanations to be useful, we must understand how reliable and truthful they are. In this work we unify many existing explanation evaluation methods under one mathematical framework. This allows us to compare and contrast existing evaluation metrics, understand the evaluation pipeline with increased clarity, and apply existing statistical concepts to the evaluation. In addition, we propose two simple sanity checks on the evaluation metrics and show that many commonly used metrics fail these tests and do not change their score after massive changes to the concept labels. Based on our experimental and theoretical results, we propose guidelines that future evaluations should follow and identify a set of reliable evaluation metrics.
Paperid:2909
Authors:Jiawei Huang · Bingcong Li · Christoph Dann · Niao He
Abstract:
Sample efficiency is critical for online Reinforcement Learning from Human Feedback (RLHF). While existing works investigate sample-efficient online exploration strategies, the potential of utilizing misspecified yet relevant reward models to accelerate learning remains underexplored. This paper studies how to transfer knowledge from those imperfect reward models in online RLHF. We start by identifying a novel property of the KL-regularized RLHF objective: \emph{a policy's ability to cover the optimal policy is captured by its sub-optimality}. Building on this insight, we propose a theoretical transfer learning algorithm with provable benefits compared to standard online learning. Our approach achieves low regret in the early stage by quickly adapting to the best available source reward models without prior knowledge of their quality, and over time, it attains an $\tilde{O}(\sqrt{T})$ regret bound \emph{independent} of structural complexity measures. Inspired by our theoretical findings, we develop an empirical algorithm with improved computational efficiency, and demonstrate its effectiveness empirically in summarization tasks.
Paperid:2910
Authors:Dongya Jia · Zhuo Chen · Jiawei Chen · Chenpeng Du · Jian Wu · Jian Cong · Xiaobin Zhuang · Chumin Li · Zhen Wei · Yuping Wang · Yuxuan Wang
Abstract:
Several recent studies have attempted to autoregressively generate continuous speech representations without discrete speech tokens by combining diffusion and autoregressive models, yet they often face challenges with excessive computational loads or suboptimal outcomes. In this work, we propose Diffusion Transformer Autoregressive Modeling (DiTAR), a patch-based autoregressive framework combining a language model with a diffusion transformer. This approach significantly enhances the efficacy of autoregressive models for continuous tokens and reduces computational demands. DiTAR utilizes a divide-and-conquer strategy for patch generation, where the language model processes aggregated patch embeddings and the diffusion transformer subsequently generates the next patch based on the output of the language model. For inference, we propose defining temperature as the time point of introducing noise during the reverse diffusion ODE to balance diversity and determinism. We also show in an extensive scaling analysis that DiTAR has superb scalability. In zero-shot speech generation, DiTAR achieves state-of-the-art performance in robustness, speaker similarity, and naturalness.
Paperid:2911
Authors:Kai Yi · Kiarash Jamali · Sjors Scheres
Abstract:
The recent breakthrough of AlphaFold3 in modeling complex biomolecular interactions, including those between proteins and ligands, nucleotides, or metal ions, creates new opportunities for protein design. In so-called inverse protein folding, the objective is to find a sequence of amino acids that adopts a target protein structure. Many inverse folding methods struggle to predict sequences for complexes that contain non-protein components, and perform poorly with complexes that adopt multiple structural states. To address these challenges, we present ADFLIP (All-atom Discrete FLow matching Inverse Protein folding), a generative model based on discrete flow-matching for designing protein sequences conditioned on all-atom structural contexts. ADFLIP progressively incorporates predicted amino acid side chains as structural context during sequence generation and enables the design of dynamic protein complexes through ensemble sampling across multiple structural states. Furthermore, ADFLIP implements training-free classifier guidance sampling, which allows the incorporation of arbitrary pre-trained models to optimize the designed sequence for desired protein properties. We evaluated the performance of ADFLIP on protein complexes with small-molecule ligands, nucleotides, or metal ions, including dynamic complexes for which structure ensembles were determined by nuclear magnetic resonance (NMR). Our model achieves state-of-the-art performance in single-structure and multi-structure inverse folding tasks, demonstrating excellent potential for all-atom protein design.
Paperid:2912
Authors:Shengju Yu · Dong Zhibin · Siwei Wang · Suyuan Liu · KE LIANG · Xinwang Liu · Yue Liu · Yi Zhang
Abstract:
Current multi-view clustering (MVC) techniques generally focus only on the relationship between anchors and samples, while overlooking that between anchors. Moreover, due to the lack of data labels, the cluster order is inconsistent across views, so anchors encounter misalignment, which confuses the graph structure and disorganizes the cluster representation. Even worse, it typically introduces variance when forming the spectral embedding, degrading the stability of clustering results. In response to these concerns, in this paper we propose an MVC approach named DTP-SF-BVF. Concretely, we explicitly exploit the geometric properties between anchors via self-expression learning, and utilize a topology learning strategy to feed the captured anchor-anchor features into the anchor-sample graph so as to explore the manifold structure hidden within samples more adequately. To reduce the misalignment risk, we introduce a permutation mechanism for each view to jointly rearrange anchors according to the respective view characteristics. Besides avoiding the selection of a baseline view, this mechanism can also coordinate with the anchors in the unified framework and thereby facilitate anchor learning. Further, rather than forming the spectral embedding and then partitioning it, based on the criterion that sample-to-cluster assignments should be hard, we construct the cluster labels directly from the original samples using a binary strategy, not only preserving the data diversity but also avoiding variance. Experiments on multiple publicly available datasets confirm the effectiveness of the proposed DTP-SF-BVF method.
Paperid:2913
Authors:Biao Liu · Ning Xu · Xin Geng
Abstract:
Large Language Model (LLM) alignment aims to prevent models from producing content that misaligns with human expectations, which can lead to ethical and legal concerns. In the last few years, Reinforcement Learning from Human Feedback (RLHF) has been the most prominent method for achieving alignment. Due to challenges in stability and scalability with the RLHF stages, which arise from the complex interactions between multiple models, researchers are exploring alternative methods to achieve effects comparable to those of RLHF. However, these methods often rely on large, high-quality datasets. Although some methods consider generating additional data to expand datasets, they often treat model training and data generation as separate and static processes, overlooking the fact that these processes are highly interdependent, leading to inefficient utilization of the generated data. To deal with this problem, we propose {\ours}, i.e., Progressive Label Enhancement for LLM Alignment, a framework that dynamically adjusts the model’s training process based on the evolving quality of the generated data. Specifically, we prompt the model to generate responses for both the original query and a set of carefully designed principle-guided queries, and then utilize a dynamic threshold to determine the appropriate training approach for both responses based on their corresponding reward scores. Experimental results demonstrate the effectiveness of {\ours} compared to existing LLM alignment methods.
Paperid:2914
Authors:Yilei Chen · Aldo Pacchiano · Ioannis Paschalidis
Abstract:
We study the multiple-policy evaluation problem where we are given a set of $K$ policies and the goal is to evaluate their performance (expected total reward over a fixed horizon) to an accuracy $\epsilon$ with probability at least $1-\delta$. We propose an algorithm named CAESAR for this problem. Our approach is based on computing an approximately optimal sampling distribution and using the data sampled from it to perform the simultaneous estimation of the policy values. CAESAR has two phases. In the first phase, we produce coarse estimates of the visitation distributions of the target policies at a low order sample complexity rate that scales with $\tilde{O}(\frac{1}{\epsilon})$. In the second phase, we approximate the optimal sampling distribution and compute the importance weighting ratios for all target policies by minimizing a step-wise quadratic loss function inspired by the DualDICE \cite{nachum2019dualdice} objective. Up to low order and logarithmic terms CAESAR achieves a sample complexity $\tilde{O}\left(\frac{H^4}{\epsilon^2}\sum_{h=1}^H\max_{k\in[K]}\sum_{s,a}\frac{(d_h^{\pi^k}(s,a))^2}{\mu^*_h(s,a)}\right)$, where $d^{\pi}$ is the visitation distribution of policy $\pi$, $\mu^*$ is the optimal sampling distribution, and $H$ is the horizon.
Paperid:2915
Authors:Tong Wu · Liangxiao Jiang · Wenjun Zhang · Chaoqun Li
Abstract:
In real-world crowdsourcing scenarios, most workers annotate only a few instances, which results in a significantly sparse crowdsourced label matrix and subsequently harms the performance of label integration algorithms. A recent method called worker similarity-based label completion (WSLC) has been proven to be effective for addressing this issue. However, WSLC considers solely the correlation of the labels annotated by different workers on each individual instance, while totally ignoring the correlation of the labels annotated by different workers among similar instances. To fill this gap, we propose a novel label distribution propagation-based label completion (LDPLC) algorithm. First, we use worker similarity weighted majority voting to initialize a label distribution for each missing label. Then, we design a label distribution propagation algorithm to enable each missing label of each instance to iteratively absorb its neighbors’ label distributions. Finally, we complete each missing label based on its converged label distribution. Experimental results on both real-world and simulated crowdsourced datasets show that LDPLC significantly outperforms WSLC in enhancing the performance of label integration algorithms.
Paperid:2916
Authors:Zhun Mou · Bin Xia · Zhengchao Huang · Wenming Yang · Jiaya Jia
Abstract:
Recent great advances in video generation models have demonstrated their potential to produce high-quality videos, bringing challenges to effective evaluation. Unlike human evaluation, existing automated evaluation metrics lack high-level semantic understanding and reasoning capabilities for video, thus making them infeasible and unexplainable. To fill this gap, we curate GRADEO-Instruct, a multi-dimensional T2V evaluation instruction tuning dataset, including 3.3k videos from over 10 existing video generation models and multi-step reasoning assessments converted from 16k human annotations. We then introduce GRADEO, one of the first specifically designed video evaluation models, which grades AI-generated videos for explainable scores and assessments through multi-step reasoning. Experiments show that our method aligns better with human evaluations than existing methods. Furthermore, our benchmarking reveals that current video generation models struggle to produce content that aligns with human reasoning and complex real-world scenarios. The models, datasets, and codes will be released soon.
Paperid:2917
Authors:Bang You · Puze Liu · Huaping Liu · Jan Peters · Oleg Arenz
Abstract:
Simplicity is a powerful inductive bias. In reinforcement learning, regularization is used for simpler policies, data augmentation for simpler representations, and sparse reward functions for simpler objectives, all with the underlying motivation of increasing generalizability and robustness by focusing on the essentials. Complementary to these techniques, we investigate how to promote simple behavior throughout the duration of the episode. To that end, we introduce a modification of the reinforcement learning problem that additionally maximizes the total correlation within the induced trajectories. We propose a practical algorithm that optimizes all models, including policy and state representation, based on a lower bound approximation. In simulated robot control environments, our method naturally generates policies that induce periodic and compressible trajectories, and that exhibit superior robustness to noise and changes in dynamics compared to baseline methods, while also improving performance in the original tasks.
Paperid:2918
Authors:Jeremy McMahan
Abstract:
We extend anytime constraints to the Markov game setting and the corresponding solution concept of anytime-constrained equilibrium (ACE). Then, we present a comprehensive theory of anytime-constrained equilibria that includes (1) a computational characterization of feasible policies, (2) a fixed-parameter tractable algorithm for computing ACE, and (3) a polynomial-time algorithm for approximately computing ACE. Since computing a feasible policy is NP-hard even for two-player zero-sum games, our approximation guarantees are the best possible unless $P = NP$. We also develop the first theory of efficient computation for action-constrained Markov games, which may be of independent interest.
Paperid:2919
Authors:Konstantin Kirchheim · Frank Ortmeier
Abstract:
Out-of-distribution (OOD) detection is essential for ensuring the reliability of deep learning models operating in open-world scenarios. Current OOD detectors mainly rely on statistical models to identify unusual patterns in the latent representations of a deep neural network. This work proposes to augment existing OOD detectors with probabilistic reasoning, utilizing Markov Logic Networks (MLNs). MLNs connect first-order logic with probabilistic reasoning to assign probabilities to inputs based on weighted logical constraints defined over human-understandable concepts, which offers improved explainability. Through extensive experiments on multiple datasets, we demonstrate that MLNs can significantly enhance the performance of a wide range of existing OOD detectors while maintaining computational efficiency. Furthermore, we introduce a simple algorithm for learning logical constraints for OOD detection from a dataset and showcase its effectiveness.
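The scoring idea, assigning probabilities from weighted logical constraints over human-understandable concepts, can be sketched with invented concepts and weights (a toy, not the paper's MLN implementation or its learned constraints):

```python
import math

# Toy with invented concepts and weights (not the paper's learned
# constraints): weighted first-order rules over human-understandable
# concepts score an input; satisfying every rule yields probability 1.
constraints = [
    (2.0, lambda c: c["has_wheels"] or not c["is_car"]),    # car => wheels
    (1.5, lambda c: not (c["is_car"] and c["is_animal"])),  # exclusive labels
]

def prob_in_distribution(concepts):
    score = sum(w for w, rule in constraints if rule(concepts))
    max_score = sum(w for w, _ in constraints)
    return math.exp(score - max_score)  # 1.0 iff all rules are satisfied

ok = {"is_car": True, "has_wheels": True, "is_animal": False}
odd = {"is_car": True, "has_wheels": False, "is_animal": True}
assert prob_in_distribution(ok) == 1.0
assert prob_in_distribution(odd) < 0.1  # violated rules flag a likely OOD input
```

An input whose detected concepts violate heavily weighted rules receives a low probability, which is the signal an OOD detector can combine with its statistical score.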
Paperid:2920
Authors:Umit Basaran · Sk Miraj Ahmed · Amit Roy-Chowdhury · Basak Guler
Abstract:
With the growing adoption of machine learning and evolving data privacy regulations, the ability to erase private or copyrighted information from trained models has become a crucial requirement. Traditional unlearning methods often assume access to the complete training dataset, which is unrealistic in scenarios where the original data is no longer available. To address this challenge, we propose a certified unlearning framework that enables effective data removal without relying on the original training data. Our approach utilizes a surrogate dataset that approximates the statistical properties of the original data, allowing for controlled noise scaling based on the statistical distance between the surrogate and original data distributions. This ensures strong guarantees on the model's behavior post-unlearning while maintaining its overall utility. We establish theoretical bounds, introduce practical noise calibration techniques, and validate our method through extensive experiments on both synthetic and real-world datasets. The results demonstrate the effectiveness and reliability of our approach in privacy-sensitive settings.
Paperid:2921
Authors:Myunsoo Kim · Hayeong Lee · Seong-Woong Shim · JunHo Seo · Byung-Jun Lee
Abstract:
Intelligent agents are able to make decisions based on different levels of granularity and duration. Recent advances in skill learning have enabled the agent to solve complex, long-horizon tasks by effectively guiding the agent in choosing appropriate skills. However, the practice of using fixed-length skills can easily result in skipping valuable decision points, which ultimately limits the potential for further exploration and faster policy learning. In this work, we propose to learn a simple and effective termination condition that identifies decision points through a state-action novelty module that leverages agent experience data. Our approach, Novelty-based Decision Point Identification (NBDI), outperforms previous baselines in complex, long-horizon tasks, and remains effective even in the presence of significant variations in the environment configurations of downstream tasks, highlighting the importance of decision point identification in skill learning.
Paperid:2922
Authors:Charles O'Neill · Alim Gumran · David Klindt
Abstract:
A recent line of work has shown promise in using sparse autoencoders (SAEs) to uncover interpretable features in neural network representations. However, the simple linear-nonlinear encoding mechanism in SAEs limits their ability to perform accurate sparse inference. Using compressed sensing theory, we prove that an SAE encoder is inherently insufficient for accurate sparse inference, even in solvable cases. We then decouple encoding and decoding processes to empirically explore conditions where more sophisticated sparse inference methods outperform traditional SAE encoders. Our results reveal substantial performance gains with minimal compute increases in correct inference of sparse codes. We demonstrate this generalises to SAEs applied to large language models, where more expressive encoders achieve greater interpretability. This work opens new avenues for understanding neural network representations and analysing large language model activations.
Paperid:2923
Authors:Kyle Heuton · Frederick Muench · Shikhar Shrestha · Thomas J Stopka · Michael Hughes
Abstract:
Optimal allocation of scarce resources is a common problem for decision makers faced with choosing a limited number of locations for intervention. Spatiotemporal prediction models could make such decisions data-driven. A recent performance metric called fraction of best possible reach (BPR) measures the impact of using a model's recommended size-K subset of sites compared to the best possible top-K in hindsight. We tackle two open problems related to BPR. First, we explore how to rank all sites numerically given a probabilistic model that predicts event counts jointly across sites. Ranking via the per-site mean is suboptimal for BPR. Instead, we offer a better ranking for BPR backed by decision theory. Second, we explore how to train a probabilistic model's parameters to maximize BPR. Discrete selection of K sites implies all-zero parameter gradients which prevent standard gradient training. We overcome this barrier via advances in perturbed optimizers. We further suggest a training objective that combines likelihood with a decision-aware BPR constraint to deliver high-quality top-K rankings as well as good forecasts for all sites. We demonstrate our approach on two where-to-intervene applications: mitigating opioid-related fatal overdoses for public health and monitoring endangered wildlife.
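Why per-site means can misrank sites for BPR can be seen in a tiny Monte Carlo sketch (a hypothetical two-site toy, not the paper's model or data): with K=1, the per-sample fraction of best possible reach is the chosen site's count divided by the best site's count in hindsight, and its expectation can favor a site with a far lower mean.

```python
import numpy as np

# Toy Monte Carlo (hypothetical two-site example, not the paper's model):
# per-sample BPR with K=1 is (count at chosen site) / (count at best site
# in hindsight); its expectation can prefer the site with the lower mean.
rng = np.random.default_rng(0)
n = 1_000_000
y_a = rng.choice([0, 100], size=n).astype(float)  # mean ~50, feast-or-famine
y_b = np.ones(n)                                  # mean 1, always modest
best = np.maximum(y_a, y_b)                       # top-1 counts in hindsight

bpr_pick_a = (y_a / best).mean()                  # ~0.50
bpr_pick_b = (y_b / best).mean()                  # ~0.505
assert y_a.mean() > y_b.mean()                    # mean ranking picks site A
assert bpr_pick_b > bpr_pick_a                    # BPR ranking picks site B
```

The expectation of a ratio is not the ratio of expectations, so the hindsight-best denominator couples the ranking to the joint distribution, which is where a decision-theoretic ranking helps.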
Paperid:2924
Authors:Dong HUANG · Guangtao Zeng · Jianbo Dai · Meng Luo · Han Weng · Yuhao QING · Heming Cui · Zhijiang Guo · Jie Zhang
Abstract:
As large language models (LLMs) play an increasingly important role in code generation, enhancing both correctness and efficiency has become crucial. Current methods primarily focus on correctness, often overlooking efficiency. To address this gap, we introduce SWIFTCODE to improve both aspects by fine-tuning LLMs on a high-quality dataset comprising correct and efficient code samples. Our methodology involves leveraging multiple LLMs to generate diverse candidate code solutions for various tasks across different programming languages. We then evaluate these solutions by directly measuring their execution time and memory usage through local execution. The code solution with the lowest execution time and memory consumption is selected as the final output for each task. Experimental results demonstrate significant improvements when fine-tuning with SWIFTCODE. For instance, Qwen2.5-Coder-7B-Instruct's pass@1 score increases from 44.8\% to 57.7\%, while the average execution time for correct tasks decreases by 48.4\%. SWIFTCODE offers a scalable and effective solution for advancing AI-driven code generation, benefiting both software development and computational problem-solving.
Paperid:2925
Authors:Arun Ganesh · Ryan McKenna · Hugh B McMahan · Adam Smith · Fan Wu
Abstract:
We initiate a study of algorithms for model training with user-level differential privacy (DP), where each example may be attributed to multiple users, which we call the multi-attribution model. We first provide a carefully chosen definition of user-level DP under the multi-attribution model. Training in the multi-attribution model is facilitated by solving the contribution bounding problem, i.e. the problem of selecting a subset of the dataset for which each user is associated with a limited number of examples. We propose a greedy baseline algorithm for the contribution bounding problem. We then empirically study this algorithm for a synthetic logistic regression task and a transformer training task, including studying variants of this baseline algorithm that optimize the subset chosen using different techniques and criteria. We find that the baseline algorithm remains competitive with its variants in most settings, and build a better understanding of the practical importance of a bias-variance tradeoff inherent in solutions to the contribution bounding problem.
Paperid:2926
Authors:Anian Ruoss · Fabio Pardo · Harris Chan · Bonnie Li · Vlad Mnih · Tim Genewein
Abstract:
In this paper, we present a benchmark to pressure-test today's frontier models' multimodal decision-making capabilities in the very long-context regime (up to one million tokens) and investigate whether these models can learn from large numbers of expert demonstrations in their context. We evaluate the performance of Claude 3.5 Sonnet, Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini 2.0 Flash Experimental, GPT-4o, o1-mini, o1-preview, and o1 as policies across a battery of simple interactive decision-making tasks: playing tic-tac-toe, chess, and Atari, navigating grid worlds, solving crosswords, and controlling a simulated cheetah. We study increasing amounts of expert demonstrations in the context — from no demonstrations to 512 full episodes. Across our tasks, models rarely manage to fully reach expert performance, and often, presenting more demonstrations has little effect. Some models steadily improve with more demonstrations on a few tasks. We investigate the effect of encoding observations as text or images and the impact of chain-of-thought prompting. To help quantify the impact of other approaches and future innovations, we open source our benchmark that covers the zero-, few-, and many-shot regimes in a unified evaluation.
Paperid:2927
Authors:Shiwei Li · Xiandi Luo · Xing Tang · Haozhao Wang · Hao Chen · weihongluo · Yuhua Li · xiuqiang He · Ruixuan Li
Abstract:
Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method. In standard LoRA layers, one of the matrices, $A$ or $B$, is initialized to zero, ensuring that fine-tuning starts from the pretrained model. However, there is no theoretical support for this practice. In this paper, we investigate the impact of non-zero initialization on LoRA's fine-tuning dynamics from an infinite-width perspective. Our analysis reveals that, compared to zero initialization, simultaneously initializing $A$ and $B$ to non-zero values improves LoRA's robustness to suboptimal learning rates, particularly smaller ones. Further analysis indicates that although the non-zero initialization of $AB$ introduces noise into the pretrained model, it generally does not affect fine-tuning performance. In other words, fine-tuning does not need to strictly start from the pretrained model. The validity of our findings is confirmed through extensive experiments across various models and datasets. The code is available in an anonymous repository at https://anonymous.4open.science/r/non_zero_lora.
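The initialization at issue can be illustrated in a minimal numpy sketch (invented shapes and scales, not the paper's code): with $B = 0$ the LoRA layer reproduces the pretrained output exactly at step zero, while non-zero $A$ and $B$ start fine-tuning from a slightly perturbed model.

```python
import numpy as np

# Minimal LoRA forward sketch, h = W x + B (A x), with invented shapes.
# Conventional init sets B = 0 so fine-tuning starts exactly at the
# pretrained model; the paper studies non-zero init of both A and B.
rng = np.random.default_rng(0)
d, r = 8, 2
W = rng.standard_normal((d, d))                 # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.1           # down-projection
B_zero = np.zeros((d, r))                       # conventional zero init
B_nonzero = rng.standard_normal((d, r)) * 0.1   # the variant analyzed

x = rng.standard_normal(d)
h_pretrained = W @ x
h_zero = W @ x + B_zero @ (A @ x)        # identical to the pretrained output
h_nonzero = W @ x + B_nonzero @ (A @ x)  # pretrained output plus small "noise"

assert np.allclose(h_zero, h_pretrained)
assert np.abs(h_nonzero - h_pretrained).max() > 1e-6
```

The small offset introduced by non-zero $AB$ is the "noise" the abstract refers to, which the analysis argues is generally harmless to fine-tuning.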
Paperid:2928
Authors:Haoyang Jiang · Jindong Wang · Xingquan Zhu · Yi He
Abstract:
Graph Neural Networks (GNNs) often struggle to preserve high-frequency components of nodal signals when dealing with directed graphs. Such components are crucial for modeling flow dynamics, without which a traditional GNN tends to treat a graph with forward and reverse topologies equally. To make GNNs sensitive to those high-frequency components, and thus capable of capturing detailed topological differences, this paper proposes a novel framework that combines 1) explicit difference matrices that model directional gradients and 2) implicit physical constraints that enforce message passing within GNNs to be consistent with natural laws. Evaluations on two real-world directed graph datasets, namely a water flux network and an urban traffic flow network, demonstrate the effectiveness of our proposal.
Paperid:2929
Authors:Tianze Wang · Dongnan Gui · Yifan Hu · Shuhang Lin · Linjun Zhang
Abstract:
Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning large language models (LLMs). Yet its reliance on a singular reward model often overlooks the diversity of human preferences. Recent approaches address this limitation by leveraging multi-dimensional feedback to fine-tune corresponding reward models and train LLMs using reinforcement learning. However, the process is costly and unstable, especially given the competing and heterogeneous nature of human preferences. In this paper, we propose Mixing Preference Optimization (MPO), a post-processing framework for aggregating single-objective policies as an alternative to both multi-objective RLHF (MORLHF) and MaxMin-RLHF. MPO avoids alignment from scratch. Instead, it log-linearly combines existing policies into a unified one with the weight of each policy computed via a batch stochastic mirror descent. Empirical results demonstrate that MPO achieves balanced performance across diverse preferences, outperforming or matching existing models with significantly reduced computational costs.
Paperid:2930
Authors:Lifan Yuan · Wendi Li · Huayu Chen · Ganqu Cui · Ning Ding · Kaiyan Zhang · Bowen Zhou · Zhiyuan Liu · Hao Peng
Abstract:
Different from its counterpart outcome reward models (ORMs), which evaluate the entire responses, a process reward model (PRM) scores a reasoning trajectory step by step, providing denser and more fine-grained rewards. However, training a PRM requires labels annotated at every intermediate step, presenting significant challenges for both manual and automatic data collection. This paper aims to address this challenge. Both theoretically and empirically, we show that an implicit PRM can be obtained at no additional cost, by simply training an ORM on the cheaper response-level labels. The only assumption is to parameterize the outcome reward as the log-likelihood ratio of the policy and reference models, $r_\phi(y) = \beta \log \frac{\pi_\phi(y)}{\pi_{\mathrm{ref}}(y)}$, which can be optimized regardless of the specific choice of loss objectives. In experiments, we instantiate our implicit PRMs with various objectives and evaluate their performance on MATH. We show that our implicit PRM outperforms a strong MCTS-based baseline à la Math-Shepherd (Wang et al., 2023) using less than 1/38 of the training data. Its performance can be further improved with majority voting. We further find that scaling up instructions and responses benefits our implicit PRM, and the latter brings a larger gain. Particularly, we find that our implicit PRM, when instantiated with the cross-entropy (CE) loss, is more data-efficient and can keep improving generation models even when trained with only one response per instruction, the setup that suffers from extreme data scarcity and imbalance. Further, instructions should be relevant to downstream tasks while the diversity of responses does not bring gains. Surprisingly, training on extra Math-Shepherd step labels brings no further improvements to our implicit PRM trained on only outcome data. We hope that our work will encourage a rethinking of PRM training approaches and contribute to making training PRMs more accessible.
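Because the sequence log-likelihood factorizes over tokens, this reward parameterization yields per-step rewards at no extra cost; a minimal sketch with illustrative numbers (not the authors' code):

```python
# Sketch (illustrative, not the authors' code): with the outcome reward
# parameterized as r(y) = beta * log(pi_phi(y) / pi_ref(y)), the sequence
# log-likelihood factorizes over tokens, so a process reward for each prefix
# comes for free as the partial sum of per-token log-ratios.
def implicit_process_rewards(logp_policy, logp_ref, beta=1.0):
    """Per-token log-probs of one response under policy and reference."""
    rewards, cum = [], 0.0
    for lp, lr in zip(logp_policy, logp_ref):
        cum += beta * (lp - lr)
        rewards.append(cum)   # reward of the prefix ending at this token
    return rewards

r = implicit_process_rewards([-1.0, -0.5], [-1.2, -0.9])
# r[0] ≈ 0.2 and r[1] ≈ 0.6: at each step the policy gains over the reference
```

Training an ORM on response-level labels thus implicitly trains this per-step scorer, which is the paper's central observation.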
Paperid:2931
Authors:Belinda Li · Carl Guo · Jacob Andreas
Abstract:
Transformer language models (LMs) exhibit behaviors—from storytelling to code generation—that appear to require tracking the unobserved state of a changing world. How do they do so? We study state tracking in LMs trained or fine-tuned to track permutation word problems (i.e., to compute the order of a set of objects after a sequence of swaps). Despite the simple algebraic structure of this problem, many other tasks (e.g., simulation of finite automata and evaluation of boolean expressions) can be reduced to permutation word problems, making it a natural model for state tracking in general. We show that LMs consistently learn one of two state tracking mechanisms for this task. The first closely resembles the "associative scan" construction used in recent theoretical work by Liu et al. (2023) and Merrill et al. (2024). The second uses an easy-to-compute feature (permutation parity) to partially prune the space of outputs, then refines this with an associative scan. The two mechanisms exhibit markedly different robustness properties, and we show how to steer LMs toward one or the other with intermediate training tasks that encourage or suppress the heuristics. Our results demonstrate that transformer LMs, whether pretrained or fine-tuned, can learn to implement efficient, interpretable, and generalizable state tracking procedures.
Paperid:2932
Authors:Ruth Wan Theng Chew · Quoc Phong Nguyen · Bryan Kian Hsiang Low
Abstract:
Bilevel optimization is characterized by a two-level optimization structure, where the upper-level problem is constrained by optimal lower-level solutions, and such structures are prevalent in real-world problems. The constraint by optimal lower-level solutions poses significant challenges, especially in noisy, constrained, and derivative-free settings, as repeating lower-level optimizations is sample inefficient and predicted lower-level solutions may be suboptimal. We present BILevel Bayesian Optimization (BILBO), a novel Bayesian optimization algorithm for general bilevel problems with black-box functions, which optimizes both upper- and lower-level problems simultaneously, without the repeated lower-level optimization required by existing methods. BILBO samples from confidence-bound-based trusted sets, which bounds the suboptimality on the lower level. Moreover, BILBO selects only one function query per iteration, where the function query selection strategy incorporates the uncertainty of estimated lower-level solutions and includes a conditional reassignment of the query to encourage exploration of the lower-level objective. The performance of BILBO is theoretically guaranteed with a sublinear regret bound for commonly used kernels and is empirically evaluated on several synthetic and real-world problems.
Paperid:2933
Authors:Jack Min Ong · Matthew Di Ferrante · Aaron Pazdera · Ryan Garner · Sami Jaghouar · Manveer Basra · Max Ryabinin · Johannes Hagemann
Abstract:
Large language models (LLMs) have proven to be very capable, but access to the best models currently relies on inference providers, which introduces trust challenges: how can we be sure that the provider is using the model configuration they claim? We propose TOPLOC, a novel method for verifiable inference that addresses this problem. TOPLOC leverages a compact locality-sensitive hashing mechanism for intermediate activations, which can detect unauthorized modifications to models, prompts, or precision with 100\% accuracy, achieving no false positives or negatives in our empirical evaluations. Our approach is robust across diverse hardware configurations, GPU types, and algebraic reorderings, which allows for validation speeds significantly faster than the original inference. By introducing a polynomial encoding scheme, TOPLOC minimizes the memory overhead of the generated commits by $1000\times$, requiring only 258 bytes of storage per 32 new tokens compared to the 262KB requirement of storing the token embeddings directly for Llama-3.1-8B-Instruct. Our method empowers users to verify LLM inference computations efficiently, fostering greater trust and transparency in open ecosystems, and lays a foundation for verifiable, trustless AI services.
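The flavor of activation commitments can be sketched in a few lines (a crude rounding-and-hashing stand-in; TOPLOC's actual locality-sensitive hashing and polynomial encoding are more sophisticated and far more compact):

```python
import hashlib
import numpy as np

# Crude stand-in for activation commitments (TOPLOC's locality-sensitive
# hashing and polynomial encoding are more sophisticated): round the
# intermediate activations to tolerate benign numerical noise, then hash.
def commit(activations, decimals=2):
    rounded = np.round(np.asarray(activations, dtype=np.float64), decimals)
    return hashlib.sha256(rounded.tobytes()).hexdigest()

a = np.array([0.12345, -1.98765])
assert commit(a) == commit(a + 1e-6)   # tiny numerical noise: same commit
assert commit(a) != commit(a + 0.5)    # a tampered activation: detected
```

The provider publishes commits alongside outputs; a verifier recomputes activations and checks the commits, accepting small hardware-induced differences while rejecting a swapped model, prompt, or precision.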
Paperid:2934
Authors:Lisa Jin · Jianhao Ma · Zechun Liu · Andrey Gromov · Aaron Defazio · Lin Xiao
Abstract:
We develop a principled method for quantization-aware training (QAT). Specifically, we show that convex, piecewise-affine regularization (PAR) can effectively induce neural network weights to cluster towards discrete values. We minimize PAR-regularized loss functions using an aggregate proximal stochastic gradient method (AProx) and prove that it enjoys last-iterate convergence. Our approach provides an interpretation of the straight-through estimator (STE), a widely used heuristic for QAT, as the asymptotic form of PARQ. We conduct experiments to demonstrate that PARQ obtains competitive performance on convolution- and transformer-based vision tasks.
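A toy grid-attracting penalty conveys the mechanism (illustrative only; the paper's PAR is convex, unlike this simple nearest-level distance, and the levels here are invented):

```python
import numpy as np

# Toy grid-attracting penalty (illustrative; the paper's PAR is convex,
# unlike this simple nearest-level distance): each weight is charged its
# distance to the nearest quantization level, pulling weights to the grid.
levels = np.array([-1.0, 0.0, 1.0])

def penalty(w):
    # distance from each weight to its nearest level, summed over weights
    return np.abs(w[:, None] - levels[None, :]).min(axis=1).sum()

w = np.array([-0.9, 0.1, 0.45])
# per-weight distances 0.1, 0.1, 0.45, so penalty(w) ≈ 0.65
```

Adding such a term to the training loss makes weights cluster toward the discrete grid, which is the behavior PAR induces in a convex, provably convergent way via the AProx method.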
Paperid:2935
Authors:Chenlong Wang · Zhaoyang Chu · Zhengxiang Cheng · Xuyi Yang · Kaiyue Qiu · Yao Wan · Zhou Zhao · Xuanhua Shi · Hai Jin · Dongping Chen
Abstract:
Large Language Models (LLMs) have exhibited exceptional performance in software engineering yet face challenges in adapting to continually evolving code knowledge, particularly regarding the frequent updates of third-party library APIs. This limitation, stemming from static training datasets, often results in non-executable code or implementations with suboptimal safety and efficiency. This paper introduces CODESYNC, a data engine for identifying outdated code patterns and collecting real-time API knowledge updates from Python third-party libraries. Building upon CODESYNC, we develop CODESYNCBENCH, a comprehensive benchmark for assessing LLMs' abilities to stay synchronized with API evolution, which includes updates for 220 APIs from six Python libraries. Our benchmark provides 3,300 test cases across three evaluation tasks and an update-aware instruction tuning dataset comprising 2,200 training samples. Extensive experiments on 14 state-of-the-art LLMs reveal that they struggle with dynamic API evolution, even with the support of knowledge updating approaches (e.g., SFT, DPO, and ORPO). We believe our benchmark offers a strong foundation for the development of more effective real-time API knowledge updating methods in the future. The source code and dataset are publicly available at: https://anonymous.4open.science/r/CodeSync-5A9D/.
Paperid:2936
Authors:Greyson Brothers
Abstract:
We investigate the design of pooling methods used to aggregate transformer embeddings for non-sequential tasks, primarily motivated by reinforcement learning applications. This work considers problems where a subset of the input vectors contain requisite information (signal) for a downstream task while the rest are distractors (noise). By framing pooling methods as vector quantizers with the goal of minimizing signal loss, we demonstrate that the standard pooling approaches used to aggregate transformer outputs, MaxPool and AvgPool, are vulnerable to performance collapse as the signal-to-noise ratio (SNR) of inputs fluctuates. We then show that an attention-based pooling method can approximate the signal-optimal vector quantizer within derived error bounds for any SNR. Our theoretical results are validated by supervised experiments on a synthetic dataset designed to evaluate the SNR problem in isolation. These results are then generalized to standard relational reasoning and multi-agent reinforcement learning benchmark environments with noisy observations, where transformers with attention-based pooling display superior robustness across tasks.
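The SNR failure mode of average pooling, and how a query-based attention pool avoids it, can be seen in a small numpy sketch (invented vectors and query, not the paper's exact formulation):

```python
import numpy as np

# Invented vectors, not the paper's formulation: one signal token among
# distractors. A query-based attention pool keeps the signal; AvgPool
# dilutes it as the signal-to-noise ratio drops.
def attention_pool(X, q):
    scores = X @ q                     # similarity of each row to the query
    w = np.exp(scores - scores.max())  # stable softmax weights
    w /= w.sum()
    return w @ X                       # attention-weighted sum of rows

signal = np.array([1.0, 0.0])
X = np.stack([signal] + [np.array([0.0, 0.1])] * 8)  # 1 signal, 8 distractors
q = np.array([10.0, 0.0])              # stand-in for a learned query

pooled_attn = attention_pool(X, q)
pooled_avg = X.mean(axis=0)
assert pooled_attn[0] > 0.9   # signal preserved
assert pooled_avg[0] < 0.2    # signal diluted by averaging
```

As distractors are added, the averaged output shrinks the signal component toward zero, while the attention weights keep concentrating on the signal token.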
Paperid:2937
Authors:Mengting Ma · Yizhen Jiang · Mengjiao Zhao · Jiaxin Li · Wei Zhang
Abstract:
Remote sensing pansharpening aims to reconstruct spatial-spectral properties during the fusion of panchromatic (PAN) images and low-resolution multi-spectral (LR-MS) images, finally generating the high-resolution multi-spectral (HR-MS) images. In the mainstream modeling strategies, i.e., CNN and Transformer, the input images are treated as the equal-sized grid of pixels in the Euclidean space. They have limitations in facing remote sensing images with irregular ground objects. The graph is a more flexible structure; however, there are two major challenges when modeling spatial-spectral properties with a graph: 1) constructing the customized graph structure for spatial-spectral relationship priors; 2) learning the unified spatial-spectral representation through the graph. To address these challenges, we propose the spatial-spectral heterogeneous graph learning network, named HetSSNet. Specifically, HetSSNet initially constructs the heterogeneous graph structure for pansharpening, which explicitly describes pansharpening-specific relationships. Subsequently, the basic relationship pattern generation module is designed to extract the multiple relationship patterns from the heterogeneous graph. Finally, the relationship pattern aggregation module is exploited to collaboratively learn a unified spatial-spectral representation across different relationships among nodes with adaptive importance learning from local and global perspectives. Extensive experiments demonstrate the significant superiority and generalization of HetSSNet.
Paperid:2938
Authors:Ang Lv · Ruobing Xie · Yining Qian · Songhao Wu · Xingwu Sun · Zhanhui Kang · Di Wang · Rui Yan
Abstract:
Mixture-of-Experts (MoE) models mostly use a router to assign tokens to specific expert modules, activating only partial parameters and often outperforming dense models. We argue that the separation between the router's decision-making and the experts' execution is a critical yet overlooked issue, leading to suboptimal expert selection and learning. To address this, we propose Autonomy-of-Expert (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked based on their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through a low-rank weight factorization. This self-evaluating-then-partner-comparing approach ensures improved expert selection and effective learning. We pre-train language models with 700M to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.
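The router-free selection described above can be sketched as follows (invented shapes; the real AoE additionally factorizes the first projection to make pre-computation cheap):

```python
import numpy as np

# Router-free selection as described in the abstract (invented shapes):
# every expert pre-computes an internal activation for the token, experts
# are ranked by activation norm, and only the top-k finish the forward pass.
rng = np.random.default_rng(1)
d, n_experts, k = 16, 4, 2
W_in = [rng.standard_normal((d, d)) for _ in range(n_experts)]

def select_experts(x, k):
    acts = [W @ x for W in W_in]            # cheap first-stage activations
    norms = [np.linalg.norm(a) for a in acts]
    top = sorted(range(len(acts)), key=lambda i: -norms[i])[:k]
    return top, [acts[i] for i in top]      # only these experts continue

x = rng.standard_normal(d)
chosen, _ = select_experts(x, k)
assert len(chosen) == k and all(0 <= i < n_experts for i in chosen)
```

Because the selecting signal is the expert's own activation rather than a separate router's score, selection and execution are no longer decoupled.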
Paperid:2939
Authors:Md Yousuf Harun · Jhair Gallardo · Christopher Kanan
Abstract:
Out-of-distribution (OOD) detection and OOD generalization are widely studied in Deep Neural Networks (DNNs), yet their relationship remains poorly understood. We empirically show that the degree of Neural Collapse (NC) in a network layer is inversely related to these objectives: stronger NC improves OOD detection but degrades generalization, while weaker NC enhances generalization at the cost of detection. This trade-off suggests that a single feature space cannot simultaneously achieve both tasks. To address this, we develop a theoretical framework linking NC to OOD detection and generalization. We show that entropy regularization mitigates NC to improve generalization, while a fixed Simplex ETF projector enforces NC for better detection. Based on these insights, we propose a method to control NC at different DNN layers. In experiments, our method excels at both tasks across OOD datasets and DNN architectures.
Paperid:2940
Authors:Chaitanya Joshi · Xiang Fu · Yi-Lun Liao · Vahe Gharakhanyan · Benjamin Kurt Miller · Anuroop Sriram · Zachary Ulissi
Abstract:
Diffusion models are the standard toolkit for generative modelling of 3D atomic systems. However, for different types of atomic systems – such as molecules and materials – the generative processes are usually highly specific to the target system despite the underlying physics being the same. We introduce the All-atom Diffusion Transformer (ADiT), a unified latent diffusion framework for jointly generating both periodic materials and non-periodic molecular systems using the same model: (1) An autoencoder maps a unified, all-atom representation of molecules and materials to a shared latent embedding space; and (2) A diffusion model is trained to generate new latent embeddings that the autoencoder can decode to sample new molecules or materials. Experiments on QM9 and MP20 datasets demonstrate that jointly trained ADiT generates realistic and valid molecules as well as materials, exceeding state-of-the-art results from molecule- and crystal-specific models. ADiT uses standard Transformers, resulting in significant speedups during training and inference compared to equivariant diffusion models. Scaling ADiT up to half a billion parameters predictably improves performance, representing a step towards broadly generalizable foundation models for generative chemistry.
Paperid:2941
Authors:Vsevolod Viliuga · Leif Seute · Nicolas Wolf · Simon Wagner · Arne Elofsson · Jan Stuehmer · Frauke Gräter
Abstract:
Recent advances in geometric deep learning and generative modeling have enabled the design of novel proteins with a wide range of desired properties. However, current state-of-the-art approaches are typically restricted to generating proteins with only static target properties, such as motifs and symmetries. In this work, we take a step towards overcoming this limitation by proposing a framework to condition structure generation on flexibility, which is crucial for key functionalities such as catalysis or molecular recognition. We first introduce BackFlip, an equivariant neural network for predicting per-residue flexibility from an input backbone structure. Relying on BackFlip, we propose an SE(3)-equivariant conditional flow matching model that solves the inverse problem, that is, generating backbones that display a target flexibility profile. In our experiments, we show that GAFL-Flex is able to generate novel protein backbones with the desired local flexibility, verified by Molecular Dynamics (MD) simulations.
Paperid:2942
Authors:Spencer Young · Porter Jenkins · Longchao Da · Jeffrey Dotson · Hua Wei
Abstract:
Neural networks capable of accurate, input-conditional uncertainty representation are essential for real-world AI systems. Deep ensembles of Gaussian networks have proven highly effective for continuous regression due to their ability to flexibly represent aleatoric uncertainty via unrestricted heteroscedastic variance, which in turn enables accurate epistemic uncertainty estimation. However, no analogous approach exists for $\textit{count}$ regression, despite many important applications. To address this gap, we propose the Deep Double Poisson Network (DDPN), a novel neural discrete count regression model that outputs the parameters of the Double Poisson distribution, enabling arbitrarily high or low predictive aleatoric uncertainty for count data and improving epistemic uncertainty estimation when ensembled. We formalize and prove that DDPN exhibits robust regression properties similar to heteroscedastic Gaussian models via learnable loss attenuation, and introduce a simple loss modification to control this behavior. Experiments on diverse datasets demonstrate that DDPN outperforms current baselines in accuracy, calibration, and out-of-distribution detection, establishing a new state-of-the-art in deep count regression.
Paperid:2943
Authors:Yaoxuan Feng · Wenchao Chen · yuxin li · Bo Chen · Yubiao Wang · Zixuan Zhao · Hongwei Liu · Mingyuan Zhou
Abstract:
Diffusion models have demonstrated outstanding performance in industrial anomaly detection. However, their iterative denoising nature results in slow inference speed, limiting their practicality for real-time industrial deployment. To address this challenge, we propose OmiAD, a one-step masked diffusion model for multi-class anomaly detection, derived from a well-designed multi-step Adaptive Masked Diffusion Model (AMDM) and compressed using Adversarial Score Distillation (ASD). OmiAD first introduces AMDM, equipped with an adaptive masking strategy that dynamically adjusts masking patterns based on noise levels and encourages the model to reconstruct anomalies as normal counterparts by leveraging broader context, to reduce the pixel-level shortcut reliance. Then, ASD is developed to compress the multi-step diffusion process into a single-step generator by score distillation, incorporating a shared-weight discriminator that effectively reuses parameters while significantly improving both inference efficiency and detection performance. The effectiveness of OmiAD is validated on four diverse datasets, achieving state-of-the-art performance across seven metrics while delivering a remarkable inference speedup.
Paperid:2944
Authors:Yuwei Liu · Xin Chen · Jiaming Zhuo · Di Jin · Chuan Wang · Yuanfang Guo · Xiaochun Cao · Zhen Wang
Abstract:
The distribution shifts and the scarcity of labels prevent graph learning methods, especially graph neural networks (GNNs), from generalizing across domains. Compared to Unsupervised Domain Adaptation (UDA) with embedding alignment, Unsupervised Graph Domain Adaptation (UGDA) becomes more challenging in light of the attribute and topology entanglement in the representation. Beyond embedding alignment, UGDA turns to topology alignment but is limited by the ability of the employed topology model and the estimation of pseudo labels. To alleviate this issue, this paper proposes a Disentangled Graph Spectral Domain adaptation (DGSDA) method by disentangling attribute and topology alignments and directly aligning flexible graph spectral filters beyond topology. Specifically, Bernstein polynomial approximation, which mimics the behavior of the function to be approximated to a remarkable degree, is employed to capture complicated topology characteristics and avoid expensive eigenvalue decomposition. Theoretical analysis reveals the tight GDA bound of DGSDA and the rationality of polynomial coefficient regularization. Quantitative and qualitative experiments justify the superiority of the proposed DGSDA.
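For intuition, a degree-K Bernstein-basis spectral filter expresses g(λ) over the normalized-Laplacian spectrum [0, 2] through learnable coefficients. The sketch below evaluates the filter on eigenvalues for illustration only (in practice such filters are applied via matrix polynomials precisely to avoid eigendecomposition; this helper is our assumption, not DGSDA's implementation):

```python
import numpy as np
from math import comb

def bernstein_filter(eigvals, theta):
    """Evaluate g(lambda) = sum_k theta_k * B_{k,K}(lambda/2), where
    B_{k,K} are Bernstein basis polynomials on [0, 1] and K+1 = len(theta).
    Nonnegative coefficients give a nonnegative (stable) filter."""
    K = len(theta) - 1
    x = np.asarray(eigvals, dtype=float) / 2.0   # map [0, 2] -> [0, 1]
    out = np.zeros_like(x)
    for k, t in enumerate(theta):
        out += t * comb(K, k) * (1 - x) ** (K - k) * x ** k
    return out
```

Because the Bernstein basis forms a partition of unity, setting all coefficients to 1 yields the all-pass filter, a handy sanity check.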
Paperid:2945
Authors:Xingwu Sun · Shuaipeng Li · Ruobing Xie · Weidong Han · Kan Wu · Zhen Yang · Yixing Li · An Wang · SHUAI LI · Jinbao Xue · Yu Cheng · Yangyu Tao · Zhanhui Kang · ChengZhong Xu · Di Wang · Jie Jiang
Abstract:
Low-precision training is considered an effective strategy for reducing both training and downstream inference costs. Previous scaling laws for precision mainly focus on integer quantization, paying less attention to the constituents of floating-point quantization, and thus cannot fit the LLM losses in this scenario well. In contrast, while floating-point quantization training is more commonly implemented in production, the research on it has been relatively superficial. In this paper, we thoroughly explore the effects of the floating-point quantization targets, exponent bits, mantissa bits, and the calculation granularity of the scaling factor on the training performance of LLM models. While presenting an accurate unified scaling law for floating-point quantization, we also provide valuable suggestions for the community: (1) Exponent bits contribute slightly more to the model performance than mantissa bits. We provide the optimal exponent-mantissa bit ratio for different bit numbers, which is available for future reference by hardware manufacturers; (2) We discover the formation of the critical data size in low-precision LLM training. Too much training data exceeding the critical data size will inversely degrade LLM performance; (3) The optimal floating-point quantization precision is directly proportional to the computational power, but within a wide computational power range, we estimate that the best cost-performance precision lies between 4-8 bits.
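To make the exponent/mantissa trade-off concrete, here is a toy rounding model for a single value under a given bit budget. It ignores subnormals, infinities, and NaNs, so it is a sketch for intuition only, not the quantizers studied in the paper:

```python
import math

def fp_quantize(x, e_bits, m_bits):
    """Round x to the nearest value representable with e_bits of
    exponent and m_bits of mantissa (sign handled separately).
    More mantissa bits -> finer spacing within a binade; more
    exponent bits -> wider dynamic range."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    x = abs(x)
    bias = 2 ** (e_bits - 1) - 1
    e = math.floor(math.log2(x))
    e = max(min(e, bias), -bias)       # clamp to representable exponents
    scale = 2.0 ** (e - m_bits)        # spacing of representable values
    return sign * round(x / scale) * scale
```

For example, with 4 exponent and 3 mantissa bits, values near 3.14 snap to a grid of spacing 0.25.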
Paperid:2946
Authors:Yuchun Miao · Sen Zhang · Liang Ding · Yuqi Zhang · Lefei Zhang · Dacheng Tao
Abstract:
This work identifies the Energy Loss Phenomenon in Reinforcement Learning from Human Feedback (RLHF) and its connection to reward hacking. Specifically, energy loss in the final layer of a Large Language Model (LLM) gradually increases during the RL process, with an excessive increase in energy loss characterizing reward hacking. Beyond empirical analysis, we further provide a theoretical foundation by proving that, under mild conditions, the increased energy loss reduces the upper bound of contextual relevance in LLMs, which is a critical aspect of reward hacking as the reduced contextual relevance typically indicates overfitting to reward-model-favored patterns in RL. To address this issue, we propose an Energy loss-aware PPO algorithm (EPPO), which penalizes the increase in energy loss in the LLM's final layer during reward calculation to prevent excessive energy loss, thereby mitigating reward hacking. We theoretically show that EPPO can be conceptually interpreted as an entropy-regularized RL algorithm, which provides deeper insights into its effectiveness. Extensive experiments across various LLMs and tasks demonstrate the commonality of the energy loss phenomenon, as well as the effectiveness of EPPO in mitigating reward hacking and improving RLHF performance.
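The reward-shaping idea, penalizing energy-loss growth during reward calculation, can be illustrated in one line. The clamp at zero, the reference value, and the coefficient are our assumptions about a reasonable form, not the paper's exact penalty:

```python
def eppo_shaped_reward(reward, energy_loss, ref_energy, coef=0.1):
    """Subtract a penalty proportional to the increase of final-layer
    energy loss over a reference value, discouraging the excessive
    growth associated with reward hacking (hypothetical constants)."""
    return reward - coef * max(0.0, energy_loss - ref_energy)
```

When the energy loss stays at or below the reference, the reward passes through unchanged.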
Paperid:2947
Authors:Xin Zheng · Wei Huang · Chuan Zhou · Ming Li · Shirui Pan
Abstract:
In this work, we address the test-time adaptation challenge in graph neural networks (GNNs), focusing on overcoming the limitations in flexibility and generalization inherent in existing data-centric approaches. To this end, we propose a novel research problem, test-time graph neural dataset search, which seeks to learn a parameterized test-time graph distribution to enhance the inference performance of unseen test graphs on well-trained GNNs. Specifically, we propose a generative Projection based test-time Graph Neural Dataset Search method, named PGNDS, which maps the unseen test graph distribution back to the known training distribution through a generation process guided by well-trained GNNs. The proposed PGNDS framework consists of three key modules: (1) dual conditional diffusion for GNN-guided generative projection through test-back-to-training distribution mapping; (2) dynamic search from the generative sampling space to select the most expressive test graphs; (3) ensemble inference to aggregate information from original and adapted test graphs. Extensive experiments on real-world graphs demonstrate the superior ability of our proposed PGNDS for improved test-time GNN inference.
Paperid:2948
Authors:Yuhang Li · Tong Liu · Yangguang Cui · Ming Hu · Xiaoqiang Li
Abstract:
Federated learning (FL) presents a promising strategy for distributed and privacy-preserving learning, yet struggles with performance issues in the presence of heterogeneous data distributions. Recently, a series of works based on sharpness-aware minimization (SAM) have emerged to improve local learning generality, proving to be effective in mitigating data heterogeneity effects. However, most SAM-based methods do not directly consider the global objective and require two backward passes per iteration, resulting in diminished effectiveness. To overcome these two bottlenecks, we leverage the global model trajectory to directly measure sharpness for the global objective, requiring only a single backward pass. We further propose a novel and general algorithm FedGMT to overcome data heterogeneity and the pitfalls of previous SAM-based methods. We analyze the convergence of FedGMT and conduct extensive experiments on visual and text datasets in a variety of scenarios, demonstrating that FedGMT achieves competitive accuracy with state-of-the-art FL methods while minimizing computation and communication overhead.
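For context, a vanilla SAM update needs two gradient evaluations per step, which is exactly the overhead the abstract refers to (FedGMT's single-pass, trajectory-based sharpness measure is not reproduced here). A minimal numpy sketch of the standard two-pass update:

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One sharpness-aware minimization step: perturb weights toward
    the local ascent direction, then descend using the gradient at
    the perturbed point.  grad_fn(w) stands in for a backward pass."""
    g = grad_fn(w)                               # first backward pass
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent perturbation
    g_sharp = grad_fn(w + eps)                   # second backward pass
    return w - lr * g_sharp
```

On the quadratic `f(w) = w**2` starting from `w = 1`, one step lands at roughly 0.79 instead of plain gradient descent's 0.8, reflecting the sharpness-penalizing perturbation.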
Paperid:2949
Authors:Yuan Zhang · Chun-Kai Fan · Junpeng Ma · Wenzhao Zheng · Tao Huang · Kuan Cheng · Denis Gudovskiy · Tomoyuki Okuno · Yohei Nakata · Kurt Keutzer · Shanghang Zhang
Abstract:
In vision-language models (VLMs), visual tokens usually consume a significant amount of computational overhead, despite their sparser information density compared to text tokens. To address this, most existing methods learn a network to prune redundant visual tokens and require additional training data. Differently, we propose an efficient training-free token optimization mechanism dubbed SparseVLM without extra parameters or fine-tuning costs. Concretely, given that visual tokens complement text tokens in VLMs for linguistic reasoning, we select visual-relevant text tokens to rate the significance of vision tokens within the self-attention matrix extracted from the VLMs. Then we progressively prune irrelevant tokens. To maximize sparsity while retaining essential information, we introduce a rank-based strategy to adaptively determine the sparsification ratio for each layer, alongside a token recycling method that compresses pruned tokens into more compact representations. Experimental results show that our SparseVLM improves the efficiency of various VLMs across a range of image and video understanding tasks. In particular, when LLaVA is equipped with SparseVLM, it achieves a 54\% reduction in FLOPs, lowers CUDA time by 37\%, and maintains an accuracy rate of 97\%.
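The core rating step, scoring visual tokens by the attention mass they receive from relevant text tokens and keeping the top fraction, can be sketched as follows. The rank-based per-layer ratio and token recycling are omitted; this is our simplification, not the released implementation:

```python
import numpy as np

def prune_visual_tokens(attn, keep_ratio=0.5):
    """Keep the visual tokens with the highest average attention from
    text tokens.  attn has shape (num_text_tokens, num_visual_tokens);
    returns the sorted indices of the retained visual tokens."""
    scores = attn.mean(axis=0)                     # per-visual-token rating
    k = max(1, int(keep_ratio * attn.shape[1]))    # how many to keep
    keep = np.sort(np.argsort(scores)[::-1][:k])   # top-k, original order
    return keep
```

Returning indices in their original order preserves the positional layout of the surviving tokens for the following layers.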
Paperid:2950
Authors:Tom Wollschläger · Jannes Elstner · Simon Markus Geisler · Vincent Cohen-Addad · Stephan Günnemann · Johannes Gasteiger
Abstract:
The safety alignment of state-of-the-art large language models (LLMs) can be circumvented through adversarially crafted inputs, yet how these attacks bypass safety barriers remains poorly understood. Prior work suggests that there exists one refusal direction that, if "activated," determines whether an LLM refuses a request. In this work, we use a novel gradient-based approach to search for such refusal directions. Contrary to prior work, we find multiple independent directions and even multi-dimensional concept cones mediating refusal. Furthermore, we show that orthogonality of directions does not imply independence under intervention, motivating the notion of representational independence to capture both linear and non-linear effects, and use it to find genuinely independent directions. We demonstrate that exploiting multiple refusal directions can yield higher attack success rates, suggesting that each direction captures complementary aspects of the refusal process. The existence of multi-dimensional refusal cones indicates that refusal mechanisms in LLMs are governed by complex spatial structures, while the existence of multiple representationally independent directions emphasizes that different mechanisms exist. Our gradient-based representation engineering uncovers these mechanisms and can serve as a foundation for future work on understanding refusal behavior.
Paperid:2951
Authors:Yash Shah · Camila Gonzalez · MohammadHassan Abbasi · Qingyu Zhao · Kilian M Pohl · Ehsan Adeli
Abstract:
Confounders are extraneous variables that affect both the input and the target, resulting in spurious correlations and biased predictions. There are recent advances in dealing with or removing confounders in traditional models, such as metadata normalization (MDN), where the distribution of the learned features is adjusted based on the study confounders. However, in the context of continual learning, where a model learns continuously from new data over time without forgetting, learning feature representations that are invariant to confounders remains a significant challenge. To remove the influence of confounding variables from intermediate feature representations, we introduce the Recursive MDN (R-MDN) layer, which can be integrated into any deep learning architecture, including vision transformers, and at any model stage. R-MDN performs statistical regression via the recursive least squares algorithm to maintain and continually update an internal model state with respect to changing distributions of data and confounding variables. Our experiments demonstrate that R-MDN promotes equitable predictions across population groups, both within static learning and across different stages of continual learning, by reducing catastrophic forgetting caused by confounder effects changing over time.
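The recursive least squares machinery that lets such a layer track drifting confounder distributions fits in a few lines. The standalone form below (outside any network, no forgetting factor) is our illustrative assumption, not the R-MDN layer itself:

```python
import numpy as np

class RecursiveResidualizer:
    """Online removal of a linear confounder effect from features:
    maintain a recursive least squares fit of features on confounders
    and return the residual, updating the fit with each sample."""

    def __init__(self, dim_c, dim_f, lam=1e3):
        self.P = lam * np.eye(dim_c)        # inverse covariance estimate
        self.B = np.zeros((dim_c, dim_f))   # confounder -> feature weights

    def update(self, c, f):
        c = c.reshape(-1, 1)
        Pc = self.P @ c
        gain = Pc / (1.0 + c.T @ Pc)        # RLS gain vector
        self.B += gain @ (f.reshape(1, -1) - c.T @ self.B)
        self.P -= gain @ Pc.T
        return f - (c.T @ self.B).ravel()   # confounder-free residual
```

Feeding a feature that is purely a linear function of the confounder drives the residual toward zero within a few updates.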
Paperid:2952
Authors:Yuanzhe Hu · Kinshuk Goel · Vlad Killiakov · Yaoqing Yang
Abstract:
Researchers are increasingly interested in studying the eigenspectrum of weight matrices of deep neural networks (DNNs) and its intricate relationship with model quality. At a high level, eigenspectrum analysis of weight matrices involves measuring the \emph{heavy-tailness} of the empirical spectral density (ESD) of a weight matrix, and it provides insight into how well a model is trained and can guide decisions related to assigning better layerwise training hyperparameters. In this paper, we address a specific challenge associated with such eigenspectrum methods: the impact of the aspect ratio of weight matrices on estimated heavy-tailness metrics. Through empirical analysis, we demonstrate that matrices of varying sizes (and aspect ratios) introduce a non-negligible bias in the estimation of heavy-tailness metrics, leading to inaccurate model diagnosis and hyperparameter scheduling. To overcome this limitation, we propose normalizing the weight matrices by subsampling submatrices with a fixed aspect ratio. Instead of measuring the heavy-tailness of the original ESD, we measure the average ESD of these subsampled submatrices. We validate our approach across various optimization techniques and application domains that involve eigenspectrum analysis of weights, including image classification in computer vision (CV) models, scientific machine learning (SciML) model training, and large language model (LLM) pruning. Our results show that despite its simplicity, matrix subsampling uniformly improves the accuracy of eigenspectrum analysis while enabling more balanced and effective model optimization in these application domains. In one of the LLM pruning experiments, our method reduces the perplexity of the LLaMA-7B model by 19.74\% when compared with the state-of-the-art method.
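The normalization idea, pooling spectra of submatrices that all share one aspect ratio instead of analyzing full matrices of varying shapes, can be sketched with square subsampling (square is our choice of fixed ratio; the paper's exact sampling scheme may differ):

```python
import numpy as np

def subsampled_esd(W, n_sub=10, seed=0):
    """Pool the squared singular values (eigenvalues of sub.T @ sub) of
    randomly subsampled square submatrices of W, so heavy-tailness
    metrics are estimated at a fixed 1:1 aspect ratio regardless of
    W's original shape."""
    rng = np.random.default_rng(seed)
    s = min(W.shape)                        # side of each square submatrix
    evs = []
    for _ in range(n_sub):
        rows = rng.choice(W.shape[0], size=s, replace=False)
        cols = rng.choice(W.shape[1], size=s, replace=False)
        sub = W[np.ix_(rows, cols)]
        evs.append(np.linalg.svd(sub, compute_uv=False) ** 2)
    return np.concatenate(evs)
```

Any heavy-tailness metric (e.g. a power-law exponent fit) is then computed on the pooled ESD instead of the original one.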
Paperid:2953
Authors:Adeel Pervez · Efstratios Gavves · Francesco Locatello
Abstract:
We present Mechanistic PDE Networks, a model for the discovery of governing partial differential equations from data. Mechanistic PDE Networks represent spatiotemporal data as space-time dependent linear partial differential equations in neural network hidden representations. The represented PDEs are then solved and decoded for specific tasks. The learned PDE representations naturally express the spatiotemporal dynamics in data in neural network hidden space, enabling increased modeling power. Solving the PDE representations in a compute- and memory-efficient way, however, is a significant challenge. We develop a native, GPU-capable, parallel, sparse and differentiable multigrid solver specialized for linear partial differential equations that acts as a module in Mechanistic PDE Networks. Leveraging the PDE solver, we propose a discovery architecture that can discover nonlinear PDEs in complex settings, while being robust to noise. We validate PDE discovery on a number of PDEs including reaction-diffusion and Navier-Stokes equations.
Paperid:2954
Authors:Yi Feng · Kaito Fujii · EFSTRATIOS PANTELEIMON SKOULAKIS · Xiao Wang · Volkan Cevher
Abstract:
Since Polyak's pioneering work, heavy ball (HB) momentum has been widely studied in minimization. However, its role in min-max games remains largely unexplored. As a key component of practical min-max algorithms like Adam, this gap limits their effectiveness. In this paper, we present a continuous-time analysis for HB with simultaneous and alternating update schemes in min-max games. Locally, we prove smaller momentum enhances algorithmic stability by enabling local convergence across a wider range of step sizes, with alternating updates generally converging faster. Globally, we study the implicit regularization of HB, and find smaller momentum guides algorithm trajectories towards shallower slope regions of the loss landscapes, with alternating updates amplifying this effect. Surprisingly, all these phenomena differ from those observed in minimization, where larger momentum yields similar effects. Our results reveal fundamental differences between HB in min-max games and minimization, and numerical experiments further validate our theoretical results.
Paperid:2955
Authors:Ruo-Jing Dong · Yu Yao · Bo Han · Tongliang Liu
Abstract:
Semantic Dependency refers to the relationship between words in a sentence where the meaning of one word depends on another, which is important for natural language understanding. In this paper, we investigate the role of semantic dependencies in answering questions for transformer models, which is achieved by analyzing how token values shift in response to changes in semantics. Through extensive experiments on models including the BERT series, GPT, and LLaMA, we uncover the following key findings: 1) Most tokens primarily retain their original semantic information even as they propagate through multiple layers. 2) Models can encode truthful semantic dependencies in tokens in the final layer. 3) Mistakes in model answers often stem from specific tokens encoded with incorrect semantic dependencies. Furthermore, we found that addressing the incorrectness by directly adjusting parameters is challenging because the same parameters can encode both correct and incorrect semantic dependencies depending on the context. Our findings provide insights into the causes of incorrect information generation in transformers and help the future development of robust and reliable models.
Paperid:2956
Authors:Kai Müller · Martin Wenzel · Tobias Windisch
Abstract:
Many production lines require active control mechanisms, such as adaptive routing, worker reallocation, and rescheduling, to maintain optimal performance. However, designing these control systems is challenging for various reasons. While reinforcement learning (RL) has shown promise in addressing these challenges, a standardized and general framework is still lacking. In this work, we introduce LineFlow, an extensible, open-source Python framework for simulating production lines of arbitrary complexity and training RL agents to control them. To demonstrate the capabilities and to validate the theoretical foundations of LineFlow, we formulate core subproblems of active line control in ways that facilitate mathematical analysis. For each problem, we provide optimal solutions for comparison. We benchmark state-of-the-art RL algorithms and show that the learned policies approach optimal performance in well-understood scenarios. However, for more complex, industrial-scale production lines, RL still faces significant challenges, highlighting the need for further research in areas such as reward shaping, curriculum learning, and hierarchical control.
Paperid:2957
Authors:Guoqing Chao · Zhenghao Zhang · Lei Meng · Jie Wen · Dianhui Chu
Abstract:
Federated multi-view clustering has been proposed to mine the valuable information within multi-view data distributed across different devices, and has achieved impressive results while preserving privacy. Despite great progress, most federated multi-view clustering methods only use global pseudo-labels to guide the downstream clustering process and fail to exploit the global information when extracting features. In addition, the missing data problem in the federated multi-view clustering task is less explored. To address these problems, we propose a novel Federated Incomplete Multi-view Clustering method with globally Fused Graph guidance (FIMCFG). Specifically, we design a dual-head graph convolutional encoder at each client to extract two kinds of underlying features containing global and view-specific information. Subsequently, under the guidance of the fused graph, the two underlying features are fused into high-level features, based on which clustering is conducted under the supervision of pseudo-labeling. Finally, the high-level features are uploaded to the server to refine the graph fusion and pseudo-labeling computation. Extensive experimental results demonstrate the effectiveness and superiority of FIMCFG.
Paperid:2958
Authors:Zeyuan Ma · Zhiguang Cao · Zhou Jiang · Hongshu Guo · Yue-Jiao Gong
Abstract:
Recent progress in Meta-Black-Box-Optimization (MetaBBO) has demonstrated that using RL to learn a meta-level policy for dynamic algorithm configuration (DAC) over an optimization task distribution could significantly enhance the performance of the low-level BBO algorithm. However, the online learning paradigms in existing works make the efficiency of MetaBBO problematic. To address this, we propose an offline learning-based MetaBBO framework in this paper, termed Q-Mamba, to attain both effectiveness and efficiency in MetaBBO. Specifically, we first transform the DAC task into a long-sequence decision process. This allows us to further introduce an effective Q-function decomposition mechanism to reduce the learning difficulty within the intricate algorithm configuration space. Under this setting, we propose three novel designs to meta-learn the DAC policy from offline data: we first propose a novel collection strategy for constructing an offline DAC experience dataset with balanced exploration and exploitation. We then establish a decomposition-based Q-loss that incorporates conservative Q-learning to promote stable offline learning from the offline dataset. To further improve offline learning efficiency, we equip our work with a Mamba architecture, which aids long-sequence learning effectiveness and efficiency through its selective state model and hardware-aware parallel scan, respectively. Through extensive benchmarking, we observe that Q-Mamba achieves competitive or even superior performance to prior online/offline baselines, while significantly improving the training efficiency of existing online baselines. We provide the source code of Q-Mamba at https://anonymous.4open.science/r/Q-Mamba-860A.
Paperid:2959
Authors:Mateo Espinosa Zarlenga · Gabriele Dominici · Pietro Barbiero · Zohreh Shams · Mateja Jamnik
Abstract:
In this paper, we investigate how concept-based models (CMs) respond to out-of-distribution (OOD) inputs. CMs are interpretable neural architectures that first predict a set of high-level concepts (e.g., "stripes", "black") and then predict a task label from those concepts. In particular, we study the impact of concept interventions (i.e., operations where a human expert corrects a CM’s mispredicted concepts at test time) on CMs' task predictions when inputs are OOD. Our analysis reveals a weakness in current state-of-the-art CMs, which we term leakage poisoning, that prevents them from properly improving their accuracy when intervened on for OOD inputs. To address this, we introduce MixCEM, a new CM that learns to dynamically exploit leaked information missing from its concepts only when this information is in-distribution. Our results across tasks with and without complete sets of concept annotations demonstrate that MixCEMs outperform strong baselines by significantly improving their accuracy for both in-distribution and OOD samples in the presence and absence of concept interventions.
Paperid:2960
Authors:Haoxuan Li · Zeyu Tang · Zhichao Jiang · Zhuangyan Fang · Yue Liu · zhi geng · Kun Zhang
Abstract:
Fairness in human and algorithmic decision-making is crucial in areas such as criminal justice, education, and social welfare. Recently, counterfactual fairness has drawn increasing research interest, suggesting that decision-making for individuals should remain the same when intervening with different values on protected attributes. Nevertheless, the question of "which attributes and individuals should be protected" is rarely discussed in the existing counterfactual fairness literature. For example, when considering leg disability as a protected attribute, the algorithms should not treat individuals with leg disabilities differently in college admissions, but one may naturally consider this factor when selecting runner athletes. In other words, when and how to enforce fairness is expected to depend on the causal relation between the protected attribute and the outcome of interest. Formally, this paper proposes principal counterfactual fairness using the concept of principal stratification from the causal inference literature, focusing on whether an algorithm is counterfactually fair for individuals whose protected attribute has no individual causal effect on the outcome of interest. To examine whether an algorithm satisfies principal counterfactual fairness, we derive the statistical bounds and propose a post-processing approach to achieving principal counterfactual fairness with minimal individual decision changes. Experiments are conducted using synthetic and real-world datasets to verify the effectiveness of our methods.
Paperid:2961
Authors:Zhiqiang Wang · Xiaoyi Wang · Jianqing Liang
Abstract:
Graph Neural Ordinary Differential Equations (GODEs) integrate the Variational Autoencoder (VAE) framework with differential equations, effectively modeling latent space uncertainty and continuous dynamics, and excelling at handling evolving and incomplete graph data. However, existing GODEs face challenges in capturing time-varying relationships and nonlinear node state evolution, which limits their ability to model complex dynamic graphs. To address these issues, we propose the ControlSynth Graph ODE (CSG-ODE). In the VAE encoding phase, CSG-ODE introduces an information transmission-based inter-node importance weighting mechanism, integrating it with latent correlations to guide adaptive graph convolutional recurrent networks for temporal node embedding. During decoding, CSG-ODE employs ODEs to model node dynamics, capturing nonlinear evolution through sub-networks with nonlinear activations. For high-stability scenarios, we extend CSG-ODE to stable CSG-ODE (SCSG-ODE) by constraining weight matrices to learnable anti-symmetric forms, theoretically ensuring enhanced stability. Experiments on traffic, motion capture, and simulated physical systems datasets demonstrate that CSG-ODE outperforms state-of-the-art GODEs, while SCSG-ODE achieves both superior performance and optimal stability.
Paperid:2962
Authors:Yusong Wu · Christos Tsirigotis · Ke Chen · Cheng-Zhi Anna Huang · Aaron Courville · Oriol Nieto · Prem Seetharaman · Justin Salamon
Abstract:
Recent multimodal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works use temporal-aware labels or unsupervised training to improve frame-wise capabilities, but they still lack fine-grained labeling capability to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events. In this work, we introduce FLAM, an open-vocabulary contrastive audio-language model capable of localizing specific sound events. FLAM employs a memory-efficient and calibrated frame-wise objective with logit adjustment to address spurious correlations, such as event dependencies and label imbalances during training. To enable frame-wise supervision, we leverage a large-scale dataset with diverse audio events, LLM-generated captions and simulation. Experimental results and case studies demonstrate that FLAM significantly improves the open-vocabulary localization capability while maintaining strong performance in global retrieval and downstream tasks.
Paperid:2963
Authors:Cheng · Hadi Amiri
Abstract:
Tool-augmented large language models (LLMs) may need to forget learned tools due to security concerns, privacy restrictions, or deprecated tools. However, ``tool unlearning'' has not been investigated in the machine unlearning literature. We introduce this novel task, which requires addressing distinct challenges compared to traditional unlearning: knowledge removal rather than forgetting individual samples, the high cost of optimizing LLMs, and the need for principled evaluation metrics. To bridge these gaps, we propose ToolDelete, the first approach for unlearning tools from tool-augmented LLMs, which implements three properties for effective tool unlearning, and a new membership inference attack (MIA) model for evaluation. Experiments on three tool learning datasets and tool-augmented LLMs show that ToolDelete effectively unlearns both randomly selected and category-specific tools, while preserving the LLM's knowledge on non-deleted tools and maintaining performance on general tasks.
Paperid:2964
Authors:Zhen Wang · Yilei JIANG · Dong Zheng · Jun Xiao · Long Chen
Abstract:
Customized Image Generation, generating customized images with user-specified concepts, has attracted significant attention due to its creativity and novelty. With impressive progress achieved in subject customization, some pioneering works further explored the customization of actions and interactions beyond entity (i.e., human, animal, and object) appearance. However, these approaches only focus on basic actions and interactions between two entities, and their effects are limited by insufficient ''exactly the same'' reference images. To extend customized image generation to more complex scenes for general real-world applications, we propose a new task: event-customized image generation. Given a single reference image, we define the ''event'' as all specific actions, poses, relations, or interactions between different entities in the scene. This task aims at accurately capturing the complex event and generating customized images with various target entities. To solve this task, we propose a novel training-free event customization method: FreeEvent. Specifically, FreeEvent introduces two extra paths alongside the general diffusion denoising process: 1) Entity switching path: it applies cross-attention guidance and regulation for target entity generation. 2) Event transferring path: it injects the spatial feature and self-attention maps from the reference image to the target image for event generation. To further facilitate this new task, we collected two evaluation benchmarks: SWiG-Event and Real-Event. Extensive experiments and ablations have demonstrated the effectiveness of FreeEvent.
Paperid:2965
Authors:Mauricio Soroco · Jialin Song · Mengzhou Xia · Kye Emond · Weiran Sun · Wuyang Chen
Abstract:
We present PDE-Controller, a framework that enables large language models (LLMs) to control systems governed by partial differential equations (PDEs). Traditional LLMs have excelled in commonsense reasoning but fall short in rigorous logical reasoning. While recent AI-for-math has made strides in pure mathematics, areas of applied mathematics, particularly PDEs, remain underexplored despite their significant real-world applications. Our approach enables LLMs to transform informal natural language instructions into formal specifications, and then execute reasoning and planning steps to improve the utility of PDE control. We build a holistic solution comprising datasets (both human-written cases and 2 million synthetic samples), math-reasoning models, and novel evaluation metrics, all of which require significant effort. Our PDE-Controller significantly outperforms the latest open-source and GPT models in reasoning, autoformalization, and program synthesis, achieving up to a 62% improvement in utility gain for PDE control. By bridging the gap between language generation and PDE systems, we demonstrate the potential of LLMs in addressing complex scientific and engineering challenges. We promise to release all data, model checkpoints, and code upon acceptance.
Paperid:2966
Authors:Yuanchao Dai · Ximing Li · Changchun Li
Abstract:
Training a precise binary classifier with limited supervision in weakly supervised learning scenarios is of considerable practical and research significance. Leveraging pairwise unlabeled data with confidence differences has been demonstrated to outperform learning from pointwise unlabeled data. We theoretically analyze the various supervisory signals reflected by confidence differences in confidence-difference (ConfDiff) classification and identify challenges arising from noisy signals when confidence differences are small. To address this, we partition the dataset into two subsets with distinct supervisory signals and propose a consistency regularization-based risk estimator that encourages similar outputs for similar instances, mitigating the impact of noisy supervision. We further derive and analyze its estimation error bounds theoretically. Extensive experiments on benchmark and UCI datasets demonstrate the effectiveness of our method. Additionally, to effectively capture the influence of real-world noise on the confidence difference, we artificially perturb the confidence difference distribution and demonstrate the robustness of our method under noisy conditions through comprehensive experiments.
Paperid:2967
Authors:Xutong Liu · Xiangxiang Dai · Jinhang Zuo · Siwei Wang · Carlee Joe-Wong · John C. S. Lui · Wei Chen
Abstract:
The combinatorial multi-armed bandit (CMAB) is a fundamental sequential decision-making framework, extensively studied over the past decade. However, existing work primarily focuses on the online setting, overlooking the substantial costs of online interactions and the readily available offline datasets. To overcome these limitations, we introduce Off-CMAB, the first offline learning framework for CMAB. Central to our framework is the combinatorial lower confidence bound (CLCB) algorithm, which combines pessimistic reward estimations with combinatorial solvers. To characterize the quality of offline datasets, we propose two novel data coverage conditions and prove that, under these conditions, CLCB achieves a near-optimal suboptimality gap, matching the theoretical lower bound up to a logarithmic factor. We validate Off-CMAB through practical applications, including learning to rank, large language model (LLM) caching, and social influence maximization, showing its ability to handle nonlinear reward functions, general feedback models, and out-of-distribution action samples that exclude optimal or even feasible actions. Extensive experiments on synthetic and real-world datasets further highlight the superior performance of CLCB.
Paperid:2968
Authors:Dahun Shin · Dongyeop Lee · Jinseok Chung · Namhoon Lee
Abstract:
Approximate second-order optimization methods often exhibit poorer generalization compared to first-order approaches. In this work, we look into this issue through the lens of the loss landscape and find that existing second-order methods tend to converge to sharper minima compared to SGD. In response, we propose Sassha, a novel second-order method designed to enhance generalization by explicitly reducing the sharpness of the solution, while stabilizing the computation of approximate Hessians along the optimization trajectory. In fact, this sharpness minimization scheme is also crafted to accommodate lazy Hessian updates, so as to secure efficiency besides flatness. To validate its effectiveness, we conduct a wide range of standard deep learning experiments where Sassha demonstrates outstanding generalization performance that is comparable to, and mostly better than, other methods. We provide a comprehensive set of analyses including convergence, robustness, stability, efficiency, and cost.
Paperid:2969
Authors:Stefano Cortinovis · Francois Caron
Abstract:
Prediction-powered inference (PPI) enables valid statistical inference by combining experimental data with machine learning predictions. When a sufficient number of high-quality predictions is available, PPI results in more accurate estimates and tighter confidence intervals than traditional methods. In this paper, we propose to inform the PPI framework with prior knowledge on the quality of the predictions. The resulting method, which we call frequentist, assisted by Bayes, PPI (FAB-PPI), improves over PPI when the observed prediction quality is likely under the prior, while maintaining its frequentist guarantees. Furthermore, when using heavy-tailed priors, FAB-PPI adaptively reverts to standard PPI in low prior probability regions. We demonstrate the benefits of FAB-PPI in real and synthetic examples.
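For orientation, the classical PPI point estimate of a population mean (which FAB-PPI builds on) combines the model's predictions on a large unlabeled set with a bias correction ("rectifier") measured on the small labeled set. The sketch below illustrates that standard estimator only, not the FAB-PPI prior-informed variant; the function name and toy data are illustrative assumptions.

```python
import numpy as np

def ppi_mean(y_labeled, preds_labeled, preds_unlabeled):
    """Classical PPI point estimate of a mean (sketch, not FAB-PPI).

    The unlabeled-set prediction average is debiased by the average
    residual (the "rectifier") observed on the labeled set.
    """
    rectifier = np.mean(y_labeled - preds_labeled)  # measured prediction bias
    return np.mean(preds_unlabeled) + rectifier

# Toy data: the model systematically undershoots by 1 on the labeled set.
y_lab = np.array([2.0, 3.0, 4.0])
f_lab = np.array([1.0, 2.0, 3.0])
f_unlab = np.array([2.0, 2.0, 2.0, 2.0])
print(ppi_mean(y_lab, f_lab, f_unlab))  # 3.0: unlabeled mean 2.0 + bias 1.0
```

FAB-PPI then replaces this fixed debiasing with a Bayes-assisted weighting that depends on a prior over prediction quality.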
Paperid:2970
Authors:Jie Wang · March Boedihardjo · Yao Xie
Abstract:
Optimal transport has been very successful for various machine learning tasks; however, it is known to suffer from the curse of dimensionality. Hence, dimensionality reduction is desirable when applied to high-dimensional data with low-dimensional structures. The kernel max-sliced (KMS) Wasserstein distance is developed for this purpose by finding an optimal nonlinear mapping that reduces data into $1$ dimension before computing the Wasserstein distance. However, its theoretical properties have not yet been fully developed. In this paper, we provide sharp finite-sample guarantees under milder technical assumptions compared with state-of-the-art for the KMS $p$-Wasserstein distance between two empirical distributions with $n$ samples for general $p\in[1,\infty)$. Algorithm-wise, we show that computing the KMS $2$-Wasserstein distance is NP-hard, and then we further propose a semidefinite relaxation (SDR) formulation (which can be solved efficiently in polynomial time) and provide a relaxation gap for the obtained solution. We provide numerical examples to demonstrate the good performance of our scheme for high-dimensional two-sample testing.
Paperid:2971
Authors:Evan Frick · Connor Chen · Joseph Tennyson · Tianle Li · Wei-Lin Chiang · Anastasios Angelopoulos · Ion Stoica
Abstract:
Language model evaluations typically rely on aggregated metrics like accuracy or human preference, averaging across users and prompts. This averaging obscures user- and prompt-specific variations in model performance. To address this, we propose Prompt-to-Leaderboard (P2L), a method that predicts prompt-specific leaderboards by training large language models on human preference data. The core idea is to train LLMs to output the coefficients of parametric regressions that represent per-prompt leaderboards. The resulting prompt-dependent leaderboards facilitate optimal routing of queries to models, personalized leaderboards, unsupervised task-specific performance analysis, and automated evaluation of model strengths and weaknesses. Live results from Chatbot Arena suggest that P2L better captures the nuanced landscape of language model performance than the averaged leaderboard. Furthermore, our findings suggest that P2L's ability to produce prompt-specific evaluations follows a power law scaling similar to that observed in LLMs themselves. In January 2025, the router we trained based on this methodology achieved the #1 spot in the Chatbot Arena leaderboard.
Paperid:2972
Authors:Shuai Yuan · Hongwei Li · Rui Zhang · Hangcheng Cao · Wenbo Jiang · Tao Ni · Wenshu Fan · Qingchuan Zhao · Guowen Xu
Abstract:
Deep learning models employed in face recognition (FR) systems have been shown to be vulnerable to physical adversarial attacks through various modalities, including patches, projections, and infrared radiation. However, existing adversarial examples targeting FR systems often suffer from issues such as conspicuousness, limited effectiveness, and insufficient robustness. To address these challenges, we propose a novel approach for adversarial face generation, UVHat, which utilizes ultraviolet (UV) emitters mounted on a hat to enable invisible and potent attacks in black-box settings. Specifically, UVHat simulates UV light sources via video interpolation and models the positions of these light sources on a curved surface, specifically the human head in our study. To optimize attack performance, UVHat integrates a reinforcement learning-based optimization strategy, which explores a vast parameter search space, encompassing factors such as shooting distance, power, and wavelength. Extensive experimental evaluations validate that UVHat substantially improves the attack success rate in black-box settings, enabling adversarial attacks from multiple angles with enhanced robustness.
Paperid:2973
Authors:Yishan Shen · Yuyang Ye · Hui Xiong · Yong Chen
Abstract:
Dynamic treatment regimes (DTRs) are critical to precision medicine, optimizing long-term outcomes through personalized, real-time decision-making in evolving clinical contexts, but they require careful supervision against unsafe treatment risks. Existing efforts rely primarily on clinician-prescribed gold standards despite the absence of a known optimal strategy, and predominantly use structured EHR data without extracting valuable insights from clinical notes, limiting their reliability for treatment recommendations. In this work, we introduce SAFER, a calibrated, risk-aware tabular-language recommendation framework for DTRs that integrates both structured EHRs and clinical notes, enabling them to learn from each other, and addresses inherent label uncertainty by treating the optimal treatment as ambiguous for deceased patients. Moreover, SAFER employs conformal prediction to provide statistical guarantees, ensuring safe treatment recommendations while filtering out uncertain predictions. Experiments on two publicly available sepsis datasets demonstrate that SAFER outperforms state-of-the-art baselines across multiple recommendation metrics and counterfactual mortality rate, while offering robust formal assurances. These findings underscore SAFER’s potential as a trustworthy and theoretically grounded solution for high-stakes DTR applications.
Paperid:2974
Authors:Ada Tur · Nicholas Meade · Xing Han Lù · Alejandra Zambrano · Arkil Patel · Esin Durmus · Spandana Gella · Karolina Stanczak · Siva Reddy
Abstract:
LLM-based agents are becoming increasingly proficient at solving web-based tasks. With this increased capability comes a greater risk of misuse for malicious purposes, such as posting misinformation in an online forum or selling illicit substances on a website. To evaluate these risks, we propose SafeArena, the first benchmark to focus on the deliberate misuse of web agents. SafeArena comprises 250 safe and 250 harmful tasks across four websites. We classify the harmful tasks into five harm categories (misinformation, illegal activity, harassment, cybercrime, and social bias) designed to assess realistic misuses of web agents. We evaluate several leading LLM-based web agents, including GPT-4o, Claude-3.5 Sonnet, Qwen-2 72B, and Llama-3.2 90B, on our benchmark. We find that agents are surprisingly compliant with malicious requests, with GPT-4o and Qwen-2 completing 22.8% and 26.0% of the harmful tasks, respectively. Furthermore, we show that the susceptibility of web agents to malicious requests can be increased by conditioning the agent on a partially completed malicious task; this increases attack success rates across all agents studied by as much as 18% (>72% relative increase) in some cases. Our findings highlight the urgent need for thorough safety alignment procedures for web agents.
Paperid:2975
Authors:Konrad Mundinger · Max Zimmer · Aldo Kiem · Christoph Spiegel · Sebastian Pokutta
Abstract:
We demonstrate how neural networks can drive mathematical discovery through a case study of the Hadwiger-Nelson problem, a long-standing open problem from discrete geometry and combinatorics about coloring the plane while avoiding monochromatic unit-distance pairs. Using neural networks as approximators, we reformulate this mixed discrete-continuous geometric coloring problem as an optimization task with a probabilistic, differentiable loss function. This enables gradient-based exploration of admissible configurations, which most notably led to the discovery of two novel six-colorings, providing the first improvements in thirty years to the off-diagonal variant of the original problem (Anonymous, 2024). Here, we establish the underlying machine learning approach used to obtain these results and demonstrate its broader applicability through additional results and numerical insights.
Paperid:2976
Authors:Hanyu Wang · Bochuan Cao · Yuanpu Cao · Jinghui Chen
Abstract:
Large language models (LLMs) are known to struggle with consistently generating truthful responses. While various representation intervention techniques have been proposed, these methods typically apply a universal representation correction vector to all input queries, limiting their effectiveness against diverse queries in practice. In this study, we introduce TruthFlow, a novel method that leverages the Flow Matching technique for query-specific truthful representation correction. Specifically, TruthFlow first uses a flow model to learn query-specific correction vectors that transition representations from hallucinated to truthful states. Then, during inference, the trained flow model generates these correction vectors to enhance the truthfulness of LLM outputs. Experimental results demonstrate that TruthFlow significantly improves performance on open-ended generation tasks across various advanced LLMs evaluated on TruthfulQA. Moreover, the trained TruthFlow model exhibits strong transferability, performing effectively on other unseen hallucination benchmarks.
Paperid:2977
Authors:Antoni Kowalczuk · Jan Dubiński · Franziska Boenisch · Adam Dziedzic
Abstract:
Image AutoRegressive generation has emerged as a powerful new paradigm, with image autoregressive models (IARs) surpassing state-of-the-art diffusion models (DMs) in both image quality (FID: 1.48 vs. 1.58) and generation speed. However, the privacy risks associated with IARs remain unexplored, raising concerns regarding their responsible deployment. To address this gap, we conduct a comprehensive privacy analysis of IARs, comparing their privacy risks to those of DMs as reference points. Concretely, we develop a novel membership inference attack (MIA) that achieves a remarkably high success rate in detecting training images (with a TPR@FPR=1\% of 86.38\% vs. 4.91\% for DMs with comparable attacks). We leverage our novel MIA to provide dataset inference (DI) for IARs, and show that it requires as few as 6 samples to detect dataset membership (compared to 200 for DI in DMs), confirming a higher information leakage in IARs. Finally, we are able to extract hundreds of training data points from an IAR (e.g., 698 from VAR-d30). Our results demonstrate a fundamental privacy-utility trade-off: while IARs excel in image generation quality and speed, they are significantly more vulnerable to privacy attacks compared to DMs. This trend suggests that utilizing techniques from DMs within IARs, such as modeling the per-token probability distribution using a diffusion procedure, holds potential to help mitigate IARs' vulnerability to privacy attacks.
Paperid:2978
Authors:Rishabh Tiwari · Haocheng Xi · Aditya Tomar · Coleman Hooper · Sehoon Kim · Maxwell Horton · Mahyar Najibi · Michael Mahoney · Kurt Keutzer · Amir Gholaminejad
Abstract:
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings, creating a growing need for fast and efficient long-context inference. In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency, as the full KV cache must be loaded for each decoding step. While speculative decoding is a widely accepted technique to accelerate autoregressive decoding, existing methods often struggle to achieve significant speedups due to inefficient KV cache optimization strategies and result in low acceptance rates. To address these challenges, we propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration. QuantSpec maintains high acceptance rates ($>$90\%) and reliably provides consistent end-to-end speedups of up to $\sim2.5\times$, outperforming other self-speculative decoding methods that use a sparse KV cache for long-context LLM inference. QuantSpec also reduces memory requirements by $\sim 1.3\times$ compared to these alternatives.
Paperid:2979
Authors:Hui Nie · Zhao Zhang · Yutao Cheng · Maoke Yang · Gonglei Shi · Qingsong Xie · Xinglong Wu · Jie Shao
Abstract:
We introduce a novel task called Layer Decomposition (LD), which focuses on decomposing a composite image (such as a poster) into individual, complete RGBA layers, along with additional metadata such as coordinates and the order of layers. This is particularly difficult because layers can hide parts of one another, and unraveling these complexities is essential for accurate restoration. LD transforms an image into structured data, offering significant benefits for the creation and collection of digital materials, as well as enabling precise image modification. We develop the Decompose Layer Model (DeaM), a Large Multimodal Model. DeaM integrates a conjoined visual encoder, a language model, and a condition-aware RGBA decoder. When processing an image, DeaM generates the metadata for each layer in the form of a JSON object. This metadata includes quantization encodings that represent each layer. These encodings are then used by the RGBA decoder to reconstruct the layer with its transparent channel intact. In addition to automatically decomposing all layers, DeaM can also receive text or point prompts to decompose specific layers. We prompt GPT-4 to collect instruction-tuning data for open queries.
Paperid:2980
Authors:Yong Liu · Di Fu · Shenggan Cheng · Zirui Zhu · Yang Luo · Minhao Cheng · Cho-Jui Hsieh · Yang You
Abstract:
Despite Low-Rank Adaptation (LoRA)'s popularity for fine-tuning large models, it often exhibits a noticeable performance gap compared to full fine-tuning, particularly in complex tasks such as mathematical reasoning and code generation. Motivated by this discrepancy, we propose a novel fusion approach for LoRA fine-tuned models. Our key insight is that LoRA models trained with different random seeds on the same task often exhibit complementary strengths. In contrast to existing research that typically focuses on fusing models trained on diverse tasks, we explore the potential of combining multiple LoRA models fine-tuned on the same task with different random seeds. This intra-task fusion method aims to leverage the strengths of various fine-tuned models to create a more robust and effective adaptation. To validate our approach, we conducted comprehensive experiments across three key areas: mathematical reasoning, code generation, and general instruction-tuning tasks. The results demonstrate that our fusion method significantly enhances LoRA's performance, outperforming both standalone LoRA models and current fusion methods. Notably, this advancement substantially narrows the gap between LoRA and full fine-tuning, thus offering a more effective approach to model adaptation without the GPU memory burden of full parameter fine-tuning.
Paperid:2981
Authors:Nannan Wu · Yuming Huang · Yiming Zhao · Jie Chen · Wenjun Wang
Abstract:
Subgraph representation learning has attracted growing interest due to its wide applications in various domains. However, existing methods primarily focus on local neighborhood structures while overlooking the significant impact of global structural information, in particular the influence of multi-hop neighbors beyond immediate neighborhoods. This presents two key challenges: how to effectively capture the structural relationships between distant nodes, and how to prevent excessive aggregation of global structural information from weakening the discriminative ability of subgraph representations. To address these challenges, we propose GPEN (Global Position Encoding Network). GPEN leverages a hierarchical tree structure to encode each node's global position based on its path distance to the root node, enabling a systematic way to capture relationships between distant nodes. Furthermore, we introduce a boundary-aware convolution module that selectively integrates global structural information while maintaining the unique structural patterns of each subgraph. Extensive experiments on eight public datasets show that GPEN significantly outperforms state-of-the-art methods in subgraph representation learning.
Paperid:2982
Authors:Anders Aamand · Justin Chen · Mina Dalirrooyfard · Slobodan Mitrovic · Yuriy Nevmyvaka · Sandeep Silwal · Yinzhan Xu
Abstract:
We study differentially private algorithms for graph cut sparsification, a fundamental problem in algorithms, privacy, and machine learning. While significant progress has been made, the best-known private and efficient cut sparsifiers on $n$-node graphs approximate each cut within $\widetilde{O}(n^{1.5})$ additive error and $1+\gamma$ multiplicative error for any $\gamma > 0$ [Gupta, Roth, Ullman TCC'12]. In contrast, \emph{inefficient} algorithms, i.e., those requiring exponential time, can achieve an $\widetilde{O}(n)$ additive error and $1+\gamma$ multiplicative error [Eli{\'a}{\v{s}}, Kapralov, Kulkarni, Lee SODA'20]. In this work, we break the $n^{1.5}$ additive error barrier for private and efficient cut sparsification. We present an $(\varepsilon,\delta)$-DP polynomial time algorithm that, given a non-negative weighted graph, outputs a private synthetic graph approximating all cuts with multiplicative error $1+\gamma$ and additive error $n^{1.25 + o(1)}$ (ignoring dependencies on $\varepsilon, \delta, \gamma$). At the heart of our approach lies a private algorithm for expander decomposition, a popular and powerful technique in (non-private) graph algorithms.
Paperid:2983
Authors:François Cornet · Federico Bergamin · Arghya Bhowmik · Juan Garcia-Lastra · Jes Frellsen · Mikkel Schmidt
Abstract:
Generative modeling of crystalline materials using diffusion models presents a series of challenges: the data distribution is characterized by inherent symmetries and involves multiple modalities, with some defined on specific manifolds. Notably, the treatment of fractional coordinates representing atomic positions in the unit cell requires careful consideration, as they lie on a hypertorus. In this work, we introduce Kinetic Langevin Diffusion for Materials (KLDM), a novel diffusion model for crystalline materials generation, where the key innovation resides in the modeling of the coordinates. Instead of resorting to Riemannian diffusion on the hypertorus directly, we generalize Trivialized Diffusion Models (TDM) to account for the symmetries inherent to crystals. By coupling coordinates with auxiliary Euclidean variables representing velocities, the diffusion process is now offset to a flat space. This allows us to effectively perform diffusion on the hypertorus without approximation, while providing a training objective consistent with the periodic translation symmetry of the true data distribution. We evaluate KLDM on both Crystal Structure Prediction (CSP) and De-novo Generation (DNG) tasks, demonstrating its competitive performance with current state-of-the-art models.
Paperid:2984
Authors:Lorenz Kummer · Samir Moustafa · Anatol Ehrlich · Franka Bause · Nikolaus Suess · Wilfried Gansterer · Nils M. Kriege
Abstract:
The lottery ticket hypothesis (LTH) is well-studied for convolutional neural networks but has been validated only empirically for graph neural networks (GNNs), for which theoretical findings are largely lacking. In this paper, we identify the expressivity of sparse subnetworks, i.e., their ability to distinguish non-isomorphic graphs, as crucial for finding winning tickets that preserve predictive performance. We establish conditions under which the expressivity of a sparsely initialized GNN matches that of the full network, particularly when compared to the Weisfeiler-Leman test, and subsequently show that increased expressivity in the initialization potentially accelerates model convergence and improves generalization. Our findings establish novel theoretical foundations for both LTH and GNN research, highlighting the importance of maintaining expressivity in sparsely initialized GNNs. We illustrate our results using examples from drug discovery.
Paperid:2985
Authors:Ziming Zhu · Yu Zhu · Jiahao Chen · Xiaofeng Ling · Huanlei Chen · Lihua Sun
Abstract:
Recently, image-based 3D semantic occupancy prediction has become a hot topic in 3D scene understanding for autonomous driving. Compared with the bounding box form of 3D object detection, the ability to describe the fine-grained contours of any obstacles in the scene is the key insight of voxel occupancy representation, which facilitates subsequent tasks of autonomous driving. In this work, we propose CSV-Occ to address the following two challenges: (1) Existing methods fuse temporal information based on the attention mechanism, but are limited by high complexity. We extend the state space model to support multi-input sequence interaction and conduct temporal modeling in a cascaded architecture, thereby reducing the computational complexity from quadratic to linear. (2) Existing methods are limited by semantic ambiguity, resulting in the centers of foreground objects often being predicted as empty voxels. We enable the model to explicitly vote for the instance center to which the voxels belong and spontaneously learn to utilize the other voxel features of the same instance to update the semantics of the internal vacancies of the objects from coarse to fine. Experiments on the Occ3D-nuScenes dataset show that our method achieves state-of-the-art in camera-based 3D semantic occupancy prediction and also performs well on lidar point cloud semantic segmentation on the nuScenes dataset. Therefore, we believe that CSV-Occ is beneficial to the community and industry of autonomous vehicles.
Paperid:2986
Authors:Aaron Wang · William Convertino · Xiang Cheng · Ricardo Henao · Lawrence Carin
Abstract:
In-context learning based on attention models is examined for data with categorical outcomes, with inference in such models viewed from the perspective of functional gradient descent (GD). We develop a network composed of attention blocks, with each block employing a self-attention layer followed by a cross-attention layer, with associated skip connections. This model can exactly perform multi-step functional GD inference for in-context inference with categorical observations. We perform a theoretical analysis of this setup, generalizing many prior assumptions in this line of work, including the class of attention mechanisms for which it is appropriate. We demonstrate the framework empirically on synthetic data, image classification and language generation.
Paperid:2987
Authors:Xuandong Zhao · Xianjun Yang · Tianyu Pang · Chao Du · Lei Li · Yu-Xiang Wang · William Wang
Abstract:
Large language models (LLMs) are vulnerable to jailbreak attacks that result in harmful, unethical, or biased text generations. However, existing jailbreaking methods are computationally costly. In this paper, we propose the weak-to-strong jailbreaking attack, an efficient inference-time attack on aligned LLMs to produce harmful text. Our key intuition is based on the observation that jailbroken and aligned models differ only in their initial decoding distributions. The weak-to-strong attack's key technical insight is using two smaller models (a safe and an unsafe one) to adversarially modify a significantly larger safe model's decoding probabilities. We evaluate the weak-to-strong attack on 5 diverse open-source LLMs from 3 organizations. The results show our method can increase the misalignment rate to over 99\% on two datasets with just one forward pass per example. Our study exposes an urgent safety issue that needs to be addressed when aligning LLMs. As an initial attempt, we propose a defense strategy to protect against such attacks, but creating more advanced defenses remains challenging. We provide the code in the supplementary materials.
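The general flavor of such logit-space steering can be sketched as follows: the large model's next-token logits are shifted by the gap between a small unsafe model and a small safe model. This is a hedged illustration of the idea, not the paper's exact formulation; the function name, the additive form, and the `alpha` scaling are assumptions made for the sketch.

```python
import numpy as np

def steer_logits(strong, weak_safe, weak_unsafe, alpha=1.0):
    # Sketch: shift the large aligned model's next-token logits by the
    # (unsafe - safe) gap measured on two small models sharing the vocab.
    return strong + alpha * (weak_unsafe - weak_safe)

strong = np.array([2.0, 0.0, -1.0])      # large aligned model's logits
weak_safe = np.array([1.0, 0.0, 0.0])    # small aligned model's logits
weak_unsafe = np.array([0.0, 1.0, 0.0])  # small unaligned model's logits
print(steer_logits(strong, weak_safe, weak_unsafe))  # [ 1.  1. -1.]
```

In this toy example, the token favored by the unsafe weak model gains probability mass in the steered distribution while the safe-model-preferred token loses it.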
Paperid:2988
Authors:Pouria Fatemi · Ehsan Sharifian · Mohammad Hossein Yassaee
Abstract:
Counterfactual explanations enhance interpretability by identifying alternative inputs that produce different outputs, offering localized insights into model decisions. However, traditional methods often neglect causal relationships, leading to unrealistic examples. While newer approaches integrate causality, they are computationally expensive. To address these challenges, we propose an efficient method based on backtracking counterfactuals that incorporates causal reasoning to generate actionable explanations. We first examine the limitations of existing methods and then introduce our novel approach and its features. We also explore the relationship between our method and previous techniques, demonstrating that it generalizes them in specific scenarios. Finally, experiments show that our method provides deeper insights into model outputs.
Paperid:2989
Authors:Zhiyang Wang · Juan Cervino · Alejandro Ribeiro
Abstract:
Graph Neural Networks (GNNs) extend convolutional neural networks to operate on graphs. Despite their impressive performance in various graph learning tasks, the theoretical understanding of their generalization capability is still lacking. Previous GNN generalization bounds ignore the underlying graph structures, often leading to bounds that increase with the number of nodes – a behavior contrary to the one experienced in practice. In this paper, we take a manifold perspective to establish the statistical generalization theory of GNNs on graphs sampled from a manifold in the spectral domain. Consistent with empirical observations, we prove that the generalization bounds of GNNs decrease linearly with the logarithm of the graph size, and increase linearly with the spectral continuity constants of the filter functions. Notably, our theory covers both node-level and graph-level tasks. Our result has two implications: i) guaranteeing the generalization of GNNs to unseen data over manifolds; ii) providing insights into the practical design of GNNs, i.e., restrictions on the discriminability of GNNs are necessary to obtain better generalization performance. We demonstrate our generalization bounds for GNNs using synthetic and multiple real-world datasets.
Paperid:2990
Authors:Zhuowei Li · Haizhou Shi · Yunhe Gao · Di Liu · Zhenting Wang · Yuxiao Chen · Ting Liu · Long Zhao · Hao Wang · Dimitris Metaxas
Abstract:
Large Vision-Language Models (LVLMs) can reason effectively over both textual and visual inputs, but they tend to hallucinate syntactically coherent yet visually ungrounded content. In this paper, we investigate the internal dynamics of hallucination by examining the rankings of token logits throughout the generation process, revealing three key patterns in how LVLMs process information: (1) gradual visual information loss: visually grounded tokens gradually become less favored throughout generation; (2) early excitation: semantically meaningful tokens achieve peak activation in layers earlier than the final layer; and (3) hidden genuine information: visually grounded tokens, though not ultimately decoded, still retain relatively high rankings at inference. Based on these insights, we propose VISTA (Visual Information Steering with Token-logit Augmentation), a training-free inference-time intervention framework that reduces hallucination while promoting genuine information. VISTA works by combining two complementary approaches: reinforcing visual information in activation space and leveraging early-layer activations to promote semantically meaningful decoding. Compared to existing methods, VISTA requires no external supervision and is applicable to various decoding strategies. Extensive experiments show that VISTA reduces hallucination by about 40% on average on the evaluated open-ended generation task, and it consistently outperforms existing methods on four benchmarks across four architectures under three decoding strategies.
Paperid:2991
Authors:Zian Li · Cai Zhou · Xiyuan Wang · Xingang Peng · Muhan Zhang
Abstract:
Recent advancements in molecular generative models have demonstrated substantial potential in accelerating scientific discovery, particularly in drug design. However, these models often face challenges in generating high-quality molecules, especially in conditional scenarios where specific molecular properties must be satisfied. In this work, we introduce GeoRCG, a general framework to enhance the performance of molecular generative models by integrating geometric representation conditions with provable theoretical guarantees. We decompose the molecule generation process into two stages: first, generating an informative geometric representation; second, generating a molecule conditioned on the representation. Compared to directly generating a molecule, the relatively easy-to-generate representation in the first stage guides the second-stage generation to reach a high-quality molecule in a more goal-oriented and much faster way. Leveraging EDM and SemlaFlow as the base generators, we observe significant quality improvements in unconditional molecule generation tasks on the widely-used QM9 and GEOM-DRUG datasets. More notably, in the challenging conditional molecular generation task, our framework achieves an average 31% performance improvement over state-of-the-art approaches, highlighting the superiority of conditioning on semantically rich geometric representations over conditioning on individual property values as in previous approaches. Furthermore, we show that, with such representation guidance, the number of diffusion steps can be reduced to as small as 100 while largely preserving the generation quality achieved with 1,000 steps, thereby significantly accelerating the generation process.
Paperid:2992
Authors:Grace Luo · Trevor Darrell · Amir Bar
Abstract:
Autoregressive vision-language models (VLMs) can handle many tasks within a single model, yet the representations that enable this capability remain opaque. We find that VLMs align conceptually equivalent inputs into a shared task vector, which is invariant to modality (text, image) and format (examples, instruction), and may simplify VLM processing. We measure this alignment via cross-modal transfer--the ability of a task vector derived in one modality to trigger the correct generation in another--on a range of tasks and model architectures. Although the task vector is highly compressed, we find that this single vector outperforms prompting the model with the full task information, a result unique to this cross-modal case. Furthermore, we show that task vectors can be transferred from a base language model to its fine-tuned vision-language counterpart, and that they can be derived solely from instructions without the need for examples. Taken together, our findings shed light on how VLMs internally process task information, and how they map different modalities into common semantic representations.
Paperid:2993
Authors:Jintao Zhang · Chendong Xiang · Haofeng Huang · Jia wei · Haocheng Xi · Jun Zhu · Jianfei Chen
Abstract:
An efficient attention implementation is essential for large models due to its quadratic time complexity. Fortunately, attention commonly exhibits sparsity, i.e., many values in the attention map are near zero, allowing for the omission of corresponding computations. Many studies have utilized sparse patterns to accelerate attention. However, most existing works focus on optimizing attention within specific models by exploiting certain sparse patterns of the attention map. A universal sparse attention that guarantees both the speedup and end-to-end performance of diverse models remains elusive. In this paper, we propose SpargeAttn, a universal sparse and quantized attention for any model. Our method uses a two-stage online filter: in the first stage, we rapidly and accurately predict the attention map, enabling some matrix multiplications in attention to be skipped. In the second stage, we design an online softmax-aware filter that incurs no extra overhead and further skips some matrix multiplications. Experiments show that our method significantly accelerates diverse models, including language, image, and video generation models, without sacrificing end-to-end metrics.
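As a rough illustration of the stage-1 idea (not the paper's actual predictor), one can score each query-block/key-block pair cheaply via block-mean inner products and skip the attention matmuls for low-scoring pairs. The function name, block layout, and `keep_ratio` below are all hypothetical:

```python
def predict_sparse_blocks(q_blocks, k_blocks, keep_ratio=0.5):
    """Cheap sparsity prediction in the spirit of SpargeAttn's first stage:
    score (query-block, key-block) pairs by the inner product of their mean
    vectors and keep only the top fraction; the full matmuls for the other
    pairs can then be skipped. Purely illustrative -- the real predictor and
    the softmax-aware second-stage filter are more involved."""
    def block_mean(block):
        n = len(block)
        return [sum(row[d] for row in block) / n for d in range(len(block[0]))]

    scores = {}
    for qi, qb in enumerate(q_blocks):
        qm = block_mean(qb)
        for ki, kb in enumerate(k_blocks):
            km = block_mean(kb)
            scores[(qi, ki)] = sum(a * b for a, b in zip(qm, km))
    n_keep = max(1, round(len(scores) * keep_ratio))
    # keep the highest-scoring block pairs; the rest are pruned
    return set(sorted(scores, key=scores.get, reverse=True)[:n_keep])

# one query block clearly aligned with the first key block
q_blocks = [[[1.0, 0.0], [1.0, 0.0]]]
k_blocks = [[[1.0, 0.0], [1.0, 0.0]],   # aligned with the query -> kept
            [[0.0, 1.0], [0.0, 1.0]]]   # orthogonal -> pruned
kept = predict_sparse_blocks(q_blocks, k_blocks, keep_ratio=0.5)
```

With `keep_ratio=0.5`, only the aligned pair `(0, 0)` survives, so half of the block matmuls would be skipped in this toy case.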
Paperid:2994
Authors:Chris Pedersen · Laure Zanna · Joan Bruna
Abstract:
Autoregressive surrogate models (or emulators) of spatiotemporal systems provide an avenue for fast, approximate predictions, with broad applications across science and engineering. At inference time, however, these models are generally unable to provide predictions over long time rollouts due to accumulation of errors leading to diverging trajectories. In essence, emulators operate out of distribution, and controlling the online distribution quickly becomes intractable in large-scale settings. To address this fundamental issue, and focusing on time-stationary systems admitting an invariant measure, we leverage diffusion models to obtain an implicit estimator of the score of this invariant measure. We show that this model of the score function can be used to stabilize autoregressive emulator rollouts by applying on-the-fly denoising during inference, a process we call thermalization. Thermalizing an emulator rollout is shown to extend the time horizon of stable predictions by an order of magnitude in complex systems exhibiting turbulent and chaotic behavior, opening up a novel application of diffusion models in the context of neural emulation.
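The thermalization loop can be sketched with toy scalar dynamics. Here `emulator` and `denoise` are illustrative stand-ins for the neural emulator and the diffusion-based denoiser, and the `every=5` schedule is an assumption of this sketch:

```python
def thermalized_rollout(x0, emulator, denoise, steps, every=5):
    """Interleave on-the-fly denoising with autoregressive emulator steps
    (sketch). The denoiser periodically pulls the rollout back toward the
    invariant measure, preventing error accumulation from diverging."""
    x, xs = x0, [x0]
    for t in range(1, steps + 1):
        x = emulator(x)          # one autoregressive prediction step
        if t % every == 0:
            x = denoise(x)       # "thermalization" step
        xs.append(x)
    return xs

# toy system: the emulator drifts away from the attractor at 0,
# while the denoiser contracts states back toward it
emulator = lambda x: 1.3 * x
denoise = lambda x: 0.1 * x

plain = [1.0]
for _ in range(20):
    plain.append(emulator(plain[-1]))          # diverges geometrically
stabilized = thermalized_rollout(1.0, emulator, denoise, steps=20)
```

In this toy, the plain rollout blows up (1.3^20 ≈ 190) while the thermalized one stays bounded, mirroring the paper's qualitative claim.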
Paperid:2995
Authors:Guanzheng Chen · Qilong Feng · Jinjie Ni · Xin Li · Michael Shieh
Abstract:
The emergence of long-context large language models (LLMs) offers a promising alternative to traditional retrieval-augmented generation (RAG) for processing extensive documents. However, the computational overhead of long-context inference, particularly in managing key-value (KV) caches, presents significant efficiency challenges. While Speculative Decoding (SD) traditionally accelerates inference using smaller draft models, its effectiveness diminishes substantially in long-context scenarios due to memory-bound KV cache operations. We present Retrieval-Augmented Speculative Decoding (RAPID), which leverages RAG for both accelerating and enhancing generation quality in long-context inference. RAPID introduces the RAG drafter—a draft LLM operating on shortened retrieval contexts—to speculate on the generation of long-context target LLMs. Our approach enables a new paradigm where same-scale or even larger LLMs can serve as RAG drafters while maintaining computational efficiency. To fully leverage the potentially superior capabilities from stronger RAG drafters, we develop an inference-time knowledge transfer dynamic that enriches the target distribution by RAG. Extensive experiments on the LLaMA-3.1 and Qwen2.5 backbones demonstrate that RAPID effectively integrates the strengths of both approaches, achieving significant performance improvements (e.g., from 39.33 to 42.83 on InfiniteBench for LLaMA-3.1-8B) with more than 2$\times$ speedups. Our analyses reveal that RAPID achieves robust acceleration beyond 32K context length and demonstrates superior generation quality in real-world applications.
Paperid:2996
Authors:Xialie Zhuang · Zhikai Jia · Jianjin Li · Zhenyu Zhang · Li Shen · Zheng Cao · Shiwei Liu
Abstract:
Large Language Models (LLMs) have been found to struggle with accurately retrieving key information. To address this, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) to enhance the latter's in-context retrieval capabilities. Specifically, MEAP first randomly masks a small fraction of input tokens and then directly performs standard autoregressive next-token prediction using a decoder-only Transformer. MEAP eliminates the need for bidirectional attention or encoder-decoder architectures for MLM, incurring no additional computational overhead during pre-training or inference. Extensive experiments demonstrate that MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks, while performing on par or better on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning, where it shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. Our analysis indicates that MEAP’s effectiveness arises from its ability to promote more distinguishable attention scores by concentrating on a reduced set of non-masked tokens. This mechanism improves the model’s focus on task-relevant signals while mitigating the influence of peripheral context. These findings position MEAP as a promising training paradigm for large language models. Code has been submitted.
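The input construction described above can be sketched in a few lines: mask a fraction of tokens, but keep the objective as plain next-token prediction against the clean sequence. `MASK_ID`, the function name, and the default ratio are illustrative, not from the paper:

```python
import random

MASK_ID = -1  # hypothetical id of the [MASK] token

def meap_batch(tokens, mask_ratio=0.15, seed=0):
    """MEAP-style training pair (sketch): randomly mask a small fraction of
    the input tokens, then train with standard autoregressive next-token
    prediction -- the decoder predicts the ORIGINAL token t+1 from the
    (possibly masked) prefix; no encoder or bidirectional attention."""
    rng = random.Random(seed)
    masked = [MASK_ID if rng.random() < mask_ratio else t for t in tokens]
    inputs = masked[:-1]   # what the decoder-only model sees
    targets = tokens[1:]   # supervision is always the clean sequence
    return inputs, targets

inputs, targets = meap_batch([5, 6, 7, 8, 9], mask_ratio=0.0)
```

With `mask_ratio=0.0` this reduces exactly to vanilla NTP, which is why the paradigm adds no architectural changes or inference-time cost.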
Paperid:2997
Authors:Seungho Baek · Taegeon Park · Jongchan Park · Seungjun Oh · Yusung Kim
Abstract:
Existing offline hierarchical reinforcement learning (HRL) methods rely on high-level policy learning to generate subgoal sequences, but their efficiency degrades as task horizons increase, and they lack effective strategies for stitching useful state transitions across different trajectories. We propose Graph-Assisted Stitching (GAS), a novel framework that formulates subgoal selection as a graph search problem rather than learning an explicit high-level policy. By embedding states into a Temporal Distance representation space, GAS clusters semantically similar states from different trajectories into unified graph nodes, enabling efficient transition stitching. A shortest-path algorithm is then applied to select subgoal sequences within the graph, while a low-level policy learns to reach the subgoals. To improve graph quality, we introduce the Temporal Efficiency metric, which filters out noisy or inefficient transition states, significantly enhancing task performance. GAS outperforms prior hierarchical RL methods across locomotion, navigation, and manipulation tasks, achieving up to an 83.6 percentage-point improvement in the most stitching-critical task. Our source code is included in the supplementary material.
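The stitching-via-graph-search idea can be sketched as follows. Here `cluster_of` stands in for the learned temporal-distance clustering, and the state/cluster names are invented for illustration:

```python
from collections import defaultdict, deque

def stitch_subgoals(transitions, cluster_of, start, goal):
    """Graph-Assisted Stitching sketch: states from different trajectories
    that fall into the same cluster share a graph node, so transitions are
    stitched across trajectories, and subgoal selection becomes a
    shortest-path query instead of a learned high-level policy."""
    graph = defaultdict(set)
    for s, s_next in transitions:
        a, b = cluster_of[s], cluster_of[s_next]
        if a != b:
            graph[a].add(b)
    prev = {start: None}
    queue = deque([start])
    while queue:                      # BFS = shortest path with unit costs
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]         # subgoal sequence for the low-level policy
        for nxt in sorted(graph[node]):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None                       # goal unreachable in the graph

# two trajectories that only reach the goal once stitched through cluster C
transitions = [("s0", "s1"), ("s1", "s2"),      # trajectory 1 ends in C
               ("s3", "s4"), ("s4", "s5")]      # trajectory 2 starts in C
cluster_of = {"s0": "A", "s1": "B", "s2": "C",
              "s3": "C", "s4": "D", "s5": "E"}
subgoals = stitch_subgoals(transitions, cluster_of, "A", "E")
```

Neither trajectory alone connects A to E; merging `s2` and `s3` into cluster C is what makes the stitched path exist.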
Paperid:2998
Authors:Xingjian Wu · Xiangfei Qiu · Hongfan Gao · Jilin Hu · Chenjuan Guo · Bin Yang
Abstract:
Probabilistic Time Series Forecasting (PTSF) plays a crucial role in decision-making across various fields, including economics, energy, and transportation. Most existing methods excel at short-term forecasting, while overlooking the hurdles of Long-term Probabilistic Time Series Forecasting (LPTSF). As the forecast horizon extends, the inherent nonlinear dynamics have a significant adverse effect on prediction accuracy, and make generative models inefficient by increasing the cost of each iteration. To overcome these limitations, we introduce $K^2$VAE, an efficient VAE-based generative model that leverages a KoopmanNet to transform nonlinear time series into a linear dynamical system, and devises a KalmanNet to refine predictions and model uncertainty in such a linear system, which reduces error accumulation in long-term forecasting. Extensive experiments demonstrate that $K^2$VAE outperforms state-of-the-art methods in both short- and long-term PTSF, providing a more efficient and accurate solution.
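The Koopman idea underlying the KoopmanNet can be illustrated in its simplest scalar form: fit a linear operator that best explains one-step dynamics. The paper learns a lifted, multivariate operator with a neural network; the least-squares fit below is only the degenerate 1-D special case:

```python
def fit_koopman_1d(series):
    """Fit the scalar linear operator a minimizing sum (x_{t+1} - a*x_t)^2
    over a trajectory, i.e. represent the one-step dynamics as a linear
    system x_{t+1} = a*x_t. This is the simplest instance of the Koopman
    linearization that K^2VAE performs in a learned latent space."""
    num = sum(x * y for x, y in zip(series[:-1], series[1:]))
    den = sum(x * x for x in series[:-1])
    return num / den

# a trajectory that is exactly linear in this sense is recovered perfectly
xs = [1.0 * 0.9 ** t for t in range(10)]
a = fit_koopman_1d(xs)
```

Once the dynamics are (approximately) linear, long-horizon prediction reduces to repeated application of the operator, which is where Kalman-style refinement of predictions and uncertainty becomes natural.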
Paperid:2999
Authors:Tian-Zuo Wang · Wen-Bo Du · Zhi-Hua Zhou
Abstract:
In causality studies, the maximal ancestral graph (MAG) is widely used to characterize the causal relations among observable variables in the presence of latent variables. However, given observational data, only a partial ancestral graph representing a Markov equivalence class (MEC) of MAGs is identifiable, which generally contains uncertain causal relations. Due to the uncertainties, \emph{MAG listing}, \emph{i.e.}, listing all the MAGs in the MEC, is critical for many downstream tasks. In this paper, we present the first \emph{polynomial-delay} MAG listing method, where delay refers to the time for outputting each MAG, by introducing enumerated structural knowledge in the form of \emph{singleton background knowledge (BK)}. To incorporate such knowledge, we propose \emph{sound} and \emph{locally complete} orientation rules. By recursively introducing singleton BK and applying the rules, our method can output all and only the MAGs in the MEC with polynomial delay. Additionally, while the proposed novel rules enable a more efficient MAG listing process, for the goal of incorporating BK, we reveal that existing rules, including ours, are not yet \emph{complete}, which motivates further research. Experimental results validate the efficiency of the proposed MAG listing method.
Paperid:3000
Authors:Vladimir Braverman · Prathamesh Dharangutte · Shaofeng Jiang · Hoai-An Nguyen · Chen Wang · Yubo Zhang · Samson Zhou
Abstract:
We study fair clustering problems in a setting where distance information is obtained from two sources: a strong oracle providing exact distances, but at a high cost, and a weak oracle providing potentially inaccurate distance estimates at a low cost. The goal is to produce a near-optimal *fair* clustering on $n$ input points with a minimum number of strong oracle queries. This models the increasingly common trade-off between accurate but expensive similarity measures (e.g., large-scale embeddings) and cheaper but inaccurate alternatives. The study of fair clustering in this model is motivated by the important quest of achieving fairness in the presence of inaccurate information. We achieve the first $(1+\varepsilon)$-coresets for fair $k$-median clustering using $\text{poly}\left(\frac{k}{\varepsilon}\cdot\log n\right)$ queries to the strong oracle. Furthermore, our results imply coresets for the standard setting (without fairness constraints), and we can in fact obtain $(1+\varepsilon)$-coresets for $(k,z)$-clustering for general $z=O(1)$ with a similar number of strong oracle queries. By contrast, previous results achieved a constant-factor $(>10)$ approximation for the standard $k$-clustering problems, and no previous work considered the fair $k$-median clustering problem.
Paperid:3001
Authors:Jan Ludziejewski · Maciej Pióro · Jakub Krajewski · Michał Krutul · Jan Małaśnicki · Maciej Stefaniak · Piotr Sankowski · Marek Cygan · Kamil Adamczewski · Piotr Milos · Sebastian Jaszczur
Abstract:
Mixture of Experts (MoE) architectures have significantly increased computational efficiency in both research and real-world applications of large-scale machine learning models. However, their scalability and efficiency under memory constraints remain relatively underexplored. In this work, we present joint scaling laws for dense and MoE models, incorporating key factors such as the number of active parameters, dataset size, and the number of experts. Our findings provide a principled framework for selecting the optimal MoE configuration under fixed memory and compute budgets. Surprisingly, we show that MoE models can be more memory-efficient than dense models, contradicting conventional wisdom. Extensive empirical validation confirms the theoretical predictions of our scaling laws. These results offer actionable insights for designing and deploying MoE models in practical large-scale training scenarios.
Paperid:3002
Authors:Won-Jun Jang · Hyeon-Seo Park · Si-Hyeon Lee
Abstract:
Federated ensemble distillation addresses client heterogeneity by generating pseudo-labels for an unlabeled server dataset based on client predictions and training the server model using the pseudo-labeled dataset. The unlabeled server dataset can either be pre-existing or generated through a data-free approach. The effectiveness of this approach critically depends on the method of assigning weights to client predictions when creating pseudo-labels, especially in highly heterogeneous settings. Inspired by theoretical results from GANs, we propose a provably near-optimal weighting method that leverages client discriminators trained with a server-distributed generator and local datasets. Our experiments on various image classification tasks demonstrate that the proposed method significantly outperforms baselines. Furthermore, we show that the additional communication cost, client-side privacy leakage, and client-side computational overhead introduced by our method are negligible, both in scenarios with and without a pre-existing server dataset.
Paperid:3003
Authors:Chenxiao Yang · Nati Srebro · David McAllester · Zhiyuan Li
Abstract:
We introduce PENCIL to address the fundamental write-only limitation of CoT by incorporating a reduction mechanism into the generation process, which allows the model to automatically clean up intermediate computations that would otherwise accumulate indefinitely in the context. With the reduction mechanism, the maximal context length during generation is reduced from the time complexity of solving the problem, which is often exponential for hard tasks, to the actual space required, which is polynomial. By managing space efficiently, PENCIL can generate longer thoughts using small memory and thus scale up to larger problems with more inference time. Notably, PENCIL achieves 97% accuracy on the challenging Einstein's puzzle using a 25M-parameter transformer with a 2048 context length and an average inference time of 43 seconds. Theoretically, we show PENCIL can perform universal computation by simulating a Turing machine with optimal time and space complexity.
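A toy version of such a reduction mechanism can be written as a rewriting rule over the token stream: once a piece of scratch work closes, its body is erased from the context. The special-token names and the exact rule here are illustrative, not necessarily the paper's:

```python
def pencil_reduce(tokens, sep="[SEP]", ret="[RETURN]"):
    """Toy reduction in the spirit of PENCIL: scratch work opened by [SEP]
    is erased from the context when the matching [RETURN] is generated,
    so only tokens produced outside the scratch region remain. Supports
    nested scratch regions via a stack of open positions."""
    out, opens = [], []
    for tok in tokens:
        if tok == sep:
            opens.append(len(out))       # remember where scratch work began
        elif tok == ret:
            del out[opens.pop():]        # drop the finished intermediate steps
        else:
            out.append(tok)
    return out

# the context after generation keeps only the answer, not the thought
ctx = pencil_reduce(["Q", "[SEP]", "t1", "t2", "t3", "[RETURN]", "A"])
```

The live context length is bounded by the deepest unreduced scratch region rather than the total number of generated tokens, which is the time-vs-space distinction the abstract describes.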
Paperid:3004
Authors:Chengzu Li · Wenshan Wu · Huanyu Zhang · Yan Xia · Shaoguang Mao · Li Dong · Ivan Vulić · Furu Wei
Abstract:
Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Yet, it struggles in complex spatial reasoning tasks. In contrast, human cognition extends beyond language alone, enabling the remarkable capability to think in both words and images. Inspired by this mechanism, we propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces. To ensure high-quality visualization, we introduce a token discrepancy loss into autoregressive MLLMs. This innovation significantly improves both visual coherence and fidelity. We validate this approach through several dynamic spatial reasoning tasks. Experimental results reveal that MVoT demonstrates competitive performance across tasks. Moreover, it exhibits robust and reliable improvements in the most challenging scenarios where CoT fails. Ultimately, MVoT establishes new possibilities for complex reasoning tasks where visual thinking can effectively complement verbal reasoning.
Paperid:3005
Authors:Zizheng Huang · Haoxing Chen · Jiaqi Li · jun lan · Huijia Zhu · Weiqiang Wang · Limin Wang
Abstract:
Recent Vision Mamba (Vim) models exhibit nearly linear complexity in sequence length, making them highly attractive for processing visual data. However, their training methodologies and potential are still not sufficiently explored. In this paper, we investigate training strategies for Vim and propose Stochastic Layer-Wise Shuffle (SLWS), a novel regularization method that can effectively improve Vim training. Without architectural modifications, this approach enables the non-hierarchical Vim to achieve leading performance on ImageNet-1K compared with counterparts of a similar type. Our method operates through four simple steps per layer: probability allocation to assign layer-dependent shuffle rates, operation sampling via Bernoulli trials, sequence shuffling of input tokens, and order restoration of outputs. SLWS distinguishes itself through three principles: (1) Plug-and-play: no architectural modifications are needed, and it is deactivated during inference. (2) Simple but effective: the four-step process introduces only random permutations and negligible overhead. (3) Intuitive design: shuffling probabilities grow linearly with layer depth, aligning with the hierarchical semantic abstraction in vision models. Our work underscores the importance of tailored training strategies for Vim models and provides a helpful way to explore their scalability.
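The four per-layer steps can be sketched directly; `max_prob` is an assumed hyperparameter of this sketch, and the layer is reduced to pure token bookkeeping (the real method shuffles the inputs to a Mamba block):

```python
import random

def slws_layer(tokens, layer_idx, num_layers, max_prob=0.5, rng=None):
    """One Stochastic Layer-Wise Shuffle step (sketch). Returns the
    (possibly shuffled) tokens plus the inverse permutation needed to
    restore the original order of the layer's outputs."""
    rng = rng or random.Random(0)
    # (1) probability allocation: shuffle rate grows linearly with depth
    p = max_prob * (layer_idx + 1) / num_layers
    # (2) operation sampling via a Bernoulli trial
    if rng.random() >= p:
        return list(tokens), list(range(len(tokens)))   # identity this time
    # (3) sequence shuffling of the input tokens
    perm = list(range(len(tokens)))
    rng.shuffle(perm)
    shuffled = [tokens[i] for i in perm]
    # (4) order restoration: inverse permutation applied to layer outputs
    inverse = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inverse[old_pos] = new_pos
    return shuffled, inverse

toks = list("abcdef")
shuf, inv = slws_layer(toks, layer_idx=11, num_layers=12, rng=random.Random(7))
restored = [shuf[inv[i]] for i in range(len(toks))]
```

Applying the stored inverse permutation always recovers the original order, which is why the method can be deactivated at inference with no architectural change.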
Paperid:3006
Authors:Xin Cheng · Jiabo Ye · Haiyang Xu · Ming Yan · Ji Zhang · Feng Liu · Fei Huang · Lei Feng
Abstract:
Endowing large language models (LLMs) with continual learning (CL) capacities is practically important, as it enables them to dynamically acquire new knowledge over time. Although many effective methods have been proposed for CL of LLMs, they share a common limitation: information leakage (IL), where the task-related information of learned tasks is accessed or reused again. IL not only imposes potential risks on data privacy protection but also significantly hinders the deployment of LLMs in real-world scenarios. To avoid IL while maintaining outstanding CL performance, we propose a novel CL method for LLMs, which first characterizes a parameter-efficient fine-tuning (PEFT) block by a representative feature distribution, and then dynamically selects the appropriate PEFT blocks for each instance based on its similarity with the representative feature distributions. Extensive experiments validate the effectiveness of our method on the CL of LLMs, showcasing its potential to enhance both privacy and adaptability in practical applications.
Paperid:3007
Authors:Zhonglin Cao · Mario Geiger · Allan Costa · Danny Reidenbach · Karsten Kreis · Tomas Geffner · Franco Pellegrini · Guoqing Zhou · Emine Kucukbenli
Abstract:
Fast and accurate generation of molecular conformers is desired for downstream computational chemistry and drug discovery tasks. Currently, training and sampling state-of-the-art diffusion/flow based models for conformer generation require significant computational expense. In this work, we build upon flow-matching and propose two mechanisms for accelerating training and inference of 3D molecular conformer generation. For fast training, we introduce the SO(3)-Averaged Flow training objective, which leads to faster convergence to better generation quality compared to conditional optimal transport flow or Kabsch-aligned flow. For fast inference, we show that reflow methods and distillation of flow-based models enable few-step or even one-step molecular conformer generation with high quality. We demonstrate that a model trained using these two techniques can match the performance of strong baselines with only a fraction of the number of parameters and generation steps. The training techniques proposed in this work show a path towards highly efficient molecular conformer generation with flow-based models.
Paperid:3008
Authors:Junlin Han · Jianyuan Wang · Andrea Vedaldi · Phil Torr · Filippos Kokkinos
Abstract:
Generating high-quality 3D content from text, single images, or sparse view images remains a challenging task with broad applications. Existing methods typically employ multi-view diffusion models to synthesize multi-view images, followed by a feed-forward process for 3D reconstruction. However, these approaches are often constrained by a small and fixed number of input views, limiting their ability to capture diverse viewpoints and, even worse, leading to suboptimal generation results if the synthesized views are of poor quality. To address these limitations, we propose Flex3D, a novel two-stage framework capable of leveraging an arbitrary number of high-quality input views. The first stage consists of a candidate view generation and curation pipeline. In the second stage, the curated views are fed into a Flexible Reconstruction Model (FlexRM), built upon a transformer architecture that can effectively process an arbitrary number of inputs. Through extensive exploration of design and training strategies, we optimize FlexRM to achieve superior performance in both reconstruction and generation tasks. Our results demonstrate that Flex3D achieves state-of-the-art performance, with a user study winning rate of over 92% in 3D generation tasks when compared to several of the latest feed-forward 3D generative models.
Paperid:3009
Authors:Jie Liu · Pan Zhou · Zehao Xiao · Jiayi Shen · Wenzhe Yin · Jan-jakob Sonke · Efstratios Gavves
Abstract:
Interactive 3D segmentation has emerged as a promising solution for generating accurate object masks in complex 3D scenes by incorporating user-provided clicks. However, two critical challenges remain underexplored: (1) effectively generalizing from sparse user clicks to produce accurate segmentations and (2) quantifying predictive uncertainty to help users identify unreliable regions. In this work, we propose \emph{NPISeg3D}, a novel probabilistic framework that builds upon Neural Processes (NPs) to address these challenges. Specifically, NPISeg3D introduces a hierarchical latent variable structure with scene-specific and object-specific latent variables to enhance few-shot generalization by capturing both global context and object-specific characteristics. Additionally, we design a probabilistic prototype modulator that adaptively modulates click prototypes with object-specific latent variables, improving the model’s ability to capture object-aware context and quantify predictive uncertainty. Experiments on four 3D point cloud datasets demonstrate that NPISeg3D achieves superior segmentation performance with fewer clicks while providing reliable uncertainty estimations.
Paperid:3010
Authors:Feng Wu · Tsai Hor Chan · Fuying Wang · Guosheng Yin · Lequan Yu
Abstract:
Various data modalities are common in real-world applications (e.g., EHR, medical images, and clinical notes in healthcare). Thus, it is essential to develop multimodal learning methods to aggregate information from multiple modalities. The main challenge is appropriately aligning and fusing the representations of different modalities into a joint distribution. Existing methods mainly rely on concatenation or the Kronecker product, oversimplifying the interaction structure between modalities and indicating a need to model more complex interactions. Additionally, the joint distribution of latent representations with higher-order interactions is underexplored. The copula is a powerful statistical structure for modelling the interactions between variables, as it bridges the joint distribution and marginal distributions of multiple variables. In this paper, we propose a novel copula-modelling-driven multimodal learning framework, which focuses on learning the joint distribution of various modalities to capture the complex interactions among them. The key idea is to interpret the copula model as a tool to align the marginal distributions of the modalities efficiently. By assuming a Gaussian mixture distribution for each modality and a copula model on the joint distribution, our model can also generate accurate representations for missing modalities. Extensive experiments on public MIMIC datasets demonstrate the superior performance of our model over other competitors. The code is anonymously available at https://anonymous.4open.science/r/CM2-C1FD/README.md.
Paperid:3011
Authors:Zhe Zhang · Mingxiu Cai · Hanxiao Wang · Gaochang Wu · Tianyou Chai · Xiatian Zhu
Abstract:
Unsupervised anomaly detection (UAD) seeks to localize the anomaly mask of an input image with respect to normal samples. Either by reconstructing normal counterparts (reconstruction-based) or by learning an image feature embedding space (embedding-based), existing approaches fundamentally rely on image-level or feature-level matching to derive anomaly scores. Often, such a matching process is inaccurate yet overlooked, leading to sub-optimal detection. To address this issue, we introduce the concept of cost filtering, borrowed from classical matching tasks, such as depth and flow estimation, into the UAD problem. We call this approach CostFilter-AD. Specifically, we first construct a matching cost volume between the input and normal samples, comprising two spatial dimensions and one matching dimension that encodes potential matches. To refine this, we propose a cost volume filtering network, guided by the input observation as an attention query across multiple feature layers, which effectively suppresses matching noise while preserving edge structures and capturing subtle anomalies. Designed as a generic post-processing plug-in, CostFilter-AD can be integrated with either reconstruction-based or embedding-based methods. Extensive experiments on MVTec-AD and VisA benchmarks validate the generic benefits of CostFilter-AD for both single- and multi-class UAD tasks. Code and models will be released.
Paperid:3012
Authors:Chen Xu
Abstract:
We propose a novel method, namely Gaussian Smoothing with a Power-Transformed Objective (GSPTO), that solves global optimization problems in two steps: (1) perform an (exponential) power-$N$ transformation of the not necessarily differentiable objective $f:\mathbb{R}^d\rightarrow \mathbb{R}$ to get $f_N$, and (2) optimize the Gaussian-smoothed $f_N$ with stochastic approximations. Under mild conditions on $f$, for any $\delta>0$, we prove that with a sufficiently large power $N_\delta$, this method converges to a solution in the $\delta$-neighborhood of $f$'s global optimum point, at the iteration complexity of $O(d^4\varepsilon^{-2})$. If we require that $f$ is differentiable and further assume the Lipschitz condition on $f$ and its gradient, the iteration complexity reduces to $O(d^2\varepsilon^{-2})$, which is significantly faster than the standard homotopy method. In most of the experiments performed, our method produces better solutions than other algorithms that also apply the smoothing technique.
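The two steps can be sketched in one dimension with a standard Gaussian-smoothing gradient estimator applied to the exponentially transformed objective. All hyperparameters below are illustrative, and the self-normalization of the estimator is a stabilization choice of this sketch, not necessarily the paper's exact update:

```python
import math
import random

def gsp_step(theta, f, N=8.0, sigma=0.5, lr=0.05, samples=64, rng=None):
    """One stochastic-approximation ascent step on the Gaussian smoothing
    of the power-transformed objective f_N(x) = exp(N * f(x)). Uses the
    smoothing estimator E[f_N(theta + sigma*u) * u] / sigma with
    u ~ N(0,1), self-normalized by E[f_N(theta + sigma*u)] for stability."""
    rng = rng or random.Random(0)
    num = den = 0.0
    for _ in range(samples):
        u = rng.gauss(0.0, 1.0)
        w = math.exp(N * f(theta + sigma * u))   # power-transformed value
        num += w * u
        den += w
    return theta + lr * (num / den) / sigma

# toy non-convex-free case: 1-D objective with global maximum at x = 1
f = lambda x: -(x - 1.0) ** 2
theta = -2.0
rng = random.Random(42)
for _ in range(300):
    theta = gsp_step(theta, f, rng=rng)
```

The exponential transform concentrates the smoothed objective's mass near the global optimum, so the iterates drift from the poor initialization toward x = 1 without ever evaluating a gradient of f.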
Paperid:3013
Authors:Yuhao Sun · Jiacheng Zhang · Zesheng Ye · Chaowei Xiao · Feng Liu
Abstract:
*Diffusion-based purification* (DBP) methods aim to remove adversarial noise from the input sample by first injecting Gaussian noise through a forward diffusion process, and then recovering the clean example through a reverse generative process. In the above process, how much Gaussian noise is injected into the input sample is key to the success of DBP methods, and is controlled by a constant noise level $t^*$ for all samples in existing methods. In this paper, we discover that the optimal $t^*$ for each sample can indeed be different. Intuitively, the cleaner a sample is, the less noise should be injected into it, and vice versa. Motivated by this finding, we propose a new framework, called Sample-specific Score-aware Noise Injection (SSNI). Specifically, SSNI uses a pre-trained score network to estimate how much a data point deviates from the clean data distribution (i.e., score norms). Then, based on the magnitude of the score norms, SSNI applies a reweighting function to adaptively adjust $t^*$ for each sample, achieving sample-specific noise injection. Empirically, incorporating our framework with existing DBP methods results in a notable improvement in both accuracy and robustness on CIFAR-10 and ImageNet-1K, highlighting the necessity of allocating *distinct noise levels to different samples* in DBP methods. Our code is available at: https://anonymous.4open.science/r/SSNI-F746.
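The reweighting step can be sketched as a simple map from a sample's estimated score norm to a per-sample noise level. The linear form and every bound below are illustrative assumptions; the paper's actual reweighting function may differ:

```python
def ssni_noise_level(score_norm, t_min=20, t_max=200,
                     norm_lo=1.0, norm_hi=10.0):
    """Sample-specific noise level in the spirit of SSNI: a sample whose
    score norm says it sits close to the clean data manifold gets a small
    diffusion noise level t*, while a heavily perturbed sample gets a
    large one. Linear interpolation between illustrative bounds."""
    frac = (score_norm - norm_lo) / (norm_hi - norm_lo)
    frac = min(max(frac, 0.0), 1.0)    # clamp to the calibrated range
    return round(t_min + frac * (t_max - t_min))
```

Replacing a single constant `t*` with this per-sample value is the entire interface change SSNI asks of an existing DBP pipeline.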
Paperid:3014
Authors:Qi Yang · Le Yang · Geert Van der Auwera · Zhu Li
Abstract:
Most existing 3D Gaussian Splatting (3DGS) compression schemes focus on producing compact 3DGS representations via implicit data embedding. They have long coding times and highly customized data formats, making widespread deployment difficult. This paper presents a new 3DGS compression framework called HybridGS, which takes advantage of both compact generation and standardized point cloud data encoding. HybridGS first generates compact and explicit 3DGS data. A dual-channel sparse representation is introduced to supervise the primitive position and feature bit depth. It then utilizes a canonical point cloud encoder to perform further data compression and form standard output bitstreams. A simple and effective rate control scheme is proposed to pivot the interpretable data compression scheme. At the current stage, HybridGS does not include any modules aimed at improving 3DGS quality during generation. However, experimental results show that it still provides comparable reconstruction performance against state-of-the-art methods, with evidently higher encoding and decoding speed.
Paperid:3015
Authors:Quan Nguyen · Nishant Mehta · Cristóbal Guzmán
Abstract:
The minimax sample complexity of group distributionally robust optimization (GDRO) has been determined up to a $\log(K)$ factor, where $K$ is the number of groups. In this work, we venture beyond the minimax perspective via a novel notion of sparsity that we call $(\lambda, \beta)$-sparsity. In short, this condition means that at any parameter $\theta$, there is a set of at most $\beta$ groups whose risks at $\theta$ are all at least $\lambda$ larger than the risks of the other groups. To find an $\epsilon$-optimal $\theta$, we show via a novel algorithm and analysis that the $\epsilon$-dependent term in the sample complexity can swap a linear dependence on $K$ for a linear dependence on the potentially much smaller $\beta$. This improvement leverages recent progress in sleeping bandits, showing a fundamental connection between the two-player zero-sum game optimization framework for GDRO and per-action regret bounds in sleeping bandits. We next show an adaptive algorithm which, up to logarithmic factors, obtains a sample complexity bound that adapts to the best $(\lambda, \beta)$-sparsity condition that holds. We also show how to obtain a dimension-free semi-adaptive sample complexity bound with a computationally efficient method. Finally, we demonstrate the practicality of the $(\lambda, \beta)$-sparsity condition and the improved sample efficiency of our algorithms on both synthetic and real-life datasets.
Paperid:3016
Authors:Yurun Yuan · Tengyang Xie
Abstract:
Leveraging more test-time computation has proven to be an effective way to boost the reasoning capabilities of large language models (LLMs). Among various methods, the verify-and-improve paradigm has the advantage of allowing LLMs to dynamically explore solution spaces and incorporate external feedback during inference. However, existing approaches often face limitations like restricted feedback spaces and the absence of dedicated training processes to optimize the collaboration of different parties, leading to suboptimal performance. To address these challenges, we propose DPSDP (Direct Policy Search by Dynamic Programming), a reinforcement learning algorithm that trains a multi-agent LLM system composed of an actor and a critic to iteratively refine responses for reasoning tasks through direct preference learning on self-generated data. We formulate the multi-turn refinement process as a Markov Decision Process and derive a practical algorithm that generalizes to out-of-distribution horizons at test time. In theory, we demonstrate that DPSDP can achieve performance equivalent to any comparator policy covered by the training distribution. We instantiate DPSDP with various base models and evaluate its effectiveness on both in-distribution and out-of-distribution benchmarks. Our experimental results highlight the effectiveness of our method. Specifically, by taking five answer attempts on problems from MATH, Ministral-based models improve their first-turn accuracy from 58.2% to 63.2%, and Llama-3.1-based models improve from 55.8% to 58.4%. We also conduct a comprehensive ablation study, demonstrating the generalizability of out-of-distribution refinement iterations and the advantage over a single-agent LLM.
Paperid:3017
Authors:Xintong Sun · Chi Wei · Minghao Tian · Shiwen Ni
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities, yet ensuring their outputs conform to strict structural or grammatical constraints remains challenging, which is critical in function calling and domain-specific language (DSL) generation. Constrained decoding with a context-free grammar is a flexible approach to guarantee LLMs' adherence to a specific format by dynamically building a token logits mask. However, creating this mask requires checking the validity of every token in the LLM vocabulary at each decoding step, which often incurs significant overhead in existing constrained decoding engines. To address this challenge, we propose $\textbf{ZapFormat}$, a novel $\textbf{dynamic pruning}$ strategy based on the Earley algorithm that identifies and eliminates invalid or redundant Earley states in real time, significantly reducing the memory occupation of the Earley algorithm's states. This further enables us to use a state cache to speed up structured generation across a large number of queries. We implemented ZapFormat in a new constrained decoding engine called Formatron, which also incorporates existing optimizations. Through comprehensive experiments on structured generation tasks, including JSON generation, JSON Schema validation, and semantic parsing, we demonstrate that Formatron not only $\textbf{consistently maintains}$ high-precision compliant outputs but also achieves $\textbf{significant improvements}$ in inference speed of up to 2x compared to state-of-the-art implementations. More importantly, Formatron is generally applicable across various LLM architectures.
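The logits-mask idea behind constrained decoding can be sketched as follows; the toy balanced-parenthesis grammar below stands in for the context-free grammars (and Earley-based checking) the abstract describes, and the function names are illustrative, not Formatron's API.

```python
import math

def build_logits_mask(vocab, is_valid_prefix, prefix):
    """Constrained decoding: at each step, allow only tokens whose
    addition keeps the output a valid prefix under the grammar.
    `is_valid_prefix` stands in for a grammar recognizer (e.g., an
    Earley parser); masked tokens get -inf added to their logits.
    """
    return [0.0 if is_valid_prefix(prefix + tok) else -math.inf
            for tok in vocab]

def balanced_paren_prefix(s):
    """Toy grammar check: is `s` a prefix of some balanced-paren string?"""
    depth = 0
    for c in s:
        depth += 1 if c == "(" else -1
        if depth < 0:
            return False
    return True
```

With an empty prefix, only `"("` survives the mask; after `"("`, both tokens are allowed. Note that the naive version checks every vocabulary token each step, which is exactly the per-step cost the paper's state pruning and caching aim to reduce.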
Paperid:3018
Authors:Haoyang Li · Xin Wang · Zeyang Zhang · Zongyuan Wu · Linxin Xiao · Wenwu Zhu
Abstract:
Self-supervised learning (SSL) on graph-structured data has attracted considerable attention recently. The masked graph autoencoder, a promising generative graph SSL approach that aims to recover masked parts of the input graph data, has shown great success in various downstream graph tasks. However, existing masked graph autoencoders fail to consider that masked edges vary in how difficult they are to recover, and that this difficulty often affects model performance differently, resulting in suboptimal node representations. To tackle this challenge, in this paper we propose a novel curriculum-based self-supervised masked graph autoencoder that captures and leverages the underlying degrees of difficulty of the data dependencies hidden in edges, and designs better mask-reconstruction pretext tasks for learning informative node representations. Specifically, we first design a difficulty measurer to identify the underlying structural degree of difficulty of edges during the masking step. Then, we adopt a self-paced scheduler to determine the order of masking edges, which encourages the graph encoder to learn from easy parts to difficult parts. Finally, the masked edges are gradually incorporated into the reconstruction pretext task, leading to high-quality node representations. Experiments on several real-world node classification and link prediction datasets demonstrate the superiority of our proposed method over state-of-the-art graph self-supervised learning baselines. To the best of our knowledge, this is the first study of curriculum strategies for masked graph autoencoders.
Paperid:3019
Authors:Yunfei Huang · David S. Greenberg
Abstract:
Neural PDE surrogates can improve the cost-accuracy tradeoff of classical solvers, but often generalize poorly to new initial conditions and accumulate errors over time. Physical and symmetry constraints have shown promise in closing this performance gap, but existing techniques for imposing these inductive biases are incompatible with the staggered grids commonly used in computational fluid dynamics. Here we introduce novel input and output layers that respect physical laws and symmetries on the staggered grids, and for the first time systematically investigate how these constraints, individually and in combination, affect the accuracy of PDE surrogates. We focus on two challenging problems: shallow water equations with closed boundaries and decaying incompressible turbulence. Compared to strong baseline surrogates, both types of constraints consistently improve performance across tasks, architectures, autoregressive prediction steps, accuracy measures, and network sizes. Symmetries are more effective, but do not eliminate the need for physical constraints. Doubly-constrained surrogates outperform baselines for the same network and dataset sizes, providing greater improvements than symmetry-based data augmentation or pushforward training, while also benefiting from the pushforward trick. Doubly-constrained surrogates also generalize better to initial conditions and durations beyond the range of the training data.
Paperid:3020
Authors:Zichong Wang · Wenbin Zhang
Abstract:
Graph generation models have shown significant potential across various domains. However, despite their success, these models often inherit societal biases, limiting their adoption in real-world applications. Existing research on fairness in graph generation primarily addresses structural bias, overlooking the critical issue of feature bias. To address this gap, we propose FDGen, a novel approach that defines and mitigates both feature and structural biases in graph generation models. Furthermore, we provide a theoretical analysis of how bias sources in graph data contribute to disparities in graph generation tasks. Experimental results on four real-world datasets demonstrate that FDGen outperforms state-of-the-art methods, achieving notable improvements in fairness while maintaining competitive generation performance.
Paperid:3021
Authors:Qixin Zhang · Wei Huang · Can Jin · Puning Zhao · Yao Shu · Li Shen · Dacheng Tao
Abstract:
Identifying the most representative subset for a close-to-submodular objective while satisfying a predefined partition constraint is a fundamental task with numerous applications in machine learning. However, the existing distorted local-search methods are often hindered by their prohibitive query complexities and the rigid requirement for prior knowledge of difficult-to-obtain structural parameters. To overcome these limitations, we introduce a novel algorithm titled Multinoulli-SCG, which not only is parameter-free, but also achieves the same approximation guarantees as the distorted local-search methods with significantly fewer function evaluations. More specifically, when the objective function is monotone $\alpha$-weakly DR-submodular or $(\gamma,\beta)$-weakly submodular, our Multinoulli-SCG algorithm can attain a value of $(1-e^{-\alpha})\text{OPT}-\epsilon$ or $(\frac{\gamma^{2}(1-e^{-(\beta(1-\gamma)+\gamma^2)})}{\beta(1-\gamma)+\gamma^2})\text{OPT}-\epsilon$ with only $O(1/\epsilon^{2})$ function evaluations, where OPT denotes the optimal value. The core of our \textbf{Multinoulli-SCG} algorithm is an innovative continuous-relaxation framework named Multinoulli Extension (ME), which can effectively convert the discrete subset selection problem subject to partition constraints into a solvable continuous maximization focused on learning the optimal multinoulli priors across the considered partition. In sharp contrast with the well-established multi-linear extension for submodular subset selection, a notable advantage of our proposed ME is its intrinsic capacity to provide a lossless rounding scheme for any set function. Finally, we validate the practical efficacy of our proposed algorithms by applying them to video summarization, Bayesian A-optimal design, and coverage maximization.
Paperid:3022
Authors:Edoardo Cetin · Tianyu Zhao · Yujin Tang
Abstract:
We propose a new finetuning method to provide pretrained large language models (LMs) with the ability to scale test-time compute through the diffusion framework. By increasing the number of diffusion steps, we show our finetuned models achieve monotonically increasing accuracy, directly translating to improved performance across downstream tasks. Furthermore, our finetuned models can expertly answer questions on specific topics by integrating powerful guidance techniques, and autonomously determine the compute required for a given problem by leveraging adaptive ODE solvers. Our method is applicable to any foundation model pre-trained with cross-entropy and does not modify any of its original weights, fully preserving its strong single-step generation capabilities. We show our method can be more effective than, and is fully compatible with, traditional finetuning and search approaches, introducing an orthogonal new direction to unify the strengths of the autoregressive and diffusion frameworks.
Paperid:3023
Authors:Hongyi Wan · Shiyuan Ren · Wei Huang · Miao Zhang · Xiang Deng · Yixin Bao · Liqiang Nie
Abstract:
Continual learning (CL) is crucial for advancing human-level intelligence, but its theoretical understanding, especially regarding factors influencing forgetting, is still relatively limited. This work aims to build a unified theoretical framework for understanding CL using feature learning theory. Different from most existing studies that analyze forgetting under a linear regression model or lazy training, we focus on a more practical two-layer convolutional neural network (CNN) with polynomial ReLU activation for sequential tasks within a signal-noise data model. Specifically, we theoretically reveal how the angle between task signal vectors influences forgetting: acute or small obtuse angles lead to benign forgetting, whereas larger obtuse angles result in harmful forgetting. Furthermore, we demonstrate that the replay method alleviates forgetting by expanding the range of angles corresponding to benign forgetting. Our theoretical results suggest that mid-angle sampling, which selects examples with moderate angles to the prototype, can enhance the replay method's ability to mitigate forgetting. Experiments on synthetic and real-world datasets confirm our theoretical results and highlight the effectiveness of our mid-angle sampling strategy.
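The quantity driving the result, the angle between two task signal vectors, is straightforward to compute; this sketch only illustrates the measured quantity, not the paper's analysis or its benign/harmful cutoff.

```python
import numpy as np

def signal_angle_deg(v1, v2):
    """Angle in degrees between two task signal vectors.

    Per the abstract, acute or small obtuse angles correspond to
    benign forgetting, while larger obtuse angles correspond to
    harmful forgetting (the exact threshold comes from the theory).
    """
    v1, v2 = np.asarray(v1, float), np.asarray(v2, float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # Clip to guard against floating-point drift outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
```

For instance, orthogonal task signals give an angle of 90 degrees, an acute case under the paper's taxonomy.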
Paperid:3024
Authors:Xuehang Guo · Xingyao Wang · Yangyi Chen · Sha Li · Chi Han · Manling Li · Heng Ji
Abstract:
Software engineering (SE) is increasingly collaborative, with developers working together on shared, complex codebases. Effective collaboration in shared environments requires participants, whether humans or AI agents, to stay on the same page as their environment evolves. When a collaborator's understanding diverges from the current state, what we term the out-of-sync challenge, the collaborator's actions may fail. This occurs frequently in SE when developers unknowingly work with outdated codebase versions, leading to failed builds and integration issues. In this work, we introduce SyncMind, a framework that systematically defines the out-of-sync problem faced by large language model (LLM) agents in collaborative software engineering (CSE). Based on SyncMind, we create SyncBench, a benchmark featuring 24,332 instances of agent out-of-sync scenarios in real-world CSE derived from 21 popular GitHub repositories with executable verification tests. Experiments on SyncBench uncover critical insights into existing LLM agents' capabilities and limitations. Besides a substantial performance gap among agents (from ≤ 3.33% for Llama-3.1 agents to ≥ 28.18% for Claude-3.5-Sonnet), their consistently low collaborative initiative (≤ 4.86%) suggests fundamental limitations of existing LLMs in CSE. However, when collaboration occurs, it positively correlates with out-of-sync recovery success. Minimal performance differences in agents' resource-aware out-of-sync recoveries further reveal their significant lack of resource awareness and adaptability, shedding light on future resource-efficient collaborative systems.
Paperid:3025
Authors:Chenyang Zhang · Xiaoyu Zhang · Jian Lou · KAI WU · Zilong Wang · Xiaofeng Chen
Abstract:
Multimodal Retrieval-Augmented Generation (MuRAG) systems have been widely applied to Large Vision-Language Models (LVLMs) to enhance their generation ability. However, the reliance on external multimodal knowledge databases renders MuRAG systems vulnerable to malicious poisoning attacks. In this paper, we introduce PoisonedEye, the first knowledge poisoning attack designed for MuRAG systems. Our attack manipulates the response of the MuRAG system to a target query by injecting only one poison sample into the knowledge database. To construct the poison sample, we identify two key properties of the retrieval and generation process and derive a solution that satisfies both. Besides, we also introduce a class-query-targeted poisoning attack, a more generalized strategy that extends the poisoning effect to an entire class of target queries. Extensive experiments on multiple query datasets, retrievers, and LVLMs demonstrate that our attack is highly effective at compromising MuRAG systems.
Paperid:3026
Authors:Jiaqi Wang · Jingtao Li · Weiming Zhuang · Chen Chen · Lingjuan Lyu · Fenglong Ma
Abstract:
Vision foundation models (FMs) like CLIP have demonstrated exceptional capabilities in visual and linguistic understanding, particularly in zero-shot inference tasks within web-based applications. However, these models struggle when encountering data that deviates significantly from their training samples, necessitating fine-tuning. Traditional fine-tuning approaches are often impractical in centralized settings due to the privacy-sensitive nature of data across web platforms and WoT devices. Federated learning (FL), combined with parameter-efficient fine-tuning (PEFT), offers a potential solution, but current methods face challenges with domain-specific characteristics and out-of-domain generalization in distributed web and WoT systems. To address these challenges, we propose Federated Adapter Generalization (FedAG), a cross-silo federated fine-tuning approach that leverages multiple fine-grained adapters to capture domain-specific knowledge while enhancing out-of-domain generalization. Our method integrates quality-aware in-domain mutual learning and attention-regularized cross-domain learning to effectively handle domain shifts. Extensive experiments with the CLIP model on domain-shifting datasets (ImageCLEF-DA, Office-Home, and DomainNet) showcase the superior performance of FedAG in both in-domain and out-of-domain scenarios. We envision this work as a significant step toward generalizing CLIP for out-of-domain tasks, ensuring privacy-preserving learning and adaptation in web-based applications.
Paperid:3027
Authors:Zehong Ma · Shiliang Zhang · Longhui Wei · Qi Tian
Abstract:
Traditional approaches to adapting multimodal large language models (MLLMs) to new tasks have relied heavily on fine-tuning. This paper introduces Efficient Multi-Modal Long Context Learning (EMLoC), a novel training-free alternative that embeds demonstration examples directly into the model input. EMLoC offers a more efficient, flexible, and scalable solution for task adaptation. Because extremely lengthy inputs introduce prohibitive computational and memory overhead, EMLoC contributes a chunk-wise compression mechanism combined with layer-wise adaptive pruning. It condenses long-context multimodal inputs into compact, task-specific memory representations. By adaptively pruning tokens at each layer under a Jensen-Shannon divergence constraint, our method achieves a dramatic reduction in inference complexity without sacrificing performance. This approach is the first to seamlessly integrate compression and pruning techniques for multi-modal long-context learning, offering a scalable and efficient solution for real-world applications. Extensive experiments on diverse vision-language benchmarks demonstrate that EMLoC achieves performance on par with or superior to naive long-context approaches. Our results highlight the potential of EMLoC as a groundbreaking framework for efficient and flexible adaptation of multi-modal models in resource-constrained environments.
Paperid:3028
Authors:Heyang Zhao · Chenlu Ye · Wei Xiong · Quanquan Gu · Tong Zhang
Abstract:
Recent advances in Reinforcement Learning from Human Feedback (RLHF) have shown that KL-regularization plays a pivotal role in improving the efficiency of RL fine-tuning for large language models (LLMs). Despite its empirical advantage, the theoretical difference between KL-regularized RL and standard RL remains largely under-explored. While there is a recent line of work on the theoretical analysis of KL-regularized objectives in decision making \citep{xiong2024iterative, xie2024exploratory, zhao2024sharp}, these analyses either reduce to the traditional RL setting or rely on strong coverage assumptions. In this paper, we propose an optimism-based KL-regularized online contextual bandit algorithm, and provide a novel analysis of its regret. By carefully leveraging the benign optimization landscape induced by the KL-regularization and the optimistic reward estimation, our algorithm achieves an $\mathcal{O}\big(\eta\log (N_{\mathcal R} T)\cdot d_{\mathcal R}\big)$ logarithmic regret bound, where $\eta, N_{\mathcal R}, T, d_{\mathcal R}$ denote the KL-regularization parameter, the cardinality of the reward function class, the number of rounds, and the complexity of the reward function class, respectively. Furthermore, we extend our algorithm and analysis to reinforcement learning by developing a novel decomposition over transition steps, and obtain a similar logarithmic regret bound.
Paperid:3029
Authors:Xingyi Yang · Constantin Venhoff · Ashkan Khakzar · Christian Schroeder de Witt · Puneet Dokania · Adel Bibi · Phil Torr
Abstract:
Neurons in large language models often exhibit \emph{polysemanticity}, simultaneously encoding multiple unrelated concepts and obscuring interpretability. Instead of relying on post-hoc methods, we present \textbf{MoE-X}, a mixture-of-experts (MoE) language model designed to be \emph{intrinsically} interpretable. Our approach is motivated by the observation that, in language models, wider networks with sparse activations are more likely to capture interpretable factors. However, directly training such large sparse networks is computationally prohibitive. MoE architectures offer a scalable alternative by activating only a subset of experts for any given input, inherently aligning with interpretability objectives. In MoE-X, we establish this connection by rewriting the MoE layer as an equivalent sparse, large MLP. This approach enables efficient scaling of the hidden size while maintaining sparsity. To further enhance interpretability, we enforce sparse activation within each expert and redesign the routing mechanism to prioritize experts with the highest activation sparsity. These designs ensure that only the most salient features are routed to and processed by the experts. We evaluate MoE-X on chess and natural language tasks, showing that it achieves performance comparable to dense models while significantly improving interpretability. MoE-X achieves a perplexity better than GPT-2, with interpretability surpassing even sparse autoencoder (SAE)-based approaches.
Paperid:3030
Authors:Yu-Zhe Shi · Mingchen Liu · Hanlu Ma · Qiao Xu · Huamin Qu · Kun He · Lecheng Ruan · Qining Wang
Abstract:
Industrial designers have long sought a natural and intuitive way to achieve targeted control of prototype models: using simple natural language instructions to configure and adjust the models seamlessly according to their intentions, without relying on complex modeling commands. While Large Language Models have shown promise in this area, their potential for controlling prototype models through language remains partially underutilized. This limitation stems from gaps between designers' languages and modeling languages, including mismatches in abstraction levels, fluctuations in semantic precision, and divergences in lexical scope. To bridge these gaps, we propose an interface architecture that serves as a medium between the two languages. Grounded in design principles derived from a systematic investigation of fast prototyping practices, we devise the interface's operational mechanism and develop an algorithm for its automated domain specification. Both machine-based evaluations and human studies on fast prototyping across various product design domains demonstrate the interface's potential to function as an auxiliary module for Large Language Models, enabling precise and effective targeted control of prototype models.
Paperid:3031
Authors:Zikui Cai · Yaoteng Tan · M. Salman Asif
Abstract:
Machine unlearning methods aim to remove sensitive or unwanted content from trained models, but typically demand extensive model updates at significant computational cost while potentially degrading model performance on both related and unrelated tasks. We propose Single Layer Unlearning Gradient (SLUG) as an efficient method to unlearn targeted information by updating a single critical layer using a one-time gradient computation. SLUG uses layer importance and gradient alignment metrics to identify the optimal layer for targeted information removal while preserving the model utility. We demonstrate the effectiveness of SLUG for CLIP, Stable Diffusion, and vision-language models in removing concrete (e.g., identities and objects) and abstract concepts (e.g., artistic styles). On the UnlearnCanvas benchmark, SLUG achieves comparable unlearning performance to existing methods while requiring significantly less computational resources. Our proposed approach offers a practical solution for targeted unlearning that is computationally efficient and precise. Our code is available at https://anonymous.4open.science/r/SLUG-6CDF.
Paperid:3032
Authors:Bartosz Cywiński · Kamil Deja
Abstract:
Diffusion models, while powerful, can inadvertently generate harmful or undesirable content, raising significant ethical and safety concerns. Recent machine unlearning approaches offer potential solutions but often lack transparency, making it difficult to understand the changes they introduce to the base model. In this work, we introduce SAeUron, a novel method leveraging features learned by sparse autoencoders (SAEs) to remove unwanted concepts in text-to-image diffusion models. First, we demonstrate that SAEs, trained in an unsupervised manner on activations from multiple denoising timesteps of the diffusion model, capture sparse and interpretable features corresponding to specific concepts. Building on this, we propose a feature selection method that enables precise interventions on model activations to block targeted content while preserving overall performance. Evaluation on the competitive UnlearnCanvas benchmark for object and style unlearning highlights SAeUron's state-of-the-art performance. Moreover, we show that with a single SAE, we can remove multiple concepts simultaneously, and that, in contrast to other methods, SAeUron mitigates the possibility of generating unwanted content even under adversarial attack.
Paperid:3033
Authors:Su Hyeong Lee · Manzil Zaheer · Tian Li
Abstract:
Distributed optimization has become the default training paradigm in modern machine learning due to the growing scale of models and datasets. To mitigate communication overhead, local updates are often applied before global aggregation, resulting in a nested optimization approach with inner and outer steps. However, heavy-tailed stochastic gradient noise remains a significant challenge, particularly in attention-based models, hindering effective training. In this work, we propose TailOPT, an efficient framework designed to address heavy-tailed noise by leveraging clipping techniques or adaptive optimization. We establish convergence guarantees for the TailOPT framework under heavy-tailed noise with potentially unbounded gradient variance and local updates. Among its variants, we highlight $Bi^2Clip$, a scalable algorithm performing coordinate-wise clipping at both the inner and outer optimization layers, achieving adaptive-like performance (e.g., Adam) without the cost of maintaining additional gradient statistics. Empirically, TailOPT, including $Bi^2Clip$, demonstrates superior performance on several language tasks and models, outperforming state-of-the-art methods.
Paperid:3034
Authors:Tingle Li · Baihe Huang · Xiaobin Zhuang · Dongya Jia · Jiawei Chen · Yuping Wang · Zhuo Chen · Yuxuan Wang · Gopala Anumanchipalli
Abstract:
Generating accurate sounds for complex audiovisual scenes is challenging, especially in the presence of multiple objects and sound sources. In this paper, we propose an interactive object-aware audio generation model that grounds sound generation in user-selected visual objects within images. Our method integrates object-centric learning into a conditional latent diffusion model, which learns to associate image regions with their corresponding sounds through multi-modal attention. At test time, our model employs image segmentation to allow users to interactively generate sounds at the object level. We theoretically validate that our attention mechanism functionally approximates test-time segmentation masks, ensuring the generated audio aligns with selected objects. Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds.
Paperid:3035
Authors:Yihang Wang · Yuying Qiu · Peng Chen · Kai Zhao · Yang Shu · Zhongwen Rao · Lujia Pan · Bin Yang · Chenjuan Guo
Abstract:
With the growing availability of multi-domain time series data, there is an increasing demand for general forecasting models pre-trained on multi-source datasets to support diverse downstream prediction scenarios. Existing time series foundation models primarily focus on scaling up pre-training datasets and model sizes to enhance generalization performance. In this paper, we take a different approach by addressing two critical aspects of general forecasting models: (1) how to derive unified representations from heterogeneous multi-domain time series data, and (2) how to effectively capture domain-specific features to enable adaptive transfer across various downstream scenarios. To address the first aspect, we propose Decomposed Frequency Learning as the pre-training task, which leverages frequency-based masking and reconstruction to decompose coupled semantic information in time series, resulting in unified representations across domains. For the second aspect, we introduce the Time Series Register, which captures domain-specific representations during pre-training and enhances adaptive transferability to downstream tasks. Our model achieves state-of-the-art forecasting performance on seven real-world benchmarks, demonstrating remarkable few-shot and zero-shot capabilities.
Paperid:3036
Authors:Xiang Hu · Zhihao Teng · Jun Zhao · Wei Wu · Kewei Tu
Abstract:
Despite the success of Transformers, handling longer contexts remains challenging due to the limited length generalization and quadratic complexity of self-attention, which often requires post-training with a larger attention window, significantly increasing computational and memory costs. In this paper, we propose a novel attention mechanism based on dynamic context, Grouped Cross Attention (GCA), which can generalize to 1000$\times$ the pre-training context length while maintaining the ability to access distant information with a constant attention window size. For a given input sequence, we split it into chunks and use each chunk to retrieve the top-$k$ relevant past chunks for subsequent text generation. Specifically, unlike most previous works that use an off-the-shelf retriever, our key innovation allows the retriever to learn, in an end-to-end manner, how to retrieve past chunks that better minimize the auto-regressive loss of subsequent tokens, which adapts better to causal language models. Such a mechanism accommodates retrieved chunks with a fixed-size attention window to achieve long-range information access, significantly reducing computational and memory costs during training and inference. Experiments show that GCA-based models achieve near-perfect accuracy in passkey retrieval for 16M context lengths, which is $1000\times$ the training length.
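The retrieval step can be sketched in a few lines; cosine scoring over chunk embeddings here is an illustrative stand-in for the learned retriever that the paper trains end-to-end against the auto-regressive loss.

```python
import numpy as np

def retrieve_topk_chunks(query_emb, past_chunk_embs, k=2):
    """Toy sketch of GCA-style chunk retrieval.

    Score every past chunk embedding against the current chunk and
    keep only the top-k, so cross-attention sees a constant-size
    window regardless of how long the context grows. Cosine
    similarity is an illustrative choice, not the paper's retriever.
    """
    q = np.asarray(query_emb, dtype=float)
    P = np.asarray(past_chunk_embs, dtype=float)
    q = q / np.linalg.norm(q)
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    scores = P @ q
    # Indices of the k highest-scoring past chunks, in chunk order.
    return sorted(np.argsort(scores)[::-1][:k].tolist())
```

Because only $k$ chunks enter the attention window, the per-step attention cost stays constant even as the number of past chunks grows, which is the source of the constant-window property the abstract describes.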
Paperid:3037
Authors:Cristobal Eyzaguirre · Igor Vasiljevic · Achal Dave · Jiajun Wu · Rareș Ambruș · Thomas Kollar · Juan Carlos Niebles · Pavel Tokmakov
Abstract:
We propose a data-driven approach to analyzing query complexity in Video Question Answering (VideoQA). Previous efforts in benchmark design have relied on human expertise to design challenging questions, yet we experimentally show that humans struggle to predict which questions are difficult for machine learning models. Our automatic approach leverages recent advances in code generation for visual question answering, using the complexity of generated code as a proxy for question difficulty. We demonstrate that this measure correlates significantly better with model performance than human estimates. To operationalize this insight, we propose an algorithm for estimating question complexity from code. It identifies fine-grained primitives that correlate with the hardest questions for any given set of models, making it easy to scale to new approaches in the future. Finally, to further illustrate the utility of our method, we extend it to automatically generate complex questions, constructing a new benchmark that is 1.9 times harder than the popular NExT-QA.
Paperid:3038
Authors:Songlin Zhai · Yuan Meng · Yongrui Chen · Yiwei Wang · Guilin Qi
Abstract:
Large Language Models (LLMs) have revolutionized various natural language processing tasks with their remarkable capabilities. However, challenges persist in effectively integrating new knowledge into LLMs without compromising their performance, particularly in the areas of long-term memory management and knowledge updates. In this paper, we introduce a novel memory augmentation framework that conceptualizes memory as a peripheral device (akin to physical RAM), with the LLM serving as the information processor (analogous to a CPU). Drawing inspiration from RAM architecture, we design memory as a sequence of memory banks, each modeled using Kolmogorov-Arnold Networks (KAN) to ensure smooth state transitions. Memory read and write operations are dynamically controlled by query signals derived from the LLMs' internal states, closely mimicking the interaction between a CPU and RAM. Furthermore, a dedicated memory bank is used to generate a mask value that indicates the relevance of the retrieved data, analogous to sign bits in binary coding schemes. The retrieved memory data is then integrated as a prefix to enhance subsequent downstream tasks. Extensive experiments on knowledge-based model editing and long-context question answering validate the effectiveness of the proposed method.
Paperid:3039
Authors:Haoming Yang · Ke Ma · Xiaojun Jia · Yingfei Sun · Qianqian Xu · Qingming Huang
Abstract:
Despite the remarkable performance of Large Language Models (LLMs), they remain vulnerable to jailbreak attacks, which can compromise their safety mechanisms. Existing studies often rely on brute-force optimization or manual design, failing to uncover potential risks in real-world scenarios. To address this, we propose a novel jailbreak attack framework, ICRT, inspired by heuristics and biases in human cognition. Leveraging the simplicity effect, we employ cognitive decomposition to reduce the complexity of malicious prompts. Simultaneously, relevance bias is utilized to reorganize prompts, enhancing semantic alignment and inducing harmful outputs effectively. Furthermore, we introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm by employing ranking aggregation methods such as Elo, HodgeRank, and Rank Centrality to comprehensively quantify the harmfulness of generated content. Experimental results demonstrate that our approach consistently bypasses the safety mechanisms of mainstream LLMs and generates actionable high-risk content. This framework provides detailed insights into the potential risks of jailbreak attacks and contributes to developing more robust defense strategies.
Paperid:3040
Authors:Aaron Havens · Benjamin Kurt Miller · Bing Yan · Carles Domingo i Enrich · Anuroop Sriram · Daniel S. Levine · Brandon Wood · Bin Hu · Brandon Amos · Brian Karrer · Xiang Fu · Guan-Horng Liu · Ricky T. Q. Chen
Abstract:
We introduce Adjoint Sampling, a highly scalable and efficient algorithm for learning diffusion processes that sample from unnormalized densities, or energy functions. It is the first of its kind in allowing significantly more gradient updates than the number of energy evaluations and model samples, allowing us to scale to much larger problem settings than previously explored by similar methods. Our framework is theoretically grounded in stochastic optimal control and provably converges without the need for corrective measures that push samples towards the target distribution such as sequential Monte Carlo. We show how to incorporate key symmetries, as well as periodic boundary conditions, for modeling molecules in both Cartesian and torsional representations. We demonstrate the effectiveness of our approach through extensive experiments on classical energy functions, and further scale up to neural network-based energy models where we perform amortized conformer generation across many molecular systems. To encourage research in developing highly-scalable sampling methods, we open source these challenging benchmarks, where successful methods can directly impact progress in computational chemistry.
Paperid:3041
Authors:MingCai Chen · Baoming Zhang · Zongbo Han · Yuntao Du · Wenyu Jiang · Yanmeng Wang · Shuai Feng · Bingkun BAO
Abstract:
Modern machine learning applications are characterized by the increasing size of deep models and the growing diversity of data modalities. This trend underscores the importance of efficiently adapting pretrained multi-modal models to the test distribution in real time, i.e., multi-modal test-time adaptation. In practice, the magnitudes of multi-modal shifts vary because multiple data sources interact with the impact factor in diverse manners. In this research, we define a new practical scenario as uni-modal distribution shift, where the distribution shift influences only one modality, leaving the others unchanged. Through theoretical and empirical analyses, we demonstrate that the presence of such shift impedes multi-modal fusion and leads to the negative transfer phenomenon in existing test-time adaptation techniques. To flexibly combat this unique shift, we propose a selective adaptation schema that incorporates multiple modality-specific adapters to accommodate potential shifts and a "router" module that determines which modality requires adaptation. Finally, we validate the effectiveness of our proposed method through extensive experimental evaluations.
Paperid:3042
Authors:Rudy Morel · Jiequn Han · Edouard Oyallon
Abstract:
We address the problem of predicting the next state of a dynamical system governed by unknown temporal partial differential equations (PDEs) using only a short trajectory. While standard transformers provide a natural black-box solution to this task, the presence of a well-structured evolution operator in the data suggests a more tailored and efficient approach. Specifically, when the PDE is fully known, classical numerical solvers can evolve the state accurately with only a few parameters. Building on this observation, we introduce DISCO, a model that uses a large hypernetwork to process a short trajectory and generate the parameters of a much smaller operator network, which then predicts the next state through time integration. Our framework decouples dynamics estimation—i.e., DISCovering an evolution operator from a short trajectory—from state prediction—i.e., evolving this operator. Experiments show that pretraining our model on diverse physics datasets achieves state-of-the-art performance while requiring significantly fewer epochs. Moreover, it generalizes well and remains competitive when fine-tuned on downstream tasks.
Paperid:3043
Authors:Yong Liang Goh · Zhiguang Cao · Yining Ma · Jianan Zhou · Mohammed Haroon Dupty · Wee Sun Lee
Abstract:
Recent advances toward foundation models for routing problems have shown great potential of a unified deep model for various VRP variants. However, they overlook the complex real-world customer distributions. In this work, we advance the Multi-Task VRP (MTVRP) setting to the more realistic yet challenging Multi-Task Multi-Distribution VRP (MTMDVRP) setting, and introduce SHIELD, a novel model that leverages both sparsity and hierarchy principles. Building on a deeper decoder architecture, we first incorporate the Mixture-of-Depths (MoD) technique to enforce sparsity. This improves both efficiency and generalization by allowing the model to dynamically select nodes to use or skip each decoder layer, providing the needed capacity to adaptively allocate computation for learning the task/distribution specific and shared representations. We also develop a context-based clustering layer that exploits the presence of hierarchical structures in the problems to produce better local representations. These two designs inductively bias the network to identify key features that are common across tasks and distributions, leading to significantly improved generalization on unseen ones. Our empirical results demonstrate the superiority of our approach over existing methods on 9 real-world maps with 16 VRP variants each.
Paperid:3044
Authors:Erez Peterfreund · Ofir Lindenbaum · Yuval Kluger · Boris Landa
Abstract:
Embedding and visualization techniques are essential for analyzing high-dimensional data, but they often struggle with complex data governed by multiple latent variables, potentially distorting key structural characteristics. This paper considers scenarios where the observed features can be partitioned into mutually exclusive subsets, each capturing a different smooth substructure. In such cases, visualizing the data based on each feature partition can better characterize the underlying processes and structures in the data, leading to improved interpretability. To partition the features, we propose solving an optimization problem that promotes a graph Laplacian-based smoothness in each partition, thus prioritizing partitions with simpler geometric structures. Our approach generalizes traditional embedding and visualization techniques, allowing them to learn multiple embeddings simultaneously. We establish that if several independent or partially dependent manifolds are embedded in distinct feature subsets in high-dimensional space, then our framework can reliably identify the correct subsets with theoretical guarantees. Finally, we demonstrate the effectiveness of our approach in extracting multiple low-dimensional structures and partially independent processes from both simulated and real data.
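The graph Laplacian-based smoothness that the partition objective promotes is the quadratic form $\sum_f f^\top L f$ over feature columns $f$. The sketch below is our own minimal rendering of that quantity, not the paper's optimization problem:

```python
import numpy as np

def laplacian_smoothness(X: np.ndarray, W: np.ndarray) -> float:
    """Sum of f^T L f over the feature columns f of X, where L = D - W is
    the unnormalized graph Laplacian of the sample-affinity matrix W.
    Lower values mean the features vary more smoothly over the graph."""
    L = np.diag(W.sum(axis=1)) - W
    return float(np.trace(X.T @ L @ X))
```

A partition objective would then prefer feature subsets whose columns jointly score low on a graph built from those same features, favoring simpler geometric structure per subset.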
Paperid:3045
Authors:Weihao Tan · Wentao Zhang · Xinrun Xu · Haochong Xia · gang Ding · Boyu Li · Bohan Zhou · Junpeng Yue · Jiechuan Jiang · Yewen Li · Ruyi An · Molei Qin · Chuqiao Zong · Longtao Zheng · YuJie Wu · Xiaoqiang Chai · Yifei Bi · Tianbao Xie · Pengjie Gu · Xiyun Li · Ceyao Zhang · Long Tian · Chaojie Wang · Xinrun Wang · Börje F. Karlsson · Bo An · Shuicheng YAN · Zongqing Lu
Abstract:
Despite their success in specific scenarios, existing foundation agents still struggle to generalize across various virtual scenarios, mainly due to the dramatically different encapsulations of environments with manually designed observation and action spaces. To handle this issue, we propose the General Computer Control (GCC) setting to restrict foundation agents to interact with software through the most unified and standardized interface, i.e., using screenshots as input and keyboard and mouse actions as output. We introduce Cradle, a modular and flexible LMM-powered framework, as a preliminary attempt towards GCC. Enhanced by six key modules, Information Gathering, Self-Reflection, Task Inference, Skill Curation, Action Planning, and Memory, Cradle is able to understand input screenshots and output executable code for low-level keyboard and mouse control after high-level planning and information retrieval, so that Cradle can interact with any software and complete long-horizon complex tasks without relying on any built-in APIs. Experimental results show that Cradle exhibits remarkable generalizability and impressive performance across four previously unexplored commercial video games (Red Dead Redemption 2, Cities: Skylines, Stardew Valley and Dealer's Life 2), five software applications (Chrome, Outlook, Feishu, Meitu and CapCut), and a comprehensive benchmark, OSWorld. With a unified interface to interact with any software, Cradle greatly extends the reach of foundation agents, thus paving the way for generalist agents.
Paperid:3046
Authors:Shijie Li · Weiqiang He · Ruobing Bai · Pan Peng
Abstract:
Hierarchical clustering is a widely used method for unsupervised learning with numerous applications. However, in the application of modern algorithms, the datasets studied are usually large and dynamic. If a hierarchical clustering algorithm is sensitive to small perturbations of the dataset, its usability will be greatly reduced. In this paper, we focus on the hierarchical $k$-median clustering problem, which bridges hierarchical and centroid-based clustering while offering theoretical appeal, practical utility, and improved interpretability. We analyze the average sensitivity of algorithms for this problem by measuring the expected change in the output when a random data point is deleted. We propose an efficient algorithm for hierarchical $k$-median clustering and theoretically prove its low average sensitivity and high clustering quality. Additionally, we show that single linkage clustering and a deterministic variant of the CLNSS algorithm exhibit high average sensitivity, making them less stable. Finally, we validate the robustness and effectiveness of our algorithm through experiments.
Paperid:3047
Authors:Weiwei Li · Haitao Wu · Yunan Lu · Xiuyi Jia
Abstract:
Label distribution learning (LDL) is a powerful learning paradigm that emulates label polysemy by assigning label distributions over the label space. However, existing LDL evaluation metrics struggle to capture meaningful performance differences due to their insensitivity to subtle distributional changes, and existing LDL learning objectives often exhibit biases by disproportionately emphasizing a small subset of samples with extreme predictions. As a result, the LDL metrics lose their discriminability, and the LDL objectives are also at risk of overfitting. In this paper, we propose DeltaLDL, a percentage of predictions that are approximately correct within the context of LDL, as a solution to the above problems. DeltaLDL can serve as a novel evaluation metric, which is parameter-free and reflects more on real performance improvements. DeltaLDL can also serve as a novel learning objective, which is differentiable and encourages most samples to be predicted as approximately correct, thereby mitigating overfitting. Our theoretical analysis and empirical results demonstrate the effectiveness of the proposed solution.
Paperid:3048
Authors:Haitong Ma · Tianyi Chen · Kai Wang · Na Li · Bo Dai
Abstract:
Diffusion policies have achieved superior performance in imitation learning and offline reinforcement learning (RL) due to their rich expressiveness. However, the vanilla diffusion training procedure requires samples from the target distribution, which is impossible in online RL since we cannot sample from the optimal policy, making training diffusion policies highly nontrivial in online RL. Backpropagating the policy gradient through the diffusion process incurs huge computational costs and instability, making it expensive and impractical. To enable efficient diffusion policy training for online RL, we propose Soft Diffusion Actor-Critic (SDAC), exploiting the viewpoint of diffusion models as noise-perturbed energy-based models. The proposed SDAC relies solely on the state-action value function as the energy function to train diffusion policies, bypassing sampling from the optimal policy while maintaining lightweight computations. We conducted comprehensive comparisons on MuJoCo benchmarks. The empirical results show that SDAC outperforms all recent online RL methods with diffusion policies on most tasks, and improves more than 120\% over soft actor-critic on complex locomotion tasks such as Humanoid and Ant.
Paperid:3049
Authors:Dun Ma · Jianguo Chen · Wenguo Yang · Suixiang Gao · Shengminjie Chen
Abstract:
In recent years, the pursuit of higher expressive power in graph neural networks (GNNs) has often led to more complex aggregation mechanisms and deeper architectures. To address these issues, we have identified redundant structures in GNNs, and by pruning them, we propose pruned variants of MP-GNNs, K-Path GNNs, and K-Hop GNNs based on their original architectures. We show that 1) Although some structures are pruned in Pruned MP-GNNs and Pruned K-Path GNNs, their expressive power has not been compromised. 2) K-Hop MP-GNNs and their pruned architecture exhibit equivalent expressiveness on regular and strongly regular graphs. 3) The complexity of pruned K-Path GNNs and pruned K-Hop GNNs is lower than that of MP-GNNs, yet their expressive power is higher. Experimental results validate our refinements, demonstrating competitive performance across benchmark datasets with improved efficiency.
Paperid:3050
Authors:Kevin Galim · Wonjun Kang · Yuchen Zeng · HYUNG IL KOO · Kangwook Lee
Abstract:
Deep State Space Models (SSMs), such as Mamba (Gu & Dao, 2024), have become powerful tools for language modeling, offering high performance and linear scalability with sequence length. However, the application of parameterefficient fine-tuning (PEFT) methods to SSM-based models remains underexplored. We start by investigating two fundamental questions on existing PEFT methods: (i) How do they perform on SSM-based models? (ii) Which parameters should they target for optimal results? Our analysis shows that LoRA and its variants consistently outperform all other PEFT methods. While LoRA is effective for linear projection matrices, it fails on SSM modules—yet still outperforms other methods applicable to SSMs, indicating their limitations. This underscores the need for a specialized SSM tuning approach. To address this, we propose SDT, a PEFT method tailored for SSM modules. Combining SDT for SSMs with LoRA for linear projection matrices, we achieve state-of-the-art performance across extensive experiments.
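For context, LoRA (which the abstract finds effective for linear projection matrices but not for SSM modules) freezes a weight $W$ and learns a low-rank update $W + \frac{\alpha}{r}BA$. The sketch below is a generic LoRA illustration with our own initialization scale; it is not the proposed SDT method:

```python
import numpy as np

def lora_init(d_out: int, d_in: int, r: int, rng: np.random.Generator):
    """A is Gaussian-initialized and B zero-initialized, so the update
    B @ A starts at zero and training begins from the frozen model."""
    A = rng.normal(0.0, 0.02, size=(r, d_in))
    B = np.zeros((d_out, r))
    return A, B

def lora_forward(x, W, A, B, alpha: float, r: int):
    """Forward pass through the adapted layer: x @ (W + (alpha/r) B A)^T."""
    return x @ (W + (alpha / r) * (B @ A)).T
```

Only $A$ and $B$ are trained, so the number of tunable parameters scales with $r(d_{\text{in}} + d_{\text{out}})$ rather than $d_{\text{in}} d_{\text{out}}$.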
Paperid:3051
Authors:Ming Li · Yukang Cheng · Lu Bai · Feilong Cao · Ke Lv · Jiye Liang · Pietro Lió
Abstract:
The growing demand for personalized learning underscores the importance of accurately predicting students' future performance to support tailored education and optimize instructional strategies. Traditional approaches predominantly focus on temporal modeling using historical response records and learning trajectories. While effective, these methods often fall short in capturing the intricate interactions between students and learning content, as well as the subtle semantics of these interactions. To address these gaps, we present EduLLM, the first framework to leverage large language models in combination with hypergraph learning for student performance prediction. The framework incorporates FraS-HNN ($\underline{\mbox{Fra}}$melet-based $\underline{\mbox{S}}$igned $\underline{\mbox{H}}$ypergraph $\underline{\mbox{N}}$eural $\underline{\mbox{N}}$etworks), a novel spectral-based model for signed hypergraph learning, designed to model interactions between students and multiple-choice questions. In this setup, students and questions are represented as nodes, while response records are encoded as positive and negative signed hyperedges, effectively capturing both structural and semantic intricacies of personalized learning behaviors. FraS-HNN employs framelet-based low-pass and high-pass filters to extract multi-frequency features. EduLLM integrates fine-grained semantic features derived from LLMs, synergizing with signed hypergraph representations to enhance prediction accuracy. Extensive experiments conducted on multiple educational datasets demonstrate that EduLLM significantly outperforms state-of-the-art baselines, validating the novel integration of LLMs with FraS-HNN for signed hypergraph learning.
Paperid:3052
Authors:Xinyu Guan · Li Lyna Zhang · Yifei Liu · Ning Shang · Youran Sun · Yi Zhu · Fan Yang · Mao Yang
Abstract:
We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising ``deep thinking'' through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids na\"ive step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8\% to 90.0\%, surpassing o1-preview by +4.5\%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3\% (8/15) of problems, ranking among the top 20\% of the brightest high school math students.
Paperid:3053
Authors:Roman Plaud · Alexandre Perez-Lebel · Matthieu Labeau · Antoine Saillenfest · Thomas Bonald
Abstract:
Hierarchical classification offers an approach to incorporate the concept of mistake severity by leveraging a structured, labeled hierarchy. However, decoding in such settings frequently relies on heuristic decision rules, which may not align with task-specific evaluation metrics. In this work, we propose a framework for the optimal decoding of an output probability distribution with respect to a target metric. We derive optimal decision rules for increasingly complex prediction settings, providing universal algorithms when candidates are limited to the set of nodes. In the most general case of predicting a *subset of nodes*, we focus on rules dedicated to the hierarchical $\mathrm{hF}_{\beta}$ scores, tailored to hierarchical settings. To demonstrate the practical utility of our approach, we conduct extensive empirical evaluations, showcasing the superiority of our proposed optimal strategies, particularly in underdetermined scenarios. These results highlight the potential of our methods to enhance the performance and reliability of hierarchical classifiers in real-world applications.
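The hierarchical $\mathrm{hF}_{\beta}$ score in question is conventionally computed on ancestor-augmented node sets. Below is a minimal sketch of that standard definition (the paper's contribution is the optimal decoding rules for this metric, not the formula itself):

```python
def h_fbeta(pred_aug: set, true_aug: set, beta: float = 1.0) -> float:
    """Hierarchical F_beta: precision and recall are computed on the
    predicted and true label sets augmented with all their ancestors
    in the hierarchy, so mistakes near the true label cost less."""
    inter = len(pred_aug & true_aug)
    if inter == 0:
        return 0.0
    hp = inter / len(pred_aug)   # hierarchical precision
    hr = inter / len(true_aug)   # hierarchical recall
    b2 = beta * beta
    return (1 + b2) * hp * hr / (b2 * hp + hr)
```

Because both sets include ancestors, predicting a sibling of the true leaf still shares the common ancestors and scores above zero, which is how the metric encodes mistake severity.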
Paperid:3054
Authors:Noam Levi
Abstract:
Neural scaling laws have garnered significant interest due to their ability to predict model performance as a function of increasing parameters, data, and compute. In this work, we propose a simple statistical ansatz based on memorization to study scaling laws in the context of inference, specifically how performance improves with multiple inference attempts. We explore the coverage, or pass@k metric, which measures the chance of success over repeated attempts and provide a motivation for the observed functional form of the inference scaling behavior of the coverage in large language models (LLMs) on reasoning tasks. We then define an "inference loss", which exhibits a power law decay as the number of trials increases, and connect this result with prompting costs. We further test the universality of our construction by conducting experiments on a simple generative model, and find that our predictions are in agreement with the empirical coverage curves in a controlled setting. Our simple framework sets the ground for incorporating inference scaling with other known scaling laws.
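The coverage metric discussed above is usually estimated with the standard unbiased pass@k estimator over $n$ sampled attempts with $c$ successes, shown here for reference rather than as the paper's construction:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts drawn without
    replacement from n total attempts (c of them correct) succeeds:
    1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so success is guaranteed
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

Plotting this estimate against $k$ yields the empirical coverage curves whose functional form the ansatz above seeks to explain.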
Paperid:3055
Authors:Xiuheng Wang · Ricardo Borsoi · Cédric Richard · Ali Sayed
Abstract:
Online distributed optimization is particularly useful for solving optimization problems with streaming data collected by multiple agents over a network. When the solutions lie on a Riemannian manifold, such problems become challenging to solve, particularly when efficiency and continuous adaptation are required. This work tackles these challenges and devises a diffusion adaptation strategy for decentralized optimization over general manifolds. A theoretical analysis shows that the proposed algorithm is able to approach network agreement after sufficient iterations, which allows a non-asymptotic convergence result to be derived. We apply the algorithm to the online decentralized principal component analysis problem and Gaussian mixture model inference. Experimental results with both synthetic and real data illustrate its performance.
Paperid:3056
Authors:Sepehr Elahi · Paula Mürmann · Patrick Thiran
Abstract:
The Susceptible-Infected-Susceptible (SIS) model is a widely used model for the spread of information and infectious diseases, particularly non-immunizing ones, on a graph. Given a highly contagious disease, a natural question is how to best vaccinate individuals to minimize the disease's extinction time. While previous works showed that the problem of optimal vaccination is closely linked to the NP-hard Spectral Radius Minimization (SRM) problem, they assumed that the graph is known, which is often not the case in practice. In this work, we consider the problem of minimizing the extinction time of an outbreak modeled by an SIS model where the graph on which the disease spreads is unknown and only the infection states of the vertices are observed. To this end, we split the problem into two: learning the graph and determining effective vaccination strategies. We propose a novel inclusion-exclusion-based learning algorithm and, unlike previous approaches, establish its sample complexity for graph recovery. We then detail an optimal algorithm for the SRM problem and prove that its running time is polynomial in the number of vertices for graphs with bounded treewidth. This is complemented by an efficient and effective polynomial-time greedy heuristic for any graph. Finally, we present experiments that numerically validate our learning and vaccination algorithms.
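Since the SIS epidemic threshold scales with the spectral radius of the adjacency matrix, one simple greedy heuristic removes (vaccinates) the vertex whose deletion shrinks the radius most. The sketch below only illustrates this idea; the specifics of the paper's optimal SRM algorithm and its heuristic are not reproduced here:

```python
import numpy as np

def spectral_radius(adj: np.ndarray) -> float:
    """Largest eigenvalue magnitude of the adjacency matrix."""
    return float(max(abs(np.linalg.eigvals(adj))))

def greedy_vaccinate(adj: np.ndarray, budget: int):
    """Greedily remove `budget` vertices, each time picking the one whose
    removal yields the smallest spectral radius of the remaining graph."""
    A = adj.astype(float).copy()
    removed = []
    for _ in range(budget):
        best_v, best_r = None, None
        for v in range(A.shape[0]):
            if v in removed:
                continue
            B = A.copy()
            B[v, :] = 0.0  # vaccinating v disconnects all its edges
            B[:, v] = 0.0
            r = spectral_radius(B)
            if best_r is None or r < best_r:
                best_v, best_r = v, r
        removed.append(best_v)
        A[best_v, :] = 0.0
        A[:, best_v] = 0.0
    return removed, spectral_radius(A)
```

On a star graph, for example, the heuristic correctly vaccinates the hub, collapsing the spectral radius to zero with a single removal.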
Paperid:3057
Authors:Arnob Ghosh · Mehrdad Moharrami
Abstract:
We consider the setup where the agent seeks to maximize the overall expected reward subject to the constraint that the entropic risk associated with the total utility is above a certain threshold. First, we observe that the traditional primal-dual-based approach might not work to obtain the regret and violation bound, unlike the risk-neutral setup, as the value iteration-based approach with respect to the combined state-action value function does not provide the optimal solution. We thus consider an augmented CMDP where we augment the state space with a budget. We then optimize the initial budget with appropriate utility functions to obtain the entropic risk measure for the utility. We then propose a primal-dual-based algorithm for the augmented problem. Compared to the standard primal-dual-based approach for the risk-neutral CMDP setting, we also introduce a regularized dual update, as strong duality does not hold for risk-sensitive constraints unlike the risk-neutral constraint. We show that our proposed algorithm achieves $\tilde{\mathcal{O}}(\sqrt{|S|^2AH^4K})$ regret and $\tilde{\mathcal{O}}\left(\dfrac{\exp(|\alpha|H)-1}{|\alpha|H}\sqrt{|S|^2AH^4}K^{3/4}\right)$ violation where $|S|$ is the cardinality of the state-space, $H$ is the length of each episode, $K$ is the number of episodes, $\alpha<0$ is the risk-aversion parameter. *To the best of our knowledge this is the first sublinear regret and violation result for the risk-sensitive CMDP problem.*
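For reference, the entropic risk of a utility $U$ at risk parameter $\alpha$ is $\frac{1}{\alpha}\log \mathbb{E}\left[e^{\alpha U}\right]$; for $\alpha < 0$ it penalizes variability and lower-bounds the mean. A sample-based sketch of this definition (illustration only, not the paper's augmented-CMDP algorithm):

```python
import math

def entropic_risk(utilities, alpha: float) -> float:
    """(1/alpha) * log of the empirical mean of exp(alpha * U).
    With alpha < 0 (risk-averse), spread in the utilities pulls the
    value below the plain sample mean; for a constant U it equals U."""
    n = len(utilities)
    return math.log(sum(math.exp(alpha * u) for u in utilities) / n) / alpha
```

The exponential dependence on $\alpha H$ in the violation bound above reflects the same $e^{\alpha U}$ term appearing inside the risk measure.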
Paperid:3058
Authors:Ke Kaiqiang · qian lin · Zongkai Liu · Shenghong He · Chao Yu
Abstract:
Offline goal-conditioned reinforcement learning (GCRL) learns a goal-conditioned value function to train policies for diverse goals with pre-collected datasets. Hindsight experience replay addresses the issue of sparse rewards by treating intermediate states as goals but fails to complete goal-stitching tasks where achieving goals requires stitching different trajectories. While cross-trajectory sampling is a potential solution that associates states and goals belonging to different trajectories, we demonstrate that this direct method degrades performance in goal-conditioned tasks due to the overestimation of values on unconnected pairs. To this end, we propose Conservative Goal-Conditioned Implicit Value Learning (CGCIVL), a novel algorithm that introduces a penalty term to penalize value estimation for unconnected state-goal pairs and leverages the quasimetric framework to accurately estimate values for connected pairs. Evaluations on OGBench, a benchmark for offline GCRL, demonstrate that CGCIVL consistently surpasses state-of-the-art methods across diverse tasks.
Paperid:3059
Authors:ji deng · Zhao Li · Ji Zhang · Jun Gao
Abstract:
Macro placement, which involves optimizing the positions of modules, is a critical phase in modern integrated circuit design and significantly influences chip performance. The growing complexity of integrated circuits demands increasingly sophisticated placement solutions. Existing approaches have evolved along two primary paths (i.e., constructive and adjustment methods), but they face significant practical limitations that affect real-world chip design. Recent hybrid frameworks such as WireMask-EA have attempted to combine these strategies, but significant technical barriers still remain, including the computational overhead from separated layout adjustment and reconstruction that often require complete layout rebuilding, the inefficient exploration of design spaces due to random mutation operations, and the computational complexity of mask-based construction methods that limit scalability. To overcome these limitations, we introduce EGPlace, a novel evolutionary optimization framework that combines guided mutation strategies with efficient layout reconstruction. EGPlace introduces two key innovations: a greedy repositioning-guided mutation operator that systematically identifies and optimizes critical layout regions, and an efficient mask computation algorithm that accelerates layout evaluation. Our extensive evaluation using ISPD2005 and Ariane RISC-V CPU benchmarks demonstrates that EGPlace reduces wirelength by 10.8\% and 9.3\% compared to WireMask-EA and the state-of-the-art reinforcement learning-based constructive method EfficientPlace, respectively, while achieving speedups of 7.8$\times$ and 2.8$\times$ over these methods.
Paperid:3060
Authors:Sen Peng · Mingyue Wang · Jianfei He · Jijia Yang · Xiaohua Jia
Abstract:
Latent diffusion models have recently demonstrated superior capabilities in many downstream image synthesis tasks. However, customization of latent diffusion models using unauthorized data can severely compromise the privacy and intellectual property rights of data owners. Adversarial examples as protective perturbations have been developed to defend against unauthorized data usage by introducing imperceptible noise to customization samples, preventing diffusion models from effectively learning them. In this paper, we first reveal that the primary reason adversarial examples are effective as protective perturbations in latent diffusion models is the distortion of their latent representations, as demonstrated through qualitative and quantitative experiments. We then propose Contrastive Adversarial Training (CAT) utilizing adapters as an adaptive attack against these protection methods, highlighting their lack of robustness. Extensive experiments demonstrate that our CAT method significantly reduces the effectiveness of protective perturbations in customization configurations, urging the community to reconsider and enhance the robustness of existing protective perturbation methods.
Paperid:3061
Authors:Ce Zhang · Yixin Han · Yafei Wang · Xiaodong Yan · Linglong Kong · Ting Li · Bei Jiang
Abstract:
Randomized response (RR) mechanisms constitute a fundamental and effective technique for ensuring label differential privacy (LabelDP). However, existing RR methods primarily focus on the response labels while overlooking the influence of covariates and often do not fully address optimality. To address these challenges, this paper explores optimal LabelDP procedures using RR mechanisms, focusing on achieving optimal estimation and inference in binary response models. We first analyze the asymptotic behaviors of RR binary response models and then optimize the procedure by maximizing the trace of the Fisher Information Matrix within the $\varepsilon$- and $(\varepsilon,\delta)$-LabelDP constraints. Our theoretical results indicate that the proposed methods achieve optimal LabelDP guarantees while maintaining statistical accuracy in binary response models under mild conditions. Furthermore, we develop private confidence intervals with nominal coverage for statistical inference. Extensive simulation studies and real-world applications confirm that our methods outperform existing approaches in terms of precise estimation, privacy protection, and reliable inference.
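The classical binary randomized response mechanism underlying LabelDP keeps the true label with probability $e^{\varepsilon}/(1+e^{\varepsilon})$ and flips it otherwise, and downstream estimates can be debiased accordingly. A standard textbook sketch (the paper's covariate-aware, Fisher-information-optimal design goes beyond this):

```python
import math
import random

def rr_label(y: int, epsilon: float, rng: random.Random) -> int:
    """epsilon-LabelDP randomized response for a binary label:
    keep y with probability e^eps / (1 + e^eps), flip otherwise."""
    p_keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return y if rng.random() < p_keep else 1 - y

def debias_mean(noisy_mean: float, epsilon: float) -> float:
    """Invert E[noisy] = (2p - 1) * m + (1 - p) to recover the
    true label mean m from the mean of the randomized labels."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return (noisy_mean - (1.0 - p)) / (2.0 * p - 1.0)
```

Smaller $\varepsilon$ pushes the keep probability toward $1/2$, giving stronger privacy at the cost of a larger debiasing variance.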
Paperid:3062
Authors:Hong-Ming Chiu · Hao Chen · Huan Zhang · Richard Zhang
Abstract:
Neural network verifiers based on linear bound propagation scale impressively to massive models but can be surprisingly loose when neuron coupling is crucial. Conversely, semidefinite programming (SDP) verifiers capture inter-neuron coupling naturally, but their cubic complexity restricts them to only small models. In this paper, we propose SDP-CROWN, a novel hybrid verification framework that combines the tightness of SDP relaxations with the scalability of bound-propagation verifiers. At the core of SDP-CROWN is a new linear bound---derived via SDP principles---that explicitly captures $\ell_{2}$-norm-based inter-neuron coupling while adding only one extra parameter per layer. This bound can be integrated seamlessly into any linear bound-propagation pipeline, preserving the inherent scalability of such methods yet significantly improving tightness. In theory, we prove that our inter-neuron bound can be up to a factor of $\sqrt{n}$ tighter than traditional per-neuron bounds. In practice, when incorporated into the state-of-the-art $\alpha$-CROWN verifier, we observe markedly improved verification performance on large models with up to 65 thousand neurons and 2.47 million parameters, achieving tightness that approaches that of costly SDP-based methods.
Paperid:3063
Authors:Wenjie Wu · Dexuan Huo · Hong Chen
Abstract:
Spiking Neural Networks (SNNs) have demonstrated remarkable potential across many domains, including computer vision and natural language processing, owing to their energy efficiency and biological plausibility. However, their application in long-term prediction tasks remains underexplored, which is primarily due to two critical challenges: (1) current SNN encoding methods are unable to effectively encode long temporal information, leading to increased computational complexity and energy consumption; (2) though Transformer-based models have achieved state-of-the-art accuracy in temporal prediction tasks, the absence of proper positional encoding for spiking self-attention restricts Spiking Transformer from effectively utilizing positional information, resulting in performance degradation. To address these challenges, we introduce an attention-free framework, **Spik**ing **F**ourier Network (**SpikF**), that encodes input sequences in patches and employs an innovative frequency domain selection mechanism to effectively utilize the sequential properties of time-series data. Extensive evaluations on eight well-established long-term prediction datasets demonstrate that SpikF achieves an average $1.9\%$ reduction in Mean Absolute Error (MAE) compared to state-of-the-art models, while lowering energy consumption by $75.05\%$.
Paperid:3064
Authors:Thomas Pouplin · Katarzyna Kobalczyk · Hao Sun · M van der Schaar
Abstract:
Developing autonomous agents capable of performing complex, multi-step decision-making tasks specified in natural language remains a significant challenge, particularly in realistic settings where labeled data is scarce and real-time experimentation is impractical. Existing reinforcement learning (RL) approaches often struggle to generalize to unseen goals and states, limiting their applicability. In this paper, we introduce $\textit{TEDUO}$, a novel training pipeline for offline language-conditioned policy learning in symbolic environments. Unlike conventional methods, $\textit{TEDUO}$ operates on readily available, unlabeled datasets and addresses the challenge of generalization to previously unseen goals and states. Our approach harnesses large language models (LLMs) in a dual capacity: first, as automatization tools augmenting offline datasets with richer annotations, and second, as generalizable instruction-following agents. Empirical results demonstrate that $\textit{TEDUO}$ achieves data-efficient learning of robust language-conditioned policies, accomplishing tasks beyond the reach of conventional RL frameworks or out-of-the-box LLMs alone.
Paperid:3065
Authors:Mikael Møller Høgsgaard · Andrea Paudice
Abstract:
The Median of Means (MoM) is a mean estimator that has gained popularity in the context of heavy-tailed data. In this work, we analyze its performance in the task of simultaneously estimating the mean of each function in a class $\mathcal{F}$ when the data distribution possesses only the first $p$ moments for $p \in (1,2]$. We prove a new sample complexity bound using a novel symmetrization technique that may be of independent interest. Additionally, we present applications of our result to $k$-means clustering with unbounded inputs and linear regression with general losses, improving upon existing works.
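For reference, the basic MoM estimator analyzed in this line of work splits the sample into $k$ equal-size blocks and returns the median of the block means; a minimal sketch (the paper's uniform-over-$\mathcal{F}$ analysis is far beyond this):

```python
import statistics

def median_of_means(xs, k: int) -> float:
    """Median-of-Means: partition the sample into k equal-size blocks
    (dropping any remainder), average each block, and return the median
    of the block means. Robust to heavy tails and a few outliers."""
    m = len(xs) // k  # block size
    block_means = [sum(xs[i * m:(i + 1) * m]) / m for i in range(k)]
    return statistics.median(block_means)
```

Unlike the empirical mean, a single extreme observation corrupts at most one block, so it moves the median of block means very little.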
Paperid:3066
Authors:Zheyang Xiong · Jack Cai · John Cooper · Albert Ge · Vasilis Papageorgiou · Zack Sifakis · Angeliki Giannou · Ziqian Lin · Liu Yang · Saurabh Agarwal · Grigorios Chrysos · Samet Oymak · Kangwook Lee · Dimitris Papailiopoulos
Abstract:
Large Language Models (LLMs) have demonstrated remarkable in-context learning (ICL) capabilities. In this study, we explore a surprising phenomenon related to ICL: LLMs can perform multiple, computationally distinct ICL tasks simultaneously, during a single inference call, a capability we term "task superposition". We provide empirical evidence of this phenomenon across various LLM families and scales and show that this phenomenon emerges even if we train the model to in-context learn one task at a time. We offer theoretical explanations showing that this capability is well within the expressive power of transformers. We also explore how LLMs internally compose task vectors during superposition. Furthermore, we show that larger models can solve more ICL tasks in parallel, and better calibrate their output distribution. Our findings offer insights into the latent capabilities of LLMs, further substantiate the perspective of "LLMs as superposition of simulators", and raise questions about the mechanisms enabling simultaneous task execution.
Paperid:3067
Authors:Harit Vishwakarma · Alan Mishler · Thomas Cook · Niccolo Dalmasso · Natraj Raman · Sumitra Ganesh
Abstract:
Large language models (LLMs) are empowering decision-making in several applications, including tool or API usage and answering multiple-choice questions (MCQs). However, incorrect outputs pose significant risks in high-stakes domains like healthcare and finance. To quantify LLM uncertainty and thereby mitigate these risks, recent works employ conformal prediction (CP), a model- and distribution-agnostic framework that uses LLM outputs to generate a \emph{prediction set} containing the true answer with high probability. Leveraging CP, we propose \emph{conformal revision of questions} (CROQ), which revises the question by narrowing down the available choices to those in the prediction set and asking the LLM the revised question. We expect LLMs to be more accurate on revised questions with fewer choices. Furthermore, we expect CROQ to be effective when the prediction sets from CP are small. Commonly used logit scores often lead to large sets, diminishing CROQ's effectiveness. To overcome this, we propose CP-OPT, an optimization framework to learn scores that minimize set sizes while maintaining coverage. Our extensive experiments on MMLU, ToolAlpaca, and TruthfulQA datasets with multiple LLMs show that CROQ improves accuracy over the standard inference, with more pronounced gains when paired with CP-OPT.
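Prediction sets of the kind CROQ consumes can be built with standard split conformal calibration; a minimal sketch using the common nonconformity score $1 - p(\text{true label})$ (an assumption for illustration; the paper's learned CP-OPT scores differ):

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha: float) -> float:
    """Split conformal calibration: score each calibration example by
    1 - p(true label), then take the ceil((n+1)(1-alpha))-th smallest
    score as the threshold, guaranteeing ~(1-alpha) marginal coverage."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = min(n - 1, math.ceil((n + 1) * (1.0 - alpha)) - 1)
    return scores[k]

def prediction_set(probs, qhat: float):
    """All labels whose nonconformity score falls within the threshold."""
    return [y for y, p in enumerate(probs) if 1.0 - p <= qhat]
```

CROQ would then re-ask the MCQ restricted to the labels returned by `prediction_set`.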
Paperid:3068
Authors:Kaiwen Tang · Zhanglu Yan · Weng-Fai Wong
Abstract:
For reasons such as privacy, there are use cases for language models at the edge. This has given rise to small language models targeted for deployment in resource-constrained devices where energy efficiency is critical. Spiking neural networks (SNNs) offer a promising solution due to their energy efficiency, and there are already works on realizing transformer-based models on SNNs. However, key operations like softmax and layer normalization (LN) are difficult to implement on neuromorphic hardware, and many of these early works sidestepped them. To address these challenges, we introduce Sorbet, a transformer-based spiking language model that is more neuromorphic hardware-compatible. Sorbet incorporates a novel shifting-based softmax called PTsoftmax and a BitShifting-based PowerNorm (BSPN), both designed to replace the respective energy-intensive operations. By leveraging knowledge distillation and model quantization, Sorbet achieved a highly compressed binary weight model that maintains competitive performance while achieving $27.16\times$ energy savings compared to BERT. We validate Sorbet through extensive testing on the GLUE benchmark and a series of ablation studies, demonstrating its potential as an energy-efficient solution for language model inference.
Paperid:3069
Authors:Ruibo Chen · Yihan Wu · Junfeng Guo · Heng Huang
Abstract:
Watermarking techniques offer a promising way to identify machine-generated content by embedding covert information into content generated by language models (LMs). However, the robustness of such watermarking schemes has not been well explored. In this paper, we present De-mark, an advanced framework designed to remove n-gram-based watermarks effectively. Our method utilizes a novel querying strategy, termed random selection probing, which aids in assessing the strength of the watermark and identifying the red-green list within the n-gram watermark. Experiments on popular LMs, such as Llama3 and ChatGPT, demonstrate the efficiency and effectiveness of De-mark in watermark removal and exploitation tasks.
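The red-green list that De-mark probes is, in published n-gram watermark schemes, typically a hash-seeded pseudorandom partition of the vocabulary; a minimal illustrative sketch (an assumed generic construction, not the exact scheme attacked in the paper):

```python
import hashlib
import random

def green_list(prefix_ngram, vocab_size: int, gamma: float = 0.5):
    """Hash the preceding n-gram to seed a PRNG, then mark a fixed gamma
    fraction of token ids as 'green'. Watermarked generation over-samples
    green tokens; detection counts the green fraction in a text."""
    digest = hashlib.sha256(repr(prefix_ngram).encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])
```

Because the partition is a deterministic function of the preceding n-gram, repeated queries with controlled prefixes can expose it, which is the attack surface De-mark exploits.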
Paperid:3070
Authors:Sophia Hager · Aleem Khan · Andrew Wang · Nicholas Andrews
Abstract:
Most successful applications of deep learning involve similar training and test conditions. However, tasks such as biological sequence design involve searching for sequences that improve desirable properties beyond previously known values, which requires novel hypotheses that extrapolate beyond training data. In these settings, extrapolation may be achieved by using random search methods such as Markov chain Monte Carlo (MCMC), which, given an initial state, sample local transformations based on a proposal distribution. However, even with a well-designed proposal, MCMC may struggle to explore structured state spaces efficiently. Rather than relying on stochastic search, it would be desirable to have a model that greedily optimizes the properties of interest, successfully extrapolating in as few steps as possible. We propose to learn such a model from the Markov chains resulting from MCMC search. Specifically, our approach uses selected states from Markov chains as a source of training data for an autoregressive inference network, which is then able to efficiently generate novel sequences at test time that extrapolate along the sequence-level properties of interest. The proposed approach is validated on three problems: protein sequence design, text sentiment control, and text anonymization. We find that the learned inference network confers many of the same (and sometimes better) generalization benefits compared to the slow sampling process, but with the additional benefit of high sample efficiency.
Paperid:3071
Authors:Rui Min · Tianyu Pang · Chao Du · Qian Liu · Minhao Cheng · Min Lin
Abstract:
Chatbot Arena is an open platform for evaluating LLMs by pairwise battles, in which users vote for their preferred response from two randomly sampled anonymous models. While Chatbot Arena is widely regarded as a reliable LLM ranking leaderboard, we show that crowdsourced voting can be *rigged* to improve (or decrease) the ranking of a target model $m\_{t}$. We first introduce a straightforward **target-only rigging** strategy that focuses on new battles involving $m\_{t}$, identifying it via watermarking or a binary classifier, and exclusively voting for $m\_{t}$ wins. However, this strategy is practically inefficient because there are over $190$ models on Chatbot Arena and on average only about 1% of new battles will involve $m\_{t}$. To overcome this, we propose an **omnipresent rigging** strategy, which exploits the fact that, under the Elo rating mechanism of Chatbot Arena, any new vote on a battle can influence the ranking of the target model $m\_{t}$, even if $m\_{t}$ is not directly involved in the battle. We conduct experiments on around *1.7 million* historical votes from the Chatbot Arena Notebook, showing that the omnipresent rigging strategy can improve model rankings by rigging only *hundreds of* new votes. While we have evaluated several defense mechanisms, our findings highlight the importance of continued efforts to prevent vote rigging.
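The leverage behind omnipresent rigging comes from the rating update itself: every vote moves both participants' ratings, which can reorder the whole leaderboard. A standard logistic Elo update (illustrative textbook form with an assumed K-factor; Chatbot Arena's production computation may differ) looks like:

```python
def elo_update(r_winner: float, r_loser: float, k: float = 4.0):
    """One Elo update: expected win probability is logistic in the
    rating gap (400-point scale), and each vote shifts both ratings
    by at most k in opposite directions."""
    expected = 1.0 / (1.0 + 10.0 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected)
    return r_winner + delta, r_loser - delta
```

A vote between two non-target models still perturbs their ratings, and hence their ordering relative to the target, which is what the omnipresent strategy systematically exploits.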
Paperid:3072
Authors:Yongjae Shin · Jeonghye Kim · Whiyoung Jung · Sunghoon Hong · Deunsol Yoon · Youngsoo Jang · Geon-Hyeong Kim · Jongseong Chae · Youngchul Sung · Kanghoon Lee · Woohyung Lim
Abstract:
Reinforcement Learning (RL) has achieved notable success in tasks requiring complex decision-making. Offline RL focuses on training agents using fixed datasets, thereby avoiding the risks and costs that stem from online interactions. However, offline RL is inherently limited by the dataset's quality, which can restrict an agent’s performance. Offline-to-online RL aims to bridge the gap between the cost-efficiency of offline RL and the performance potential of online RL by pre-training an agent offline before fine-tuning it through online interactions. Despite its promise, recent studies show that offline pre-trained agents often underperform during online fine-tuning due to inaccurate value functions, with random initialization proving more effective in certain cases. In this work, we propose a novel method, Online Pre-Training for Offline-to-Online RL (OPT), to address the issue of inaccurate value estimation in offline pre-trained agents. OPT introduces a new learning phase, Online Pre-Training, which allows the training of a new value function that enhances the subsequent fine-tuning process. Implementation of OPT on TD3 and SPOT demonstrates an average 30\% improvement in performance across D4RL environments, such as MuJoCo, Antmaze, and Adroit.
Paperid:3073
Authors:Bardh Prenkaj · Efstratios Zaradoukas · Gjergji Kasneci
Abstract:
Counterfactual explainability seeks to uncover model decisions by identifying minimal changes to the input that alter the predicted outcome. This task becomes particularly challenging for graph data due to the need to preserve structural integrity and semantic meaning. Unlike prior approaches that rely on forward perturbation mechanisms, we introduce Graph Inverse Style Transfer (GIST), the first framework to reimagine graph counterfactual generation as a backtracking process, leveraging spectral style transfer. By aligning the global structure with the original input spectrum and preserving local content faithfulness, GIST produces valid counterfactuals as interpolations between the input style and counterfactual content. Tested on 8 binary and multi-class graph classification benchmarks, GIST achieves a remarkable +7.6% improvement in the validity of produced counterfactuals and significant gains (+45.5%) in faithfully explaining the true class distribution. Additionally, GIST's backtracking mechanism effectively mitigates overshooting the underlying predictor's decision boundary, minimizing the spectral differences between the input and the counterfactuals. These results challenge traditional forward perturbation methods, offering a novel perspective that advances graph explainability.
Paperid:3074
Authors:Nathaniel Lahn · Sharath Raghvendra · Emma Saarinen · Pouyan Shirzadian
Abstract:
Given a distance function $\mathrm{d}(\cdot,\cdot)$, the $p$-Wasserstein distance is the $p$th root of the cost of the cheapest transport plan, where the cost of transporting a unit mass between points $a$ and $b$ is given by $\mathrm{d}(a,b)^p$. Despite its strong theoretical properties, $p$-Wasserstein distance, for $p \ge 2$, has two drawbacks that limit its practical use: It is very sensitive to noise and lacks highly scalable algorithms. In this paper, we present new algorithms for the $p$-Wasserstein distance. First, when $\mathrm{d}(\cdot,\cdot)$ is a metric, for any fixed $p$, we present a simple and novel $O(\log n)$-approximation algorithm to compute the $p$-Wasserstein distance between any two discrete distributions with support size $n$ in $O(n^2 \log U \log \Delta \log n)$ time; here, $\log U$ is the number of bits required to represent the probability values and $\Delta$ is the ratio of the largest to the smallest edge costs, given by $\mathrm{d}(\cdot,\cdot)$. This is the first near-quadratic time relative approximation algorithm for the $p$-Wasserstein distance with an approximation factor that is sub-linear in $n$. We use $p$ hierarchically well-separated trees to define a distance that approximates the $p$-Wasserstein cost within a factor of $O(\log n)$. We then design a simple dynamic weighted bichromatic closest pair data structure with respect to the new distance and combine it with the classical primal-dual framework for optimal transport to obtain our algorithm. Second, for any arbitrary distance $\mathrm{d}(\cdot,\cdot)$, we show that a recent noise-resistant variant of the $p$-Wasserstein distance called the $p$-RPW distance can be approximated within an additive error of $\delta$ in $O(n^2/\delta^3)$ time. This is in contrast to the $p$-Wasserstein distance, where all existing combinatorial approaches require $\Omega(n^2/\delta^p)$ time.
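For intuition about the quantity being approximated: in one dimension with equal-size empirical supports, the optimal plan simply matches sorted samples, so the $p$-Wasserstein distance has a closed form (a textbook special case, not the paper's algorithm):

```python
def p_wasserstein_1d(xs, ys, p: float) -> float:
    """p-Wasserstein distance between two equal-size empirical
    distributions on the real line: the optimal transport plan matches
    sorted samples, so the distance is the l_p mean of sorted gaps."""
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    cost = sum(abs(a - b) ** p for a, b in zip(xs, ys)) / len(xs)
    return cost ** (1.0 / p)
```

The sensitivity the abstract mentions is visible here: for large $p$, a single widely separated pair of samples dominates the $\ell_p$ mean.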
Paperid:3075
Authors:Zhongtian Ma · Qiaosheng Zhang · Bocheng Zhou · Yexin Zhang · Shuyue Hu · Zhen Wang
Abstract:
Despite the growing popularity of graph attention mechanisms, their theoretical understanding remains limited. This paper aims to explore the conditions under which these mechanisms are effective in node classification tasks through the lens of Contextual Stochastic Block Models (CSBMs). Our theoretical analysis reveals that incorporating graph attention mechanisms is *not universally beneficial*. Specifically, by appropriately defining *structure noise* and *feature noise* in graphs, we show that graph attention mechanisms can enhance classification performance when structure noise exceeds feature noise. Conversely, when feature noise predominates, simpler graph convolution operations are more effective. Furthermore, we examine the over-smoothing phenomenon and show that, in the high signal-to-noise ratio (SNR) regime, graph convolutional networks suffer from over-smoothing, whereas graph attention mechanisms can effectively resolve this issue. Building on these insights, we propose a novel multi-layer Graph Attention Network (GAT) architecture that significantly outperforms single-layer GATs in achieving *perfect node classification* in CSBMs, relaxing the SNR requirement from $\omega(\sqrt{\log n})$ to $\omega(\sqrt{\log n} / \sqrt[3]{n})$. To our knowledge, this is the first study to delineate the conditions for perfect node classification using multi-layer GATs. Our theoretical contributions are corroborated by extensive experiments on both synthetic and real-world datasets, highlighting the practical implications of our findings.
Paperid:3076
Authors:Xize Liang · Chao Chen · Shuang Qiu · Jie Wang · Yue Wu · Zhihang Fu · Hanzhu Chen · Feng Wu · Jieping Ye
Abstract:
The prevalent noise in the preference data unavoidably poses significant challenges to the preference alignment of large language models (LLMs). Existing efforts for this problem either marginally alleviate the impact of noise without noise reduction, or rely on external LLMs that incur substantial computational costs. To address these challenges, we propose RObust Preference Optimization (ROPO), an iterative alignment approach that integrates noise-tolerance and noise filtering without the aid of external models. Specifically, ROPO first formulates the training process with adaptive noise reduction as an optimization problem, which can be efficiently solved in an iterative paradigm. Then, to equip this solving process with noise-tolerance and noise-identification capabilities, we derive a robust loss that suppresses the gradients from samples with high uncertainty. We demonstrate both empirically and theoretically that the derived loss is key to the noise-tolerance and effective filtering of noisy samples. The derived loss further inspires a robustness-guided rejection sampling technique to compensate for the potential important information in discarded queries. Extensive experiments on several widely-used datasets and model architectures demonstrate that ROPO significantly outperforms all baselines under four practical noise settings and the random symmetric noise, with its advantage increasing as the noise rate increases.
Paperid:3077
Authors:Yun Qu · Cheems Wang · Yixiu Mao · Yiqin Lv · Xiangyang Ji
Abstract:
Task-robust adaptation is a longstanding pursuit in sequential decision-making. Some risk-averse strategies, e.g., the conditional value-at-risk principle, are incorporated in domain randomization or meta reinforcement learning to prioritize difficult tasks in optimization, which demand costly intensive evaluations. The efficiency issue prompts the development of robust active task sampling to train adaptive policies, where risk-predictive models can surrogate policy evaluation. This work characterizes robust active task sampling as a secret Markov decision process, posits theoretical and practical insights, and constitutes robustness concepts in risk-averse scenarios. Importantly, we propose an easy-to-implement method, referred to as Posterior and Diversity Synergized Task Sampling (PDTS), to accommodate fast and robust sequential decision-making. Extensive experiments show that PDTS unlocks the potential of robust active task sampling, significantly improves the zero-shot and few-shot adaptation robustness in challenging tasks, and even accelerates the learning process under certain scenarios.
Paperid:3078
Authors:Qilong Wu · Yiyang Shao · Jun Wang · Xiaobo Sun
Abstract:
Leveraging high-quality joint representations from multimodal data can greatly enhance model performance in various machine-learning based applications. Recent multimodal learning methods, based on the multimodal information bottleneck (MIB) principle, aim to generate optimal MIB with maximal task-relevant information and minimal superfluous information via regularization. However, these methods often set regularization weights in an \textit{ad hoc} manner and overlook imbalanced task-relevant information across modalities, limiting their ability to achieve optimal MIB. To address this gap, we propose a novel multimodal learning framework, Optimal Multimodal Information Bottleneck (OMIB), whose optimization objective guarantees the achievability of optimal MIB by setting the regularization weight within a theoretically derived bound. OMIB further addresses imbalanced task-relevant information by dynamically adjusting regularization weights per modality, ensuring the inclusion of all task-relevant information. Moreover, we establish a solid information-theoretical foundation for OMIB's optimization and implement it under the variational approximation framework for computational efficiency. Finally, we empirically validate the OMIB’s theoretical properties on synthetic data and demonstrate its superiority over the state-of-the-art benchmark methods in various downstream tasks. The codes are available in Supplementary Material.
Paperid:3079
Authors:Perampalli Shravan Nayak · Xiangru Jian · Kevin Qinghong Lin · Juan A. Rodriguez · Montek Kalsi · Nicolas Chapados · M. Özsu · Aishwarya Agrawal · David Vazquez · Christopher Pal · Perouz Taslakian · Spandana Gella · Sai Rajeswar Mudumba
Abstract:
Developing autonomous agents that can navigate diverse Graphical User Interfaces (GUIs) and solve complex tasks is essential for understanding and accelerating human workflows. Currently, most top-performing models and benchmarks primarily focus on either web or mobile platforms, while desktop platforms -- equally prevalent in user interfaces -- are often overlooked due to licensing issues and the high cost of data collection. The complexity and variability of desktop GUIs present a major challenge in automation, making high-quality datasets an important step toward addressing these capabilities. In this work, we introduce UI-Vision, a comprehensive desktop-centric and license-permissive GUI understanding benchmark designed to tackle these challenges. UI-Vision features: (i) high-quality action trajectories recorded through human demonstrations and refined via expert human annotators; (ii) a wide range of annotations, including bounding boxes, action trajectories, and layout information; and (iii) significant diversity, spanning 83 software applications. Our evaluation reveals significant limitations in the ability of current models to effectively handle desktop environments, highlighting the considerable progress needed to achieve truly autonomous GUI agents capable of generalizing from visual cues. To support the broader research community, all our data and resources will be made fully open-source.
Paperid:3080
Authors:Qinglong Liu · Cong Xu · Wenhao Jiang · Kaixuan Wang · Lin Ma · Haifeng Li
Abstract:
Real-world time series inherently exhibit significant non-stationarity, posing substantial challenges for forecasting. To address this issue, this paper proposes a novel prediction framework, TimeStacker, designed to overcome the limitations of existing models in capturing the characteristics of non-stationary signals. By employing a unique stacking mechanism, TimeStacker effectively captures global signal features while thoroughly exploring local details. Furthermore, the framework integrates a frequency-based self-attention module, significantly enhancing its feature modeling capabilities. Experimental results demonstrate that TimeStacker achieves outstanding performance across multiple real-world datasets, including those from the energy, finance, and weather domains. It not only delivers superior predictive accuracy but also exhibits remarkable advantages with fewer parameters and higher computational efficiency.
Paperid:3081
Authors:Jingrong Xie · Han Ji · Yanan Sun
Abstract:
Automated architecture design methods, especially neural architecture search, have attracted increasing attention. However, these methods naturally need to evaluate numerous candidate architectures during the search process, making them computationally expensive and time-consuming. In this paper, we propose a prior knowledge guided neural architecture generation method to generate high-performance architectures without any search and evaluation process. Specifically, in order to identify valuable prior knowledge for architecture generation, we first quantify the contribution of each component within an architecture to its overall performance. Subsequently, a diffusion model guided by prior knowledge is presented, which can easily generate high-performance architectures for different computation tasks. Extensive experiments on new search spaces demonstrate that our method achieves superior accuracy over state-of-the-art methods. For example, we only need $0.004$ GPU Days to generate architecture with $76.1\%$ top-1 accuracy on ImageNet and $97.56\%$ on CIFAR-10. Furthermore, we can find competitive architecture for more unseen search spaces, such as TransNAS-Bench-101 and NATS-Bench, which demonstrates the broad applicability of the proposed method.
Paperid:3082
Authors:Anders Aamand · Justin Chen · Siddharth Gollapudi · Sandeep Silwal · Hao WU
Abstract:
We design improved approximation algorithms for NP-hard graph problems by incorporating predictions (e.g., learned from past data). Our prediction model builds upon and extends the $\varepsilon$-prediction framework by Cohen-Addad, d'Orsi, Gupta, Lee, and Panigrahi (NeurIPS 2024). We consider an edge-based version of this model, where each edge provides two bits of information, corresponding to predictions about whether each of its endpoints belong to an optimal solution. Even with weak predictions where each bit is only $\varepsilon$-correlated with the true solution, this information allows us to break approximation barriers in the standard setting. We develop algorithms with improved approximation ratios for MaxCut, Vertex Cover, Set Cover, and Maximum Independent Set problems (among others). Across these problems, our algorithms share a unifying theme, where we separately satisfy constraints related to high degree vertices (using predictions) and low-degree vertices (without using predictions) and carefully combine the answers.
Paperid:3083
Authors:Marvin F. da Silva · Felix Dangel · Sageev Oore
Abstract:
The concept of sharpness has been successfully applied to traditional architectures like MLPs and CNNs to predict their generalization. For transformers, however, recent work reported weak correlation between flatness and generalization. We argue that existing sharpness measures fail for transformers, because they have much richer symmetries in their attention mechanism that induce directions in parameter space along which the network or its loss remain identical. We posit that sharpness must account fully for these symmetries, and thus we redefine it on a quotient manifold that results from quotienting out the transformer symmetries, thereby removing their ambiguities. Leveraging tools from Riemannian geometry, we propose a fully general notion of sharpness, in terms of a geodesic ball on the symmetry-corrected quotient manifold. In practice, we need to resort to approximating the geodesics. Doing so up to first order yields existing adaptive sharpness measures, and we demonstrate that including higher-order terms is crucial to recover correlation with generalization. We present results on diagonal networks with synthetic data, and show that our geodesic sharpness reveals strong correlation for real-world transformers on both text and image classification tasks.
Paperid:3084
Authors:Wenhao SUN · Rong-Cheng Tu · Jingyi Liao · Zhao Jin · Dacheng Tao
Abstract:
Diffusion Transformers (DiTs) have proven effective in generating high-quality videos but are hindered by high computational costs. Existing video diffusion sampling acceleration methods often rely on costly fine-tuning or exhibit limited generalization capabilities. We propose Asymmetric Reduction and Restoration (AsymRnR), a training-free and model-agnostic method to accelerate video DiTs. It builds on the observation that redundancies of feature tokens in DiTs vary significantly across different model blocks, denoising steps, and feature types. Our AsymRnR asymmetrically reduces redundant tokens in the attention operation, achieving acceleration with negligible degradation in output quality and, in some cases, even improving it. We also tailored a reduction schedule to distribute the reduction across components adaptively. To further accelerate this process, we introduce a matching cache for more efficient reduction. Backed by theoretical foundations and extensive experimental validation, AsymRnR integrates into state-of-the-art video DiTs and offers substantial speedup.
Paperid:3085
Authors:Jialin Zhao · Yingtao Zhang · Carlo Cannistraci
Abstract:
The rapid growth of Large Language Models has driven demand for effective model compression techniques to reduce memory and computation costs. Low-rank pruning has gained attention for its tensor coherence and GPU compatibility across all densities. However, low-rank pruning has struggled to match the performance of semi-structured pruning, often doubling perplexity (PPL) at similar densities. In this paper, we propose Pivoting Factorization (**PIFA**), a novel **lossless** meta low-rank representation that unsupervisedly learns a **compact** form of any low-rank representation, effectively eliminating redundant information. PIFA identifies pivot rows (linearly independent rows) and expresses non-pivot rows as linear combinations, achieving an additional **24.2\%** memory savings and **24.6\%** faster inference over low-rank layers at $r/d = 0.5$, thereby significantly enhancing performance at the same density. To mitigate the performance degradation caused by low-rank pruning, we introduce a novel, retraining-free low-rank reconstruction method that minimizes error accumulation (**M**). **MPIFA**, combining M and PIFA into an end-to-end framework, significantly outperforms existing low-rank pruning methods and, for the first time, achieves performance comparable to semi-structured pruning, while surpassing it in GPU efficiency and compatibility.
Paperid:3086
Authors:Haorun Cai · Han-Jia Ye
Abstract:
Deep tabular models have demonstrated remarkable success on i.i.d. data, excelling in a variety of structured data tasks. However, their performance often deteriorates under temporal distribution shifts, where trends and periodic patterns are present in the evolving data distribution over time. In this paper, we explore the underlying reasons for this failure in capturing temporal dependencies. We begin by investigating the training protocol, revealing a key issue in how the data is split for model training and validation. While existing approaches typically use temporal ordering for splitting, we show that even a random split significantly improves model performance. By reducing both training lag and validation bias to achieve better generalization, our proposed splitting protocol offers substantial improvements across a variety of methods. Furthermore, we analyze how temporal data affects deep tabular representations, uncovering that these models often fail to capture crucial periodic and trend information. To address this gap, we introduce a plug-and-play temporal embedding based on Fourier series expansion to learn and incorporate temporal patterns, offering an adaptive approach to handling temporal shifts. Our experiments demonstrate that this temporal embedding, combined with the improved splitting strategy, provides a more effective and robust framework for learning from temporal tabular data.
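A minimal version of such a Fourier-series embedding maps a timestamp to sine/cosine features of a known period (an illustration of the general technique with hypothetical names; the paper's design may differ):

```python
import numpy as np

def fourier_time_embedding(t, period, k=2):
    """Map timestamps to 2k Fourier features sin/cos(2*pi*n*t/period),
    giving a model an explicit handle on periodic structure."""
    t = np.asarray(t, dtype=float)[:, None]
    n = np.arange(1, k + 1)[None, :]          # harmonics 1..k
    ang = 2.0 * np.pi * n * t / period
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=1)

emb = fourier_time_embedding([0.0, 6.0, 12.0, 18.0], period=24.0, k=2)
assert emb.shape == (4, 4)
# The embedding repeats with the period: t=0 and t=24 coincide.
assert np.allclose(fourier_time_embedding([0.0], 24.0),
                   fourier_time_embedding([24.0], 24.0))
```

Such features can be concatenated to the tabular inputs, letting the downstream network learn periodic effects without discovering the period itself.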
Paperid:3087
Authors:Vladimir Zaigrajew · Hubert Baniecki · Przemyslaw Biecek
Abstract:
Sparse autoencoders (SAEs) are useful for detecting and steering interpretable features in neural networks, with particular potential for understanding complex multimodal representations. Given their ability to uncover interpretable features, SAEs are particularly valuable for analyzing large-scale vision-language models (e.g., CLIP and SigLIP), which are fundamental building blocks in modern systems yet remain challenging to interpret and control. However, current SAE methods are limited in optimizing both reconstruction quality and sparsity simultaneously, as they rely on either activation suppression or rigid sparsity constraints. To this end, we introduce Matryoshka SAE (MSAE), a new architecture that learns hierarchical representations at multiple granularities simultaneously, enabling direct optimization of both metrics without compromise. MSAE establishes a new state-of-the-art Pareto frontier between reconstruction quality and sparsity for CLIP, achieving 0.99 cosine similarity and less than 0.1 fraction of variance unexplained while maintaining ~80% sparsity. Finally, we demonstrate the utility of MSAE as a tool for interpreting and controlling CLIP by extracting over 120 semantic concepts from its representation to perform concept-based similarity search and bias analysis in downstream tasks like CelebA.
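The nested-granularity idea can be caricatured as summing reconstruction losses over latent prefixes, so each coarse prefix must reconstruct on its own (a toy sketch assuming a linear decoder `D`; not the authors' exact objective):

```python
import numpy as np

def matryoshka_recon_loss(x, z, D, prefixes=(2, 4, 8)):
    """Sum of reconstruction errors when decoding from nested prefixes
    of the latent code z with decoder rows D (Matryoshka-style)."""
    return sum(np.mean((x - z[:, :m] @ D[:m]) ** 2) for m in prefixes)

# With an identity decoder, the full prefix reconstructs perfectly,
# while short prefixes pay for the latent dimensions they drop.
x = np.arange(24.0).reshape(3, 8)
z, D = x.copy(), np.eye(8)
assert matryoshka_recon_loss(x, z, D, prefixes=(8,)) == 0.0
assert matryoshka_recon_loss(x, z, D, prefixes=(2, 8)) > 0.0
```

Training against all prefixes at once is what lets one model serve several sparsity/fidelity operating points.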
Paperid:3088
Authors:YUTONG WU · Jie Zhang · Yiming Li · Chao Zhang · Qing Guo · Han Qiu · Nils Lukas · Tianwei Zhang
Abstract:
Vision Language Model (VLM) agents are stateful, autonomous entities capable of perceiving and interacting with their environments through vision and language. Multi-agent systems comprise specialized agents who collaborate to solve a (complex) task. A core security property is robustness, stating that the system maintains its integrity during adversarial attacks. Multi-agent systems lack robustness, as a successful exploit against one agent can spread and infect other agents to undermine the entire system's integrity. We propose a defense, Cowpox, to provably enhance the robustness of a multi-agent system via a distributed mechanism that improves the recovery rate of agents by limiting the expected number of infections of other agents. The core idea is to generate and distribute a special cure sample that immunizes an agent against the attack before exposure. We demonstrate the effectiveness of Cowpox empirically and provide theoretical robustness guarantees.
Paperid:3089
Authors:Max Milkert · David Hyde · Forrest Laine
Abstract:
In a neural network with ReLU activations, the number of piecewise linear regions in the output can grow exponentially with depth. However, this is highly unlikely to happen when the initial parameters are sampled randomly, which therefore often leads to the use of networks that are unnecessarily large. To address this problem, we introduce a novel parameterization of the network that restricts its weights so that a depth $d$ network produces exactly $2^d$ linear regions at initialization and maintains those regions throughout training under the parameterization. This approach allows us to learn approximations of convex, one-dimensional functions that are several orders of magnitude more accurate than their randomly initialized counterparts. We further demonstrate how to extend our approach to multidimensional and non-convex functions, allowing it to replace the dense layers in other networks; preliminary improvements are shown for image classification on CIFAR-10 and ImageNet.
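The exponential-region phenomenon the abstract refers to is classically illustrated by composing a two-ReLU "tent" map, which doubles the number of linear pieces at every layer (a standard construction shown for intuition, not the paper's parameterization):

```python
import numpy as np

def tent(x):
    """Two-ReLU tent map on [0,1]: 2x for x <= 1/2, 2(1-x) for x > 1/2."""
    relu = lambda v: np.maximum(v, 0.0)
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def sawtooth(x, d):
    """Depth-d composition: a piecewise-linear map with 2^d pieces."""
    for _ in range(d):
        x = tent(x)
    return x

# Count linear pieces by counting slope changes on a dyadic grid,
# chosen so every breakpoint k/2^d lands exactly on a grid point.
d = 4
xs = np.linspace(0.0, 1.0, 4097)
slopes = np.round(np.diff(sawtooth(xs, d)) / np.diff(xs), 6)
pieces = 1 + int(np.count_nonzero(np.diff(slopes)))
assert pieces == 2 ** d
```

A randomly initialized network of the same depth almost never realizes this many regions, which is the gap the proposed parameterization closes.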
Paperid:3090
Authors:Chenlu Ye · Yujia Jin · Alekh Agarwal · Tong Zhang
Abstract:
Typical contextual bandit algorithms assume that the rewards at each round lie in some fixed range $[0, R]$, and their regret scales polynomially with this reward range $R$. However, many practical scenarios naturally involve heavy-tailed rewards or rewards where the worst-case range can be substantially larger than the variance. In this paper, we develop an algorithmic approach building on Catoni's estimator from robust statistics, and apply it to contextual bandits with general function approximation. When the variance of the reward at each round is known, we use a variance-weighted regression approach and establish a regret bound that depends only on the cumulative reward variance and logarithmically on the reward range $R$ as well as the number of rounds $T$. For the unknown-variance case, we further propose a careful peeling-based algorithm and remove the need for cumbersome variance estimation. With additional dependence on the fourth moment, our algorithm also enjoys a variance-based bound with logarithmic reward-range dependence. Moreover, we demonstrate the optimality of the leading-order term in our regret bound through a matching lower bound.
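For reference, Catoni's estimator replaces the sample mean by the root of a damped influence function, which tames heavy tails; a minimal bisection sketch (the parameter `alpha` and the toy data are illustrative choices, not the paper's settings):

```python
import math

def catoni_psi(x):
    """Catoni's influence function: linear near 0, logarithmic in the tails."""
    return math.copysign(math.log(1.0 + abs(x) + x * x / 2.0), x)

def catoni_mean(xs, alpha=0.5, tol=1e-10):
    """Root theta of sum_i psi(alpha * (x_i - theta)) = 0 via bisection.
    The sum is decreasing in theta, so the root is unique."""
    f = lambda t: sum(catoni_psi(alpha * (x - t)) for x in xs)
    lo, hi = min(xs), max(xs)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return (lo + hi) / 2.0

# One huge outlier barely moves the estimate, unlike the sample mean.
data = [1.0] * 99 + [1000.0]
est = catoni_mean(data)
assert abs(est - 1.0) < 0.5
assert abs(sum(data) / len(data) - 1.0) > 9.0
```

The logarithmic growth of `catoni_psi` is what converts a polynomial dependence on the reward range into a logarithmic one.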
Paperid:3091
Authors:Qichao Wang · Ziqiao Meng · Wenqian Cui · Yifei Zhang · Pengcheng Wu · Bingzhe Wu · Irwin King · Liang Chen · Peilin Zhao
Abstract:
Drawing inspiration from the impressive capabilities of GPT-4, the idea of equipping speech language models to engage in natural, fluid spoken interactions with human users has gained considerable interest. A range of speech language models have recently been proposed, achieving initial success in facilitating seamless interactions. However, current models have yet to fully leverage dual-channel speech data, which inherently encapsulate human conversation patterns. In this study, we conduct a systematic exploration of dual-channel speech data usage within the context of contemporary large language models (LLMs), introducing a novel generative model for dual-channel speech learning with decoder-only architectures for the first time. We assess our methods using existing benchmarks, and empirical results show that dual-channel speech learning can enhance the conversational abilities of LLMs in numerous ways. Furthermore, in comparison to existing dual-channel modeling strategies, our innovative approaches offer significantly higher training and inference efficiency.
Paperid:3092
Authors:Mark Schoene · Babak Rahmani · Heiner Kremer · Fabian Falck · Hitesh Ballani · Jannes Gladrow
Abstract:
State-space models (SSMs) and transformers dominate the language modeling landscape. However, they are constrained to a lower computational complexity than classical recurrent neural networks (RNNs), limiting their expressivity. In contrast, RNNs lack parallelization during training, raising fundamental questions about the trade-off between parallelization and expressivity. We propose implicit SSMs, which iterate a transformation until convergence to a fixed point. Theoretically, we show that implicit SSMs implement the non-linear state transitions of RNNs. Empirically, we find that only approximate fixed-point convergence suffices, enabling the design of a scalable training curriculum that largely retains parallelization, with full convergence required only for a small subset of tokens. Our approach demonstrates superior state-tracking capabilities on regular languages, surpassing transformers and SSMs. We further scale implicit SSMs to natural language reasoning tasks and pretraining of large-scale language models up to 1.3B parameters on 207B tokens, representing, to our knowledge, the largest implicit model trained to date. Notably, our implicit models outperform their explicit counterparts on standard benchmarks.
Paperid:3093
Authors:Mouxiang Chen · Lefei Shen · Zhuo Li · Xiaoyun Wang · Jianling Sun · Chenghao Liu
Abstract:
Foundation models have emerged as a promising approach in time series forecasting (TSF). Existing approaches either repurpose large language models (LLMs) or build large-scale time series datasets to develop TSF foundation models for universal forecasting. However, these methods face challenges due to the severe cross-domain gap or in-domain heterogeneity. This paper explores a new road to building a TSF foundation model from rich, high-quality natural images. Our key insight is that a visual masked autoencoder, pre-trained on the ImageNet dataset, can naturally be a numeric series forecaster. By reformulating TSF as an image reconstruction task, we bridge the gap between image pre-training and TSF downstream tasks. Surprisingly, without further adaptation in the time series domain, the proposed VisionTS could achieve better zero-shot forecast performance than existing TSF foundation models. With fine-tuning for one epoch, VisionTS could further improve the forecasting and achieve state-of-the-art performance in most cases. Extensive experiments reveal intrinsic similarities between images and real-world time series, suggesting that visual models may offer a "free lunch" for TSF and highlight the potential for future cross-modality research. Our code is available in the Supplementary Material.
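The image-reconstruction reformulation can be pictured by folding a 1D series into a (cycles × period) grid, so that forecasting becomes inpainting the masked bottom rows of an "image"; a toy sketch of the folding step (the function name and masking details are assumptions):

```python
import numpy as np

def series_to_image(x, period):
    """Fold a 1D series into a 2D (cycles x period) grid so a visual
    masked autoencoder can treat forecasting as image inpainting."""
    n = (len(x) // period) * period   # drop a ragged final cycle
    return np.asarray(x[:n], dtype=float).reshape(-1, period)

img = series_to_image(np.arange(13.0), period=4)
assert img.shape == (3, 4)
assert np.allclose(img[1], [4.0, 5.0, 6.0, 7.0])
```

Each row is one seasonal cycle, so vertical structure in the "image" encodes the periodicity that the pre-trained visual model can exploit.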
Paperid:3094
Authors:Si-Yang Liu · Han-Jia Ye
Abstract:
TabPFN has emerged as a promising in-context learning model for tabular data, capable of directly predicting the labels of test samples given labeled training examples. It has demonstrated competitive performance, particularly on small-scale classification tasks. However, despite its effectiveness, TabPFN still requires further refinement in several areas, including handling high-dimensional features, aligning with downstream datasets, and scaling to larger datasets. In this paper, we revisit existing variants of TabPFN and observe that most approaches focus on reducing either bias or variance, often neglecting the other side, while also increasing inference overhead. To fill this gap, we propose Beta (Bagging and Encoder-based Fine-tuning for TabPFN Adaptation), a novel and effective method designed to minimize both bias and variance. To reduce bias, we introduce a lightweight encoder to better align downstream tasks with the pre-trained TabPFN. By increasing the number of encoders in a lightweight manner, Beta mitigates variance, thereby further improving the model’s performance. Additionally, bootstrapped sampling is employed to further reduce the impact of data perturbations on the model, all while maintaining computational efficiency during inference. Our approach enhances TabPFN’s ability to handle high-dimensional data and scale to larger datasets. Experimental results on over 200 benchmark classification datasets demonstrate that Beta either outperforms or matches state-of-the-art methods.
Paperid:3095
Authors:Seungjun Shin · Jaehoon Oh · Dokwan Oh
Abstract:
Attention mechanisms are central to the success of large language models (LLMs), enabling them to capture intricate token dependencies and implicitly assign importance to each token. Recent studies have revealed the sink token, which receives disproportionately high attention despite its limited semantic role. In this paper, we first examine the relationship between the sink token and other tokens, moving beyond attention to explore their similarity in hidden states as a function of layer depth. We observe that as the layers get deeper, the cosine similarity between the normalized hidden states of the sink token and those of other tokens increases, while the normalized hidden states of the sink token themselves exhibit negligible changes. This implies that other tokens are consistently directed toward the sink token throughout the layers. Next, we propose a dynamic token selection method, called OrthoRank, that uses these findings to select important tokens. Specifically, at a given layer, we define token importance by the speed at which a token moves toward the sink token. This is converted into orthogonality with the sink token, meaning that tokens more orthogonal to the sink token are assigned greater importance. Finally, through extensive experiments, we demonstrate that our method achieves lower perplexity and higher zero-shot accuracy than layer pruning methods at the same sparsity ratio with comparable throughput, while also achieving superior performance on LongBench.
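The orthogonality-based importance score can be sketched as follows (a toy distillation of the stated idea; the paper's exact score and normalization may differ):

```python
import numpy as np

def orthorank_scores(hidden, sink_idx=0):
    """Score tokens by how orthogonal their normalized hidden state is
    to the sink token's: more orthogonal -> more important."""
    H = hidden / np.linalg.norm(hidden, axis=1, keepdims=True)
    cos = H @ H[sink_idx]
    return 1.0 - np.abs(cos)            # the sink itself scores 0

hidden = np.array([[1.0, 0.0],   # sink token
                   [3.0, 0.0],   # parallel to the sink -> unimportant
                   [0.0, 2.0],   # orthogonal to the sink -> important
                   [1.0, 1.0]])  # in between
s = orthorank_scores(hidden)
assert np.isclose(s[0], 0.0) and s[2] > s[3] > s[1]
```

Tokens with the lowest scores are the ones already collapsing onto the sink direction and are the natural candidates to drop.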
Paperid:3096
Authors:Leyan Pan · Vijay Ganesh · Jacob Abernethy · Chris Esposo · Wenke Lee
Abstract:
We formally study the logical reasoning capabilities of decoder-only Transformers in the context of the boolean satisfiability (SAT) problem. First, we prove by construction that decoder-only Transformers can decide 3-SAT, in a non-uniform model of computation, using backtracking and deduction via Chain-of-Thought (CoT). Second, we implement our construction as a PyTorch model with a tool (PARAT) that we designed to empirically demonstrate its correctness and investigate its properties. Third, rather than \textit{programming} a transformer to reason, we evaluate empirically whether it can be \textit{trained} to do so by learning directly from algorithmic traces (``reasoning paths'') of our theoretical construction. The trained models demonstrate strong out-of-distribution generalization on problem sizes seen during training, but exhibit limited length generalization, consistent with the implications of our theoretical result.
Paperid:3097
Authors:Wonjun Lee · Doehyeon Lee · Eugene Choi · Sangyoon Yu · Ashkan Yousefpour · Haon Park · Bumsub Ham · Suhyun Kim
Abstract:
Current Vision Language Models (VLMs) remain vulnerable to malicious prompts that induce harmful outputs. Existing safety benchmarks for VLMs primarily rely on automated evaluation methods, but these methods struggle to detect implicit harmful content or produce inaccurate evaluations. As a result, we found that existing benchmarks exhibit low levels of harmfulness, ambiguous data, and limited diversity in image-text pair combinations. To address these issues, we propose the ELITE benchmark, a high-quality safety evaluation benchmark for VLMs, underpinned by our enhanced evaluation method, the ELITE evaluator. The ELITE evaluator explicitly incorporates a toxicity score to accurately assess harmfulness in multimodal contexts, where VLMs often provide specific, convincing, but harmless descriptions of images. We filter out ambiguous and low-quality image-text pairs from existing benchmarks using the ELITE evaluator and generate diverse combinations of safe and unsafe image-text pairs. Our experiments demonstrate that the ELITE evaluator achieves superior alignment with human evaluations compared to prior automated methods, and that the ELITE benchmark offers enhanced quality and diversity. By introducing ELITE, we pave the way for safer, more robust VLMs, contributing essential tools for evaluating and mitigating safety risks in real-world applications.
Paperid:3098
Authors:Chenrui Wu · Haishuai Wang · Xiang Zhang · Chengqi Zhang · Jiajun Bu
Abstract:
Time series analysis is crucial across various fields such as energy, environment, transportation, finance, and health. Deep learning has significantly advanced this field; in particular, the Time Series Foundation Model (TSFM) excels in multiple domains due to extensive pretraining. In this work, we focus on TSFM's challenges in medical practice: limited computing resources and medical data privacy. TSFM variants include fine-tuned models and those pre-trained for rapid deployment on diverse data. Hospitals may lack the computing resources to train on physiological signals locally, and generalized TSFMs still trail task-specific methods on private, imbalanced local data. To address this, we propose PhysioPFM, a framework for efficiently personalizing TSFMs. Our approach involves low-rank pre-training on public datasets, generator training on the trained LoRA weights, and efficient weight generation from local data. Experimental results demonstrate that integrating the generated models with a TSFM enhances performance and transferability while reducing the need for additional training on sensitive data.
Paperid:3099
Authors:Yicheng Pan · Renjie Chen · Pengyu Long · Bingchen Fan
Abstract:
Overlap and hierarchy are two prevalent phenomena in clustering, and they usually coexist in a single system. There are several studies on each of them separately, but it remains unclear how to characterize and evaluate the hybrid structures. To address this issue, we initiate the study of hierarchical overlapping clustering on graphs by introducing a new cost function for it. We show the rationality of our cost function via several intuitive properties, and develop an approximation algorithm that achieves a constant approximation factor for its dual version. Our algorithm is a recursive process of overlapping bipartition based on local search, which makes a sped-up version of it extremely scalable. Our experiments demonstrate that the sped-up algorithm performs significantly better than all baseline methods in both effectiveness and scalability on synthetic and real datasets.
Paperid:3100
Authors:Jiancong Xiao · Bojian Hou · Zhanliang Wang · Ruochen Jin · Qi Long · Weijie Su · Li Shen
Abstract:
One of the key technologies for the success of Large Language Models (LLMs) is preference alignment. However, a notable side effect of preference alignment is poor calibration: while the pretrained models are typically well-calibrated, LLMs tend to become poorly calibrated after alignment with human preferences. In this paper, we investigate why preference alignment affects calibration and how to address this issue. For the first question, we observe that the preference collapse issue in alignment undesirably generalizes to the calibration scenario, causing LLMs to exhibit overconfidence and poor calibration. To address this, we demonstrate the importance of fine-tuning with domain-specific knowledge to alleviate the overconfidence issue. To further analyze whether this affects the model's performance, we categorize models into two regimes: calibratable and non-calibratable, defined by bounds of Expected Calibration Error (ECE). In the calibratable regime, we propose a calibration-aware fine-tuning approach to achieve proper calibration without compromising LLMs' performance. However, as models are further fine-tuned for better performance, they enter the non-calibratable regime. For this case, we develop an EM-algorithm-based ECE regularization for the fine-tuning loss to maintain low calibration error. Extensive experiments validate the effectiveness of the proposed methods.
Paperid:3101
Authors:Wenbo Lu · Shaoyi Zheng · Yuxuan Xia · Shengjie Wang
Abstract:
Diffusion models excel at high-fidelity image generation but face scalability limits due to transformers' quadratic attention complexity. Plug-and-play token reduction methods like ToMeSD and ToFU achieve FLOP reduction by merging redundant tokens in generated images but rely on GPU-inefficient operations (e.g., sorting, scattered writes), causing overhead that negates theoretical speedups when paired with optimized attention implementations (e.g., FlashAttention). To bridge this gap, we propose Token Merge with Attention (ToMA), an off-the-shelf method that rethinks token reduction for GPU-aligned efficiency, with three major contributions: 1) a reformulation of token merging as a submodular optimization framework that selects diverse tokens, 2) merge/unmerge implemented as attention-like linear transformations via GPU-friendly matrix operations, and 3) exploitation of latent locality and temporal redundancy (pattern reuse) to minimize overhead. ToMA reduces SDXL/Flux-1 generation latency by 24\%/23\% (DINO $\Delta$ < 0.07), outperforming prior methods. This work bridges the gap between theoretical and practical efficiency for diffusion transformers.
Paperid:3102
Authors:Vincenzo De Paola · Riccardo Zamboni · Mirco Mutti · Marcello Restelli
Abstract:
Parallel data collection has redefined Reinforcement Learning (RL), unlocking unprecedented efficiency and powering breakthroughs in large-scale real-world applications. In this paradigm, $N$ identical agents operate in $N$ replicas of an environment simulator, accelerating data collection by a factor of $N$. A critical question arises: *Does specializing the policies of the parallel agents hold the key to surpassing the $N$-factor acceleration?* In this paper, we introduce a novel learning framework that maximizes the entropy of collected data in a parallel setting. Our approach carefully balances the entropy of individual agents with inter-agent diversity, effectively minimizing redundancies. The latter idea is implemented with a centralized policy gradient method, which shows promise when evaluated empirically against systems of identical agents, as well as synergy with batch RL techniques that can exploit data diversity. Finally, we provide an original concentration analysis that shows faster rates for specialized parallel sampling distributions, which supports our methodology and may be of independent interest.
Paperid:3103
Authors:Nikola Milosevic · Johannes Müller · Nico Scherf
Abstract:
Reinforcement Learning (RL) agents can solve diverse tasks but often exhibit unsafe behavior. Constrained Markov Decision Processes (CMDPs) address this by enforcing safety constraints, yet existing methods either sacrifice reward maximization or allow unsafe training. We introduce Constrained Trust Region Policy Optimization (C-TRPO), which reshapes the policy space geometry to ensure trust regions contain only safe policies, guaranteeing constraint satisfaction throughout training. We analyze its theoretical properties and connections to TRPO, Natural Policy Gradient (NPG), and Constrained Policy Optimization (CPO). Experiments show that C-TRPO reduces constraint violations while maintaining competitive returns.
Paperid:3104
Authors:Ulzee An · Moonseong Jeong · Simon Lee · Aditya Gorla · Yuzhe Yang · Sriram Sankararaman
Abstract:
Current challenges in developing foundation models for volumetric imaging data, such as magnetic resonance imaging (MRI), stem from the computational complexity of state-of-the-art architectures in high dimensions and from curating sufficiently large datasets of volumes. To address these challenges, we introduce Raptor (Random Planar Tensor Reduction), a training-free method for generating semantically rich embeddings for volumetric data. Raptor leverages a frozen 2D foundation model, pretrained on natural images, to extract visual tokens from individual cross-sections of medical volumes. These tokens are then spatially compressed using random projections, significantly reducing computational complexity while retaining rich semantic information. Extensive experiments on 10 diverse medical volume tasks verify the superior performance of Raptor over state-of-the-art methods, including those pretrained exclusively on medical volumes (+3 SuPreM, +6 MISFM, +10 Merlin, +13 VoCo, and +14 SLIViT), while entirely bypassing the need for costly training. Our results highlight Raptor's effectiveness and versatility as a foundation for advancing deep learning-based methods for medical volumes.
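The compression step can be caricatured with a fixed random projection over slice-token positions (a toy sketch; the shapes and names are assumptions, and the frozen 2D encoder is replaced here by pre-extracted tokens):

```python
import numpy as np

def raptor_compress(slice_tokens, out_dim, seed=0):
    """Compress per-slice token features of shape (S, T, D) into a
    fixed-size embedding by randomly projecting the S*T token
    positions down to out_dim, preserving the feature axis D."""
    S, T, D = slice_tokens.shape
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(out_dim, S * T)) / np.sqrt(S * T)
    return (P @ slice_tokens.reshape(S * T, D)).ravel()  # (out_dim * D,)

tokens = np.random.default_rng(1).normal(size=(8, 16, 4))
emb = raptor_compress(tokens, out_dim=5)
assert emb.shape == (20,)
# The projection is fixed by the seed, so embeddings are reproducible
# across volumes without any training.
assert np.allclose(emb, raptor_compress(tokens, out_dim=5))
```

Because the projection matrix is random and fixed, no gradient updates are ever needed, which is what makes the pipeline training-free.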
Paperid:3105
Authors:Haoqi Wang · Mathieu Salzmann · Tong Zhang
Abstract:
Large transformer models are known to produce high-norm tokens. In vision transformers (ViTs), such tokens have been mathematically modeled through the singular vectors of the linear approximations of layers. However, in large language models (LLMs), the underlying causes of high-norm tokens remain largely unexplored, and their properties, which differ from those in ViTs, require a new analysis framework. In this paper, we provide both theoretical insights and empirical validation across a range of recent models, leading to the following observations: i) the layer-wise singular direction predicts the abrupt explosion of token norms in LLMs; ii) the negative eigenvalues of a layer explain its sudden decay; iii) the computational pathways leading to high-norm tokens differ between initial and non-initial tokens; iv) high-norm tokens are triggered by the right leading singular vector of the matrix approximating the corresponding modules. We showcase two practical applications of these findings: the improvement of quantization schemes and the design of LLM signatures. Our findings not only advance the understanding of singular defects in LLMs but also open new avenues for their application. We expect this work to stimulate further research into the internal mechanisms of LLMs, and we will therefore publicly release our code.
Paperid:3106
Authors:Yanbin Wei · Xuehao Wang · Zhan Zhuang · Yang Chen · Shuhao Chen · Yulong Zhang · James Kwok · Yu Zhang
Abstract:
Message-passing graph neural networks (MPNNs) and structural features (SFs) are cornerstones for the link prediction task. However, as a common and intuitive mode of understanding, the potential of visual perception has been overlooked in the MPNN community. For the first time, we equip MPNNs with vision structural awareness by proposing an effective framework called Graph Vision Network (GVN), along with a more efficient variant (E-GVN). Extensive empirical results demonstrate that with the proposed frameworks, GVN consistently benefits from the vision enhancement across seven link prediction datasets, including challenging large-scale graphs. Such improvements are compatible with existing state-of-the-art (SOTA) methods and GVNs achieve new SOTA results, thereby underscoring a promising novel direction for link prediction.
Paperid:3107
Authors:Jue Gong · Jingkai Wang · Zheng Chen · Xing Liu · Hong Gu · Yulun Zhang · Xiaokang Yang
Abstract:
Human body restoration, as a specific application of image restoration, is widely applied in practice and plays a vital role across diverse fields. However, thorough research remains difficult, particularly due to the lack of benchmark datasets. In this study, we propose a high-quality dataset automated cropping and filtering (HQ-ACF) pipeline. This pipeline leverages existing object detection datasets and other unlabeled images to automatically crop and filter high-quality human images. Using this pipeline, we constructed a person-based restoration with sophisticated objects and natural activities (PERSONA) dataset, which includes training, validation, and test sets. The dataset significantly surpasses other human-related datasets in both quality and content richness. Finally, we propose OSDHuman, a novel one-step diffusion model for human body restoration. Specifically, we propose a high-fidelity image embedder (HFIE) as the prompt generator to better guide the model with low-quality human image information, effectively avoiding misleading prompts. Experimental results show that OSDHuman outperforms existing methods in both visual quality and quantitative metrics. The dataset and code will be released soon.
Paperid:3108
Authors:Peiyan Zhang · Haibo Jin · Leyang Hu · Xinnuo Li · Liying Kang · Man Luo · Yangqiu Song · Haohan Wang
Abstract:
Recent advancements in large language models (LLMs) have significantly enhanced the ability of LLM-based systems to perform complex tasks through natural language processing and tool interaction. However, optimizing these LLM-based systems for specific tasks remains challenging, often requiring manual interventions like prompt engineering and hyperparameter tuning. Existing automatic optimization methods, such as textual feedback-based techniques (e.g., TextGrad), tend to focus on immediate feedback, analogous to using immediate derivatives in traditional numerical gradient descent. However, relying solely on such feedback can be limited when the adjustments made in response to this feedback are either too small or fluctuate irregularly, potentially slowing down or even stalling the optimization process. To overcome these challenges, more adaptive methods are needed, especially in situations where the system's response is evolving slowly or unpredictably. In this paper, we introduce REVOLVE, an optimization method that tracks how "R"esponses "EVOLVE" across iterations in LLM systems. By focusing on the evolution of responses over time, REVOLVE enables more stable and effective optimization by making thoughtful, progressive adjustments at each step. Experimental results demonstrate that REVOLVE outperforms competitive baselines, achieving a 7.8% improvement in prompt optimization, a 20.72% gain in solution refinement, and a 29.17% increase in code optimization. Additionally, REVOLVE converges in fewer iterations, resulting in significant computational savings. These advantages highlight its adaptability and efficiency, positioning REVOLVE as a valuable tool for optimizing LLM-based systems and accelerating the development of next-generation AI technologies.
Paperid:3109
Authors:Chang Cao · Han Li · Yulong Wang · Rui Wu · Hong Chen
Abstract:
While Graph Neural Networks (GNNs) have shown outstanding performance in node classification tasks, they are vulnerable to adversarial attacks, which are imperceptible changes to input samples. Adversarial training, as a widely used tool to enhance the adversarial robustness of GNNs, has presented remarkable effectiveness in node classification tasks. However, the generalization properties explaining their behaviors remain poorly understood from a theoretical viewpoint. To fill this gap, we develop a high-probability generalization bound for general GNNs in adversarial learning through covering number analysis. We estimate the covering number of the GNN model class based on the entire perturbed feature matrix by constructing a cover for the perturbation set. Our results are generally applicable to a series of GNNs. We demonstrate their applicability by investigating the generalization performance of several popular GNN models under adversarial attacks, which reveals the architectural factors influencing the generalization gap. Our experimental results on benchmark datasets provide evidence that supports the established theoretical findings.
Paperid:3110
Authors:Siyuan Duan · Wenyuan Wu · Peng Hu · Zhenwen Ren · Dezhong Peng · Yuan Sun
Abstract:
Physics-informed neural networks (PINNs) aim to constrain the outputs and gradients of deep learning models to satisfy specified governing physics equations, and have demonstrated significant potential for solving partial differential equations (PDEs). Although existing PINN methods have achieved pleasing performance, they treat easy and hard sample points indiscriminately, especially those near physical boundaries. This easily causes the PINN model to fall into undesirable local minima and unstable learning, resulting in an Unbalanced Prediction Problem (UPP). To deal with this daunting problem, we propose a novel framework named Cognitive Physical Informed Neural Network (CoPINN) that imitates the human cognitive manner of learning from easy to hard. Specifically, we first employ separable subnetworks to encode independent one-dimensional coordinates and apply an aggregation scheme to generate multi-dimensional predicted physical variables. Then, during the training phase, we dynamically evaluate the difficulty of each sample according to the gradient of the PDE residuals. Finally, we propose a cognitive training scheduler that progressively optimizes the entire sampling region from easy to hard, thereby improving robustness and generalization in predicting physical boundary regions. Extensive experiments demonstrate that CoPINN achieves state-of-the-art performance, significantly reducing prediction errors in stubborn regions.
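The easy-to-hard idea can be caricatured with residual-based sample weights that relax toward uniform as training progresses (a hypothetical scheduler sketched for intuition, not the authors' exact scheme):

```python
import numpy as np

def curriculum_weights(residuals, progress, sharpness=5.0):
    """Down-weight hard (high-residual) collocation points early
    (progress ~ 0) and approach uniform weighting by the end
    (progress ~ 1), mimicking easy-to-hard curriculum training."""
    difficulty = np.abs(residuals) / np.max(np.abs(residuals))
    w = np.exp(-(1.0 - progress) * sharpness * difficulty)
    return w / w.sum()

res = np.array([0.1, 1.0])            # one easy point, one hard point
early = curriculum_weights(res, progress=0.0)
late = curriculum_weights(res, progress=1.0)
assert early[0] > early[1]            # early: favor the easy point
assert np.allclose(late, 0.5)         # late: uniform weighting
```

The weights would multiply the per-point PDE residual loss, so hard boundary-region points only dominate the objective once the easier interior is fit.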
Paperid:3111
Authors:Xiner Li · Limei Wang · Youzhi Luo · Carl Edwards · Shurui Gui · Yuchao Lin · Heng Ji · Shuiwang Ji
Abstract:
We consider molecule generation in 3D space using language models (LMs), which requires discrete tokenization of 3D molecular geometries. Although tokenization of molecular graphs exists, that for 3D geometries is largely unexplored. Here, we attempt to bridge this gap by proposing a novel method which converts molecular geometries into SE(3)-invariant 1D discrete sequences. Our method consists of canonical labeling and invariant spherical representation steps, which together maintain geometric and atomic fidelity in a format conducive to LMs. Our experiments show that, when coupled with our proposed method, various LMs excel in molecular geometry generation, especially in controlled generation tasks.
Paperid:3112
Authors:Jennifer Hsia · Afreen Shaikh · Zhiruo Wang · Graham Neubig
Abstract:
Retrieval-augmented generation (RAG) enhances language models by integrating external knowledge, but its effectiveness is highly dependent on system configuration. Improper retrieval settings can degrade performance, making RAG less reliable than closed-book generation. In this work, we introduce RAGGED, a framework for systematically evaluating RAG systems across diverse retriever-reader configurations, retrieval depths, and datasets. Our analysis reveals that reader robustness to noise is the key determinant of RAG stability and scalability. Some readers benefit from increased retrieval depth, while others degrade due to their sensitivity to distracting content. Through large-scale experiments on open-domain, multi-hop, and specialized-domain datasets, we show that retrievers, rerankers, and prompts influence performance but do not fundamentally alter these reader-driven trends. By providing a principled framework and new metrics to assess RAG stability and scalability, RAGGED enables systematic evaluation of retrieval-augmented generation systems, guiding future research on optimizing retrieval depth and model robustness.
Paperid:3113
Authors:Edward Chang
Abstract:
This paper introduces a three-branch checks-and-balances framework for ethical alignment of Large Language Models (LLMs), inspired by the idea of collaborative intelligence. It implements three independent yet interacting components: LLMs as the executive branch for knowledge generation, DIKE (the goddess of justice) as the legislative branch establishing ethical guardrails, and ERIS (the goddess of discord) as the judicial branch for contextual interpretation. The adversarial DIKE-ERIS duality enables adaptation to diverse cultural contexts while upholding consistent ethical principles. This architecture addresses limitations of reinforcement learning with human feedback (RLHF) by providing interpretable, adaptable, and culturally-aware ethical reasoning. Through self-supervised learning and adversarial testing, our framework demonstrates how emotional modeling can guide linguistic behaviors toward ethical outcomes while preserving independence across knowledge generation, ethical oversight, and contextual interpretation.
Paperid:3114
Authors:Sagnik Mukherjee · Abhinav Chinta · Takyoung Kim · Tarun Anoop Sharma · Dilek Hakkani-Tür
Abstract:
Chain-of-Thought (CoT) prompting enhances mathematical reasoning in large language models (LLMs) by enabling detailed step-by-step solutions. However, due to the verbosity of LLMs, the resulting reasoning chains can be long, making it harder to verify the reasoning steps and trace issues resulting from dependencies between steps that may be far apart in the sequence. Importantly, mathematical reasoning allows each step to be derived from a small set of premises, which are a subset of the preceding steps in the reasoning chain. In this paper, we present a framework that identifies the premises for each step to improve the evaluation of reasoning. We restructure conventional linear reasoning chains into Premise Augmented Reasoning Chains (PARC) by introducing premise links, resulting in a directed acyclic graph where the nodes are the steps and the edges are the premise links. Through experiments with a PARC-based dataset that we built, namely (Premises and ERrors identification in LLMs), we demonstrate that LLMs can reliably identify premises within complex reasoning chains. In particular, even open-source LLMs achieve 90\% recall in premise identification. We also show that PARC helps to identify errors in reasoning chains more reliably: the accuracy of error identification improves by 6\% to 16\% absolute when step-by-step verification is carried out in PARC under the premises. Our findings highlight the utility of premise-centric representations in addressing complex problem-solving tasks and open new avenues for improving the reliability of LLM-based reasoning evaluations.
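The premise links that turn a linear chain into a DAG can be represented with a plain dictionary. The step texts and helper functions below are hypothetical illustrations, not from the paper:

```python
# A linear chain of steps becomes a DAG once each step records its premises
# (indices of the earlier steps it is derived from).
steps = {
    1: "Let x = 3.",
    2: "Let y = 4.",
    3: "x + y = 7.",         # derived from steps 1 and 2
    4: "2 * (x + y) = 14.",  # derived from step 3
}
premises = {1: [], 2: [], 3: [1, 2], 4: [3]}

def is_valid_parc(premises):
    """Premise links must point strictly backwards, so the graph is a DAG."""
    return all(p < step for step, ps in premises.items() for p in ps)

def relevant_context(step, premises):
    """Transitive closure of premises: all steps needed to verify `step`."""
    needed, stack = set(), list(premises[step])
    while stack:
        s = stack.pop()
        if s not in needed:
            needed.add(s)
            stack.extend(premises[s])
    return sorted(needed)

print(is_valid_parc(premises))        # True
print(relevant_context(4, premises))  # [1, 2, 3]
```

Verifying a step against only its `relevant_context`, rather than the whole chain, is what makes step-by-step verification "under the premises" tractable for long chains.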
Paperid:3115
Authors:Wendong Bu · Yang Wu · Qifan Yu · Minghe Gao · Bingchen Miao · Zhenkui Zhang · Kaihang Pan · liyunfei · Mengze Li · Wei Ji · Juncheng Li · Siliang Tang · Yueting Zhuang
Abstract:
As multimodal large language models (MLLMs) advance, MLLM-based virtual agents have demonstrated remarkable performance. However, existing benchmarks face significant limitations, including uncontrollable task complexity, extensive manual annotation, and a lack of multidimensional evaluation. In response to these challenges, we introduce OmniBench, a self-generating, graph-based benchmark with an automated pipeline for synthesizing tasks of controllable complexity through subtask composition. To evaluate the diverse capabilities of virtual agents on the graph, we further present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities. Our synthesized dataset contains 36k graph-structured tasks across 20 scenarios, achieving a 91% human acceptance rate. Training on our graph-structured data shows that it improves generalization across environments. We conduct multidimensional evaluations for virtual agents, revealing their performance across various capabilities and paving the way for future advancements. The code is available in https://anonymous.4open.science/r/OmniBench-E8FD.
Paperid:3116
Authors:Ruonan Yu · Songhua Liu · Zhenxiong Tan · Xinchao Wang
Abstract:
Text-to-image diffusion models have achieved remarkable progress in recent years. However, generating high-resolution images remains challenging, particularly when training data and computational resources are limited. In this paper, we explore this practical problem from two key perspectives: data and parameter efficiency. For data efficiency, we theoretically and empirically demonstrate that synthetic data generated by teacher models can significantly promote training convergence. For parameter efficiency, we find that tuning minor components of the weight matrices outperforms widely-used low-rank adapters when synthetic data are unavailable, offering substantial performance gains while maintaining efficiency. Additionally, for models leveraging guidance distillation, such as FLUX, we show that disabling classifier-free guidance, i.e., setting the guidance scale to 1 during adaptation, is crucial for satisfactory performance. Extensive experiments validate our approach, achieving comparable 2K-generation performance to state-of-the-art closed-source models like FLUX1.1 Pro Ultra with only 3K samples and 2K iterations, while setting new benchmarks for 4K-resolution generation.
Paperid:3117
Authors:He ZHANG · Ming Zhou · shaopeng zhai · Ying Sun · Hui Xiong
Abstract:
Unsupervised skill discovery aims to learn diverse and distinguishable behaviors in open-ended reinforcement learning. Existing methods focus on improving diversity via pure exploration, mutual information optimization, and learning temporal representations. Although they perform well on exploration, they remain limited in efficiency, especially in high-dimensional situations. In this work, we frame skill discovery as a min-max game between skill generation and policy learning, and propose a regret-aware method on top of temporal representation learning that expands the discovered skill space along the direction of upgradable policy strength. The key insight behind the proposed method is that skill discovery is adversarial to policy learning: skills of weak strength should be explored further, while skills whose strength has converged need less exploration. As an implementation, we score the degree of strength convergence with regret and guide skill discovery with a learnable skill generator. To avoid degeneration, skills are generated from an upgradable population of skill generators. We conduct experiments on environments of varying complexity and dimension size. Empirical results show that our method outperforms baselines in both efficiency and diversity. Moreover, our method achieves a 15\% zero-shot improvement on high-dimensional environments compared to existing methods.
Paperid:3118
Authors:Utkarsh Saxena · Sayeh Sharify · Kaushik Roy · Xin Wang
Abstract:
Post-training quantization (PTQ) of large language models (LLMs) holds promise for reducing the prohibitive computational cost at inference time. Quantizing all weight, activation, and key-value (KV) cache tensors to 4-bit without significantly degrading generalizability is challenging, due to the high quantization error caused by extreme outliers in activations. To tackle this problem, we propose ResQ, a PTQ method that pushes the state-of-the-art further. By means of principal component analysis (PCA), it identifies a low-rank subspace (in practice 1/8 of the hidden dimension) in which activation variances are highest, and keeps the coefficients within this subspace in high precision, e.g.~8-bit, while quantizing the rest to 4-bit. Within each subspace, an invariant random rotation is applied to further suppress outliers. We show that this is a provably optimal mixed-precision quantization scheme that minimizes error. With the Llama and Qwen2.5 families of models, we demonstrate that ResQ outperforms recent uniform and mixed-precision PTQ methods on a variety of benchmarks, achieving up to 33\% lower perplexity on Wikitext than the next best method, SpinQuant, and up to 3× speedup over a 16-bit baseline. Anonymous code repository available at https://anonymous.4open.science/r/project-resq-2142.
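A minimal sketch of the core idea, assuming per-tensor symmetric quantization and omitting the method's random rotation and per-channel details: rotate activations into their PCA basis, keep the top-variance 1/8 of coordinates in 8-bit, and quantize the rest to 4-bit:

```python
import numpy as np

def quantize(x, bits):
    """Symmetric uniform quantization, per tensor (an assumption)."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(x / scale) * scale

def mixed_precision(X, keep_frac=1/8):
    """Keep the highest-variance PCA subspace in 8-bit, the rest in 4-bit."""
    X = X - X.mean(0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)  # principal directions
    k = max(1, int(keep_frac * X.shape[1]))
    Z = X @ Vt.T                              # rotate into the PCA basis
    Z[:, :k] = quantize(Z[:, :k], bits=8)     # high-variance coords: 8-bit
    Z[:, k:] = quantize(Z[:, k:], bits=4)     # remainder: 4-bit
    return Z @ Vt                             # rotate back

rng = np.random.default_rng(0)
# toy activations whose coordinates have very unequal variances
X = rng.normal(size=(256, 64)) @ np.diag(np.linspace(5.0, 0.1, 64))
Xc = X - X.mean(0)
err_mixed = np.linalg.norm(Xc - mixed_precision(X))
err_4bit = np.linalg.norm(Xc - quantize(Xc, bits=4))
print(err_mixed < err_4bit)  # keeping the loud subspace in 8-bit helps
```

On this toy data the mixed scheme beats uniform 4-bit because the few high-variance directions, which dominate the error, get the extra bits.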
Paperid:3119
Authors:Hongling Zheng · Li Shen · Yong Luo · Deheng Ye · Bo Du · Jialie SHEN · Dacheng Tao
Abstract:
The Conditional Sequence Modeling (CSM) paradigm, benefiting from the transformer's powerful distribution modeling capabilities, has demonstrated considerable promise in offline Reinforcement Learning (RL) tasks. Depending on the task's nature, it is crucial to carefully balance the interplay between inherent local features and long-term dependencies in Markov decision trajectories to mitigate potential performance degradation and unnecessary computational overhead. In this paper, we propose Decision Mixer (DM), which addresses the conflict between features of different scales in the modeling process from the perspective of dynamic integration. Drawing inspiration from conditional computation, we design a plug-and-play dynamic token selection mechanism to ensure the model can effectively allocate attention to different features based on task characteristics. Additionally, we employ an auxiliary predictor to alleviate the short-sightedness issue in the autoregressive sampling process. DM achieves state-of-the-art performance on various standard RL benchmarks while requiring significantly fewer computational resources, offering a viable solution for building efficient and scalable RL foundation models. Code is available here.
Paperid:3120
Authors:Nick Bishop · Daniel Jarne Ornia · Joel Dyer · Anisoara Calinescu · Michael Wooldridge
Abstract:
Simulation modeling offers a flexible approach to constructing high-fidelity synthetic representations of complex real-world systems. However, the increased complexity of such models introduces additional complications, for example when carrying out statistical inference procedures. This has motivated a large and growing literature on likelihood-free or simulation-based inference methods, which approximate (e.g., Bayesian) inference without assuming access to the simulator's intractable likelihood function. A hitherto neglected problem in the simulation-based Bayesian inference literature is the challenge of constructing uninformative reference priors for complex simulation models. Such priors maximise an expected Kullback-Leibler distance from the prior to the posterior, thereby influencing posterior inferences minimally and enabling an "objective" approach to Bayesian inference that does not necessitate the incorporation of strong subjective prior beliefs. In this paper, we propose and test a selection of likelihood-free methods for learning reference priors for simulation models, using variational approximations to these priors and a variety of mutual information estimators. Our experiments demonstrate that good approximations to reference priors for simulation models are attainable in this way, providing a first step towards the development of likelihood-free objective Bayesian inference procedures.
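The quantity a reference prior maximises, the expected prior-to-posterior KL distance, equals the mutual information I(theta; X). A nested Monte Carlo estimate of it is a few lines; the Binomial toy model below is an assumption for illustration, not the paper's variational estimator:

```python
import numpy as np

def mutual_information(prior_sample, simulate, loglik,
                       n_outer=500, n_inner=500, seed=0):
    """Nested Monte Carlo estimate of I(theta; X) = E[log p(x|theta) - log p(x)].

    prior_sample(rng, n) -> theta samples; simulate(rng, theta) -> data;
    loglik(x, theta) -> log p(x | theta) (up to a constant in theta).
    """
    rng = np.random.default_rng(seed)
    thetas = prior_sample(rng, n_outer)
    inner = prior_sample(rng, n_inner)
    total = 0.0
    for th in thetas:
        x = simulate(rng, th)
        # log p(x) approximated by averaging the likelihood over the prior
        log_marginal = np.log(np.mean(np.exp(loglik(x, inner))) + 1e-300)
        total += loglik(x, th) - log_marginal
    return total / n_outer

# toy model: x ~ Binomial(20, theta), theta ~ Uniform(0, 1)
n_trials = 20
prior_sample = lambda rng, n: rng.uniform(0, 1, size=n)
simulate = lambda rng, th: rng.binomial(n_trials, th)
loglik = lambda x, th: (x * np.log(th + 1e-12)
                        + (n_trials - x) * np.log(1 - th + 1e-12))

mi = mutual_information(prior_sample, simulate, loglik)
print(mi > 0)  # the data are informative about theta
```

Learning a reference prior then amounts to adjusting the prior's parameters to push this estimate up, which is where the paper's variational approximations and MI estimators come in.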
Paperid:3121
Authors:Xiaoqian Jiang · Jing Zhang
Abstract:
Many federated learning scenarios encounter label noise in the client-side datasets. The resulting degradation in global model performance raises the urgent need to address label noise. This paper proposes FedClean, a novel general robust label-noise correction method for federated learning. FedClean first uses local centralized noisy-label learning to select clean samples to train a global model. Then, it employs a two-stage correction scheme that corrects noisy labels from two distinct perspectives: local noisy-label learning and the global model. FedClean also proposes a novel model aggregation method, further reducing the impact of label noise. FedClean assumes neither the existence of clean clients nor specific noise distributions, showing maximum versatility. Extensive experimental results show that FedClean effectively identifies and rectifies label noise even if all clients exhibit label noise, outperforming state-of-the-art noisy-label learning methods for federated learning.
Paperid:3122
Authors:Qinglin Zhu · Runcong Zhao · Hanqi Yan · Yulan He · Yudong Chen · Lin Gui
Abstract:
Large Language Models (LLMs) struggle with complex reasoning due to limited diversity and inefficient search. We propose an embedding-based search framework that optimises the embedding of the first token to guide generation. It combines (1) Embedding perturbation for controlled exploration and (2) Bayesian optimisation to refine embeddings via a verifier-guided objective, balancing exploration and exploitation. This approach improves reasoning accuracy and coherence while avoiding reliance on heuristic search. Experiments demonstrate superior correctness with minimal computation, making it a scalable, model-agnostic solution.
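A stripped-down sketch of the search loop, with plain random perturbation standing in for the paper's Bayesian-optimisation refinement; the verifier, dimensions, and hyperparameters are toy assumptions:

```python
import numpy as np

def embedding_search(embed0, verifier, n_rounds=20, n_candidates=8,
                     sigma=0.5, seed=0):
    """Perturb a first-token embedding and keep the best-scoring candidate.

    `verifier` scores an embedding (higher is better); in the paper this
    role is played by a verifier-guided objective on the generated text.
    """
    rng = np.random.default_rng(seed)
    best, best_score = embed0, verifier(embed0)
    for _ in range(n_rounds):
        noise = rng.normal(scale=sigma, size=(n_candidates, embed0.size))
        for cand in best + noise:      # explore around the incumbent
            score = verifier(cand)
            if score > best_score:     # exploit the best found so far
                best, best_score = cand, score
    return best, best_score

# toy verifier: prefers embeddings close to a hidden "good" direction
target = np.ones(16)
verifier = lambda e: -np.linalg.norm(e - target)
start = np.zeros(16)
best, score = embedding_search(start, verifier)
print(score > verifier(start))  # the search improves the verifier score
```

Replacing the inner loop's random proposals with an acquisition function over a surrogate model is what turns this sketch into Bayesian optimisation proper.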
Paperid:3123
Authors:Steve Hanneke · Amirreza Shaeiri
Abstract:
List learning is an important topic in both theoretical and empirical machine learning research, playing a key role in the recent breakthrough result of (Brukhim et al., 2022) on the characterization of multiclass PAC learnability, as well as addressing label ambiguity in computer vision classification tasks, among others. In this paper, we study the problem of list transductive online learning. In this framework, the learner outputs a list of multiple labels for each instance rather than just one, as in traditional multiclass classification. In the realizable setting, we demonstrate a trichotomy of possible rates of the minimax number of mistakes. In particular, if the learner plays for $\text{T} \in \mathbb{N}$ rounds, its minimax number of mistakes can only be of the orders $\Theta(\text{T})$, $\Theta(\log \text{T})$, or $\Theta(1)$. This resolves an open question raised by (Hanneke et al., 2024). On the other hand, in the agnostic setting, we characterize the learnability by constructively proving the $\widetilde{\mathcal{O}}(\sqrt{\text{T}})$ upper bound on the minimax expected regret. Along the way, we also answer another open question asked by (Moran et al., 2023). To establish these results, we introduce two new combinatorial complexity dimensions, called the Level-constrained $(\mathrm{L+1})$-Littlestone dimension and the Level-constrained $(\mathrm{L+1})$-Branching dimension, where $\mathrm{L} \in \mathbb{N}$ is the list size. Eventually, we conclude our work by raising an open question regarding eliminating the list size factor, which seems to be a crucial step, as it has consistently appeared in previous works on this subject.
Paperid:3124
Authors:Shuai Yi · Yixiong Zou · Yuhua Li · Ruixuan Li
Abstract:
Cross-domain few-shot learning (CDFSL) aims to transfer knowledge from a data-sufficient source domain to data-scarce target domains. Although the Vision Transformer (ViT) has shown superior capability in many vision tasks, its transferability against huge domain gaps in CDFSL is still under-explored. In this paper, we find an intriguing phenomenon: during source-domain training, prompt tuning, a common way to train ViT, can be harmful for the generalization of ViT in target domains, while setting the prompts to random noises (i.e., random registers) consistently improves target-domain performance. We then delve into this phenomenon for an interpretation. We find that learnable prompts capture domain information during training on the source dataset, which views irrelevant visual patterns as vital cues for recognition. This can be viewed as a kind of overfitting and increases the sharpness of the loss landscape. In contrast, random registers are essentially a novel way of perturbing attention for sharpness-aware minimization, which helps the model find a flattened minimum in the loss landscape, increasing transferability. Based on this phenomenon and interpretation, we further propose a simple but effective approach for CDFSL that enhances the perturbation on attention maps by adding random registers on semantic regions of image tokens, improving the effectiveness and efficiency of random registers. Extensive experiments on four benchmarks validate our rationale and state-of-the-art performance. Our code will be released.
Paperid:3125
Authors:Tony Shen · Seonghwan Seo · Ross Irwin · Kieran Didi · Simon Olsson · Woo Youn Kim · Martin Ester
Abstract:
Many generative applications, such as synthesis-based 3D molecular design, involve constructing compositional objects with continuous features. Here, we introduce Compositional Generative Flows (CGFlow), a novel framework that extends flow matching to generate objects in compositional steps while modeling continuous states. Our key insight is that modeling compositional state transitions can be formulated as a straightforward extension of the flow matching interpolation process. We further build upon the theoretical foundations of generative flow networks (GFlowNets), enabling reward-guided sampling of compositional structures. We apply CGFlow to synthesizable drug design by jointly designing the molecule's synthetic pathway with its 3D binding pose. Our approach achieves state-of-the-art binding affinity among synthesis-based baselines on all 15 targets from the LIT-PCBA benchmark. Evaluation on PoseCheck reveals that molecules designed by our method exhibit a greater number of key protein-ligand interactions, demonstrating the value of 3D molecule and synthetic pathway co-design.
Paperid:3126
Authors:Emilia Magnani · Marvin Pförtner · Tobias Weber · Philipp Hennig
Abstract:
Neural operators generalize neural networks to learn mappings between function spaces from data. They are commonly used to learn solution operators of parametric partial differential equations (PDEs), or propagators of time-dependent PDEs. However, to make them useful in high-stakes simulation scenarios, their inherent predictive error must be quantified reliably. We introduce LUNO, a novel framework for approximate Bayesian uncertainty quantification in trained neural operators. Our approach leverages model linearization to push (Gaussian) weight-space uncertainty forward to the neural operator's predictions. We show that this can be interpreted as a probabilistic version of the concept of currying from functional programming, yielding a function-valued (Gaussian) random process belief. Our framework provides a practical yet theoretically sound way to apply existing Bayesian deep learning methods such as the linearized Laplace approximation or SWAG to neural operators. Just like the underlying neural operator, our approach is resolution-agnostic by design. The method adds minimal prediction overhead, can be applied post-hoc without retraining the network, and scales to large models and datasets. We evaluate these aspects in a case study on Fourier neural operators.
Paperid:3127
Authors:Harshvardhan Agarwal · Sunita Sarawagi
Abstract:
Large language models (LLMs) have demonstrated the capability to perform in-context learning (ICL) for completely unseen tasks in classification or language completion. Sequence-to-sequence (seq2seq) is another popular task category with several applications seeking quick adaptation with ICL. We present a systematic analysis of the ICL capability of LLMs on seq2seq tasks using a formal structured language pair. Our study reveals a critical limitation: except for very short input sequences, ICL fails to achieve consistent learning across all output positions. This exposes a fundamental weakness of modern LLMs: their inability to effectively uncover the alignment between input and output sequences. Consequently, this limitation results in incomplete induction heads, which are essential for in-context learning of new discrete mappings. To address these limitations, we propose ICA-Tune, a method for focused fine-tuning of an LLM using in-context examples. We present a mechanistic evaluation with two accuracy probes to show how input-output alignment emerges in middle layers of an LLM without direct supervision. This alignment leads to an abrupt jump in the completeness of the induction heads in higher layers. We show that, compared to standard fine-tuning, ICA-Tune enables more sample-efficient learning and better generalization to OOD instances.
Paperid:3128
Authors:RUNZHAO YANG · Xiaolong Wu · Zhihong Zhang · Fabian Zhang · Tingxiong Xiao · Zongren Li · Kunlun He · Jinli Suo
Abstract:
Recent advancements in computer vision have seen Implicit Neural Representations (INR) becoming a dominant representation form for data due to their compactness and expressive power. To solve various vision tasks with INR data, vision networks can either be purely INR-based, but are thereby limited by simplistic operations and performance constraints, or include raster-based methods, which then tend to lose crucial semantic information of the INR during the conversion process. To address these issues, we propose DVI, a novel Derivative-based Vision network for INR, capable of handling a variety of vision tasks across various data modalities, while achieving the best performance among existing methods by incorporating state-of-the-art raster-based methods into an INR-based architecture. DVI excels by extracting semantic information from the high-order derivative map of the INR, then seamlessly fusing it into a pre-existing raster-based vision network, enhancing its performance with deeper, task-relevant semantic insights. Extensive experiments on five vision tasks across three data modalities demonstrate DVI's superiority over existing methods. Additionally, our study encompasses comprehensive ablation studies to affirm the efficacy of each element of DVI, the influence of different derivative computation techniques, and the impact of derivative orders. Reproducible code is provided in the supplementary materials.
Paperid:3129
Authors:Patrick Leask · Neel Nanda · Noura Al Moubayed
Abstract:
Sparse autoencoders (SAEs) are a popular method for decomposing Large Language Model (LLM) activations into interpretable features. However, due to their substantial training cost, most academic research uses open-source SAEs, which are only available for a restricted set of models of up to 27B parameters. We introduce Inference-Time Decomposition of Activations (ITDA) models, which can be trained in just 0.1-1\% of the time required for SAEs, using only 0.1-1\% of the data. Despite this, ITDAs achieve at least 90\% of the reconstruction performance of SAEs and deliver comparable results on interpretability benchmarks. Unlike SAEs, ITDAs also enable cross-model comparisons. Notably, our ITDA-based representation similarity index outperforms both SVCCA and CKA in an LLM layer matching task.
Paperid:3130
Authors:Ermis Soumalias · Jakob Heiss · Jakob Weissteiner · Sven Seuken
Abstract:
We study the design of iterative combinatorial auctions (ICAs). The main challenge in this domain is that the bundle space grows exponentially in the number of items. To address this, recent work has proposed machine learning (ML)-based preference elicitation algorithms that aim to elicit only the most critical information from bidders to maximize efficiency. However, while the SOTA ML-based algorithms elicit bidders' preferences via value queries, ICAs that are used in practice elicit information via demand queries. In this paper, we introduce a novel ML algorithm that provably makes use of the full information from both value and demand queries, and we show via experiments that combining both query types results in significantly better learning performance in practice. Building on these insights, we present MLHCA, a new ML-powered auction that uses value and demand queries. MLHCA substantially outperforms the previous SOTA, reducing efficiency loss by up to a factor of 10, with up to 58\% fewer queries. Thus, MLHCA achieves large efficiency improvements while also reducing bidders' cognitive load, establishing a new benchmark for both practicability and efficiency.
Paperid:3131
Authors:Mohammad Mozaffari · Amir Yazdanbakhsh · Maryam Mehri Dehnavi
Abstract:
Conventional model compression techniques for LLMs address high memory consumption and slow inference challenges but typically require computationally expensive retraining to preserve accuracy. In contrast, one-shot compression methods eliminate retraining cost, but struggle to achieve accuracy comparable to dense models. This paper presents SLIM, a new one-shot compression framework that holistically integrates hardware-friendly quantization, sparsity, and low-rank approximation into a unified process. First, we formulate the quantization process using a probabilistic approach (SLIM-Quant) that enables us to apply uniform quantization. Then, we use an existing one-shot pruning method to apply semi-structured sparsity on top of the quantized weights. Finally, to compensate for the aggregated quantization and sparsity error, we use a novel saliency function with unique invertible and additive features that enables us to mathematically compute the value of low-rank adapters. SLIM improves model accuracy by up to 5.66% (LLaMA-2-7B) for 2:4 sparsity with 4-bit weight quantization, outperforming prior methods. Models compressed with SLIM achieve up to 3.78× and 3.75× speedups on Nvidia RTX 3060 and A100 GPUs, respectively. We also propose an optional PEFT recipe that further improves accuracy by up to 1.66% (LLaMA-2-13B) compared to SLIM without fine-tuning.
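The "low-rank adapter absorbs the compression error" step has a simple generic form: quantize the weights, then take a truncated SVD of the residual. The sketch below shows that generic construction; SLIM's saliency-weighted, invertible variant is more refined than this:

```python
import numpy as np

def quantize(w, bits=4):
    """Per-tensor symmetric uniform quantization (an assumption)."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def low_rank_compensation(W, bits=4, rank=8):
    """Approximate W by quantized weights plus a low-rank correction L @ R."""
    Q = quantize(W, bits)
    # best rank-`rank` approximation of the compression error, via SVD
    U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
    L = U[:, :rank] @ np.diag(S[:rank])  # adapter factor A
    R = Vt[:rank]                        # adapter factor B
    return Q, L, R

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 128))
Q, L, R = low_rank_compensation(W)
err_plain = np.linalg.norm(W - Q)
err_comp = np.linalg.norm(W - (Q + L @ R))
print(err_comp < err_plain)  # the adapter recovers part of the error
```

At inference the adapter is just an extra thin matmul, which is why this style of compensation keeps the hardware-friendly 4-bit weight path intact.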
Paperid:3132
Authors:Mattia Opper · Siddharth N
Abstract:
We present Banyan, a model that efficiently learns semantic representations by leveraging explicit hierarchical structure. While transformers excel at scale, they struggle in low-resource settings. Conversely, recent structured models have shown promise as efficient learners, but lack performance. Banyan bridges this gap with two key innovations: an entangled hierarchical tree structure and diagonalized message passing, enabling it to outperform larger transformer models with just 14 non-embedding parameters. It excels in low-resource settings, offering a viable alternative for under-represented languages and highlighting its potential for efficient, interpretable NLP in resource-constrained environments.
Paperid:3133
Authors:Kyle Richardson · Vivek Srikumar · Ashish Sabharwal
Abstract:
Recent direct preference alignment algorithms (DPA), such as DPO, have shown great promise in aligning large language models to human preferences. While this has motivated the development of many new variants of the original DPO loss, understanding the differences between these recent proposals, as well as developing new DPA loss functions, remains difficult given the lack of a technical and conceptual framework for reasoning about the underlying semantics of these algorithms. In this paper, we attempt to remedy this by formalizing DPA losses in terms of discrete reasoning problems. Specifically, we ask: Given an existing DPA loss, can we systematically derive a symbolic program that characterizes its semantics? We propose a novel formalism for characterizing preference losses for single model and reference model based approaches, and identify symbolic forms for a number of commonly used DPA variants. Further, we show how this formal view of preference learning sheds new light on both the size and structure of the DPA loss landscape, making it possible to not only rigorously characterize the relationships between recent loss proposals but also to systematically explore the landscape and derive new loss functions from first principles. We hope our framework and findings will help provide useful guidance to those working on human AI alignment.
Paperid:3134
Authors:Zixiang Ai · Zichen Liu · Yuanhang Lei · Zhenyu Cui · Xu Zou · Jiahuan Zhou
Abstract:
Pretrained 3D vision models have gained significant attention for their promising performance on point cloud data. However, fully fine-tuning these models for downstream tasks is computationally expensive and storage-intensive. Existing parameter-efficient fine-tuning (PEFT) approaches, which focus primarily on input token prompting, struggle to achieve competitive performance due to their limited ability to capture the geometric information inherent in point clouds. To address this challenge, we propose a novel Geometry-Aware Point Cloud Prompt (GAPrompt) that leverages geometric cues to enhance the adaptability of 3D vision models. First, we introduce a Point Prompt that serves as an auxiliary input alongside the original point cloud, explicitly guiding the model to capture fine-grained geometric details. Additionally, we present a Point Shift Prompter designed to extract global shape information from the point cloud, enabling instance-specific geometric adjustments at the input level. Moreover, our proposed Prompt Propagation mechanism incorporates the shape information into the model's feature extraction process, further strengthening its ability to capture essential geometric characteristics. Extensive experiments demonstrate that GAPrompt significantly outperforms state-of-the-art PEFT methods and achieves competitive results compared to full fine-tuning on various benchmarks, while utilizing only 2.19\% of trainable parameters.
Paperid:3135
Authors:Qifang Zhao · Weidong Ren · Tianyu Li · Hong Liu · Xingsheng He · Xiaoxiao Xu
Abstract:
We introduce GraphGPT, a novel self-supervised generative pre-trained model for graph learning based on the Graph Eulerian Transformer (GET). First, we propose GET, which combines a standard transformer encoder or decoder architecture with an innovative graph-to-sequence transformation method. This method converts graphs or sampled subgraphs into sequences of tokens representing nodes, edges, and attributes in a reversible manner using Eulerian paths. We pre-train GET using either of two self-supervised tasks: next-token prediction (NTP) and scheduled masked-token prediction (SMTP). The pre-trained model is then fine-tuned for downstream tasks such as graph-, edge-, and node-level prediction. Despite its simplicity, GraphGPT achieves performance comparable to or surpassing state-of-the-art methods on multiple large-scale Open Graph Benchmark (OGB) datasets. It demonstrates exceptional results on the molecular property prediction dataset PCQM4Mv2 and the protein-protein interaction dataset ogbl-ppa. Notably, generative pre-training enables scaling GraphGPT to 2 billion parameters while maintaining performance gains, a breakthrough that overcomes the scalability limitations of traditional Graph Neural Networks (GNNs) and prior graph transformers (GTs). To advance research in graph foundation models and facilitate scientific discovery in chemistry, materials science, and related fields, we will release the source code and pre-trained checkpoints.
Paperid:3136
Authors:Rajiv Movva · Kenny Peng · Nikhil Garg · Jon Kleinberg · Emma Pierson
Abstract:
We describe HypotheSAEs, a general method to hypothesize interpretable relationships between text data (e.g., headlines) and a target variable (e.g., clicks). HypotheSAEs has three steps: (1) train a sparse autoencoder on text embeddings to produce interpretable features describing the data distribution, (2) select features that predict the target variable, and (3) generate a natural language interpretation of each feature (e.g., "mentions being surprised or shocked") using an LLM. Each interpretation serves as a hypothesis about what predicts the target variable. Compared to baselines, our method better identifies reference hypotheses on synthetic datasets (at least +0.06 in F1) and produces more predictive hypotheses on real datasets (~twice as many significant findings), despite requiring 1-2 orders of magnitude less compute than recent LLM-based methods. HypotheSAEs also produces novel discoveries on two well-studied tasks: explaining partisan differences in Congressional speeches and identifying drivers of engagement with online headlines.
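Step (2) of the pipeline, selecting autoencoder features that predict the target, can be sketched with a simple correlation ranking. This is an illustrative stand-in, not the paper's selection procedure; the random activation matrix and the planted signal on feature 3 are assumptions made purely for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: sparse-autoencoder feature activations (n_texts x n_features)
# and a target variable (e.g., click counts). Feature 3 is constructed to
# drive the target; the remaining features are noise.
n, d = 200, 8
acts = rng.random((n, d))
target = 2.0 * acts[:, 3] + 0.1 * rng.standard_normal(n)

def top_predictive_features(acts, target, k=2):
    """Rank features by absolute Pearson correlation with the target."""
    a = (acts - acts.mean(0)) / acts.std(0)
    t = (target - target.mean()) / target.std()
    corr = a.T @ t / len(t)
    return list(np.argsort(-np.abs(corr))[:k])

print(top_predictive_features(acts, target))  # feature 3 ranks first
```

In the real method each selected feature index would then be passed to step (3), where an LLM names the pattern the feature fires on.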
Paperid:3137
Authors:Adrien Cortes · Remi Rehm · Victor Letzelter
Abstract:
We introduce $\texttt{TimeMCL}$, a method leveraging the Multiple Choice Learning (MCL) paradigm to forecast multiple plausible time series futures. Our approach employs a neural network with multiple heads and utilizes the Winner-Takes-All (WTA) loss to promote diversity among predictions. MCL has recently gained attention due to its simplicity and ability to address ill-posed and ambiguous tasks. We propose an adaptation of this framework for time-series forecasting, presenting it as an efficient method to predict diverse futures, which we relate to its implicit *quantization* objective. We provide insights into our approach using synthetic data and evaluate it on real-world time series, demonstrating its promising performance at a light computational cost.
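The Winner-Takes-All loss mentioned here has a very compact core: each head produces a forecast, but only the head closest to the realized future incurs (and would backpropagate) loss, which pushes the heads apart into distinct hypotheses. A minimal NumPy sketch under assumed toy data (not the paper's model or training loop):

```python
import numpy as np

def wta_loss(preds, target):
    """Winner-Takes-All loss for a multi-head forecaster: only the head
    closest to the target contributes loss for this sample."""
    errs = ((preds - target) ** 2).mean(axis=-1)  # per-head MSE
    winner = int(np.argmin(errs))
    return errs[winner], winner

# Three hypothesis heads forecasting a length-4 horizon.
target = np.array([1.0, 2.0, 3.0, 4.0])
preds = np.array([
    [0.0, 0.0, 0.0, 0.0],  # head 0: far off
    [1.1, 2.1, 2.9, 4.0],  # head 1: close
    [4.0, 3.0, 2.0, 1.0],  # head 2: reversed
])
loss, winner = wta_loss(preds, target)
print(winner)  # head 1 wins
```

Because the losing heads receive no gradient for this sample, different heads specialize on different plausible futures, which is the implicit quantization behavior the abstract refers to.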
Paperid:3138
Authors:Jack Cai · Ammar Vora · Randolph Zhang · Mark O'Connor · Mark Jeffrey
Abstract:
As Large Language Models (LLMs) grow in size and supported context length, efficient inference strategies are essential to maintain low-latency token generation. Unfortunately, the traditional approaches of tensor and data parallelism face diminishing returns when scaling across multiple devices. We propose a novel form of parallelism---attention-level speculative parallelism (ALSpec)---that predicts self-attention outputs to execute subsequent operations early on separate devices. Our approach overlaps attention and non-attention computations, reducing the attention latency overhead at 128K context length by up to 5x and improving end-to-end decode latency by up to 1.65x, all without sacrificing quality. We establish the fundamental pillars for speculative execution and provide an execution paradigm that simplifies implementation. We show that existing attention-approximation methods perform well on simple information retrieval tasks, but they fail in advanced reasoning and math. Combined with speculative execution, we can approximate up to 90% of self-attention without harming model correctness. Demonstrated on Tenstorrent's NPU devices, we scale up LLM inference beyond current techniques, paving the way for faster inference in transformer-based models.
Paperid:3139
Authors:Semyon Savkin · Eitan Porat · Or Ordentlich · Yury Polyanskiy
Abstract:
Post-training quantization (PTQ) has emerged as a critical technique for efficient deployment of large language models (LLMs). This work proposes NestQuant, a novel PTQ scheme for weights and activations that is based on self-similar nested lattices. Ordentlich and Polyanskiy have mathematically shown such quantizers to be information-theoretically optimal for low-precision matrix multiplication. We implement a practical low-complexity version of NestQuant based on the Gosset lattice, making it a drop-in quantizer for any matrix multiplication step (e.g., in self-attention, MLPs, etc.). For example, NestQuant quantizes weights, KV-cache, and activations of Llama-3-8B to 4 bits, achieving perplexity of 6.6 on wikitext2. This represents more than a 55% reduction in the perplexity gap with respect to the unquantized model (perplexity of 6.14) compared to the state-of-the-art method SpinQuant (perplexity of 7.3). Comparisons on various LLM evaluation benchmarks also show a reduction in performance degradation induced by quantization.
Paperid:3140
Authors:Megan Tjandrasuwita · Chanakya Ekbote · Liu Ziyin · Paul Pu Liang
Abstract:
Multimodal representation learning is fundamentally about transforming incomparable modalities into comparable representations. While prior research has primarily focused on explicitly aligning these representations through targeted learning objectives and model architectures, a recent line of work has found that independently trained unimodal models of increasing scale and performance can become implicitly aligned with each other. These findings raise fundamental questions regarding the emergence of aligned representations in multimodal learning. Specifically: (1) when and why does alignment emerge implicitly? and (2) is alignment a reliable indicator of performance? Through a comprehensive empirical investigation, we demonstrate that both the emergence of alignment and its relationship with task performance depend on several critical data characteristics. These include, but are not necessarily limited to, the degree of similarity between the modalities and the balance between redundant and unique information they provide for the task. Our findings suggest that alignment may not be universally beneficial; rather, its impact on performance varies depending on the dataset and task. These insights can help practitioners determine whether increasing alignment between modalities is advantageous or, in some cases, detrimental to achieving optimal performance.
Paperid:3141
Authors:Gefan Yang · Frank van der Meulen · Stefan Sommer
Abstract:
We propose a novel method for simulating conditioned diffusion processes (diffusion bridges) in Euclidean spaces. By training a neural network to approximate bridge dynamics, our approach eliminates the need for computationally intensive Markov Chain Monte Carlo (MCMC) methods or reverse-process modeling. Compared to existing methods, it offers greater robustness across various diffusion specifications and conditioning scenarios. This applies in particular to rare events and multimodal distributions, which pose challenges for score-learning- and MCMC-based approaches. We propose a flexible variational family for approximating the diffusion bridge path measure which is partially specified by a neural network. Once trained, it enables efficient independent sampling at a cost comparable to sampling the unconditioned (forward) process.
Paperid:3142
Authors:Junyan Liu · ARNAB MAITI · Artin Tajdini · Kevin Jamieson · Lillian Ratliff
Abstract:
We initiate the study of a repeated principal-agent problem over a finite horizon $T$, where a principal sequentially interacts with $K\geq 2$ types of agents arriving in an *adversarial* order. At each round, the principal strategically chooses one of the $N$ arms to incentivize for an arriving agent of *unknown type*. The agent then chooses an arm based on its own utility and the provided incentive, and the principal receives a corresponding reward. The objective is to minimize regret against the best incentive in hindsight. Without prior knowledge of agent behavior, we show that the problem becomes intractable, leading to linear regret. We analyze two key settings where sublinear regret is achievable. In the first setting, the principal knows the arm each agent type would select greedily for any given incentive. Under this setting, we propose an algorithm that achieves a regret bound of $\mathcal{O}(\min(\sqrt{KT\log N},K\sqrt{T}))$ and provide a matching lower bound up to a $\log K$ factor. In the second setting, an agent's response varies smoothly with the incentive and is governed by a Lipschitz constant $L$. Under this setting, we show that there is an algorithm with a regret bound of $\tilde{\mathcal{O}}((LN)^{1/3}T^{2/3})$ and establish a matching lower bound up to logarithmic factors. Finally, we extend our algorithmic results for both settings by allowing the principal to incentivize multiple arms simultaneously in each round.
Paperid:3143
Authors:Xiao Huang · Xu Liu · Enze Zhang · Tong Yu · Shuai Li
Abstract:
Offline-to-online Reinforcement Learning (O2O RL) aims to perform online fine-tuning on an offline pre-trained policy to minimize costly online interactions. Existing work has used offline datasets to generate data that conform to the online data distribution for data augmentation. However, generated data still exhibit a gap with the online data, limiting overall performance. To address this, we propose a new data augmentation approach, Classifier-Free Diffusion Generation (CFDG). Without introducing additional classifier training overhead, CFDG leverages classifier-free guidance diffusion to significantly enhance the generation quality of offline and online data with different distributions. Additionally, it employs a reweighting method to enable more generated data to align with the online data, enhancing performance while maintaining the agent's stability. Experimental results show that CFDG outperforms replaying the two data types or using a standard diffusion model to generate new data. Our method is versatile and can be integrated with existing offline-to-online RL algorithms. By implementing CFDG on the popular methods IQL, PEX, and APL, we achieve a notable 15% average improvement in empirical performance on D4RL benchmarks such as MuJoCo and AntMaze.
Paperid:3144
Authors:Junjie Yao · zhongwang zhang · Zhi-Qin John Xu
Abstract:
Transformer-based Large Language Models (LLMs) have revolutionized Natural Language Processing by demonstrating exceptional performance across diverse tasks. This study investigates the impact of the parameter initialization scale on the training behavior and task preferences of LLMs. We discover that smaller initialization scales encourage models to favor reasoning tasks, whereas larger initialization scales lead to a preference for memorization tasks. We validate this reasoning bias via real datasets and meticulously designed anchor functions. Further analysis of initial training dynamics suggests that specific model components, particularly the embedding space and self-attention mechanisms, play pivotal roles in shaping these learning biases. We provide a theoretical framework from the perspective of model training dynamics to explain these phenomena. Additionally, experiments on real-world language tasks corroborate our theoretical insights. This work enhances our understanding of how initialization strategies influence LLM performance on reasoning tasks and offers valuable guidelines for training models.
Paperid:3145
Authors:Bonan Li · Yinhan Hu · Songhua Liu · Xinchao Wang
Abstract:
Layout-to-Image generation aims to create complex scenes with precise control over the placement and arrangement of subjects. Existing works have demonstrated that pre-trained Text-to-Image diffusion models can achieve this goal without training on any specific data; however, they often face challenges with imprecise localization and unrealistic artifacts. Focusing on these drawbacks, we propose a novel training-free method, WinWinLay. At its core, WinWinLay presents two key strategies, Non-local Attention Energy Function and Adaptive Update, that collaboratively enhance control precision and realism. On one hand, we theoretically demonstrate that the commonly used attention energy function introduces inherent spatial distribution biases, hindering objects from being uniformly aligned with layout instructions. To overcome this issue, a non-local attention prior is explored to redistribute attention scores, facilitating objects to better conform to the specified spatial conditions. On the other hand, we identify that the vanilla backpropagation update rule can cause deviations from the pre-trained domain, leading to out-of-distribution artifacts. We accordingly introduce a Langevin dynamics-based adaptive update scheme as a remedy that promotes in-domain updating while respecting layout constraints. Extensive experiments demonstrate that WinWinLay excels in controlling element placement and achieving photorealistic visual fidelity, outperforming the current state-of-the-art methods. Codes will be released.
Paperid:3146
Authors:Runtian Zhai · Kai Yang · Burak VARICI · Che-Ping Tsai · Zico Kolter · Pradeep Ravikumar
Abstract:
Despite the empirical success of foundation models, we do not have a systematic characterization of the representations that these models learn. We show that a large class of representation learning methods are characterized by the relationship between the inputs and a context variable. Specifically, we show that many popular representation learning methods aim to extract the top-$d$ singular functions of an operator induced by the context, in which case we say that the representation learns the contexture. We demonstrate the generality of our framework by proving that representation learning within a number of learning paradigms --- supervised, self-supervised, and manifold learning --- can be characterized via contextures. One important consequence of our analytical framework is that once we can approximate the optimal contextures (the top singular functions of the contexture operator), scaling up the model size yields diminishing returns. Consequently, we posit that further improvement requires creating more useful contexts from data in a principled way, and such contexts can lead to better encoders. To this end, we study how to efficiently evaluate the usefulness of a context without knowing the downstream tasks. We propose a metric and show by experiments that it correlates well with the actual performance of the encoder on many real datasets.
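For a finite input space and context space, the "top singular functions of an operator induced by the context" can be made concrete: build the normalized joint-distribution matrix $P(x,c)/\sqrt{p(x)p(c)}$ and take its SVD. The sketch below uses a random toy joint distribution (an assumption for the demo, not data from the paper); the leading singular value of this operator is exactly 1, corresponding to the constant function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution over inputs x (rows) and contexts c (columns).
P = rng.random((6, 5))
P /= P.sum()
px, pc = P.sum(1), P.sum(0)   # marginals p(x), p(c)

# Matrix form of the context-induced operator: P(x,c) / sqrt(p(x) p(c)).
M = P / np.sqrt(np.outer(px, pc))
U, s, Vt = np.linalg.svd(M)

d = 3
# Top-d singular functions of x (dividing out sqrt(p(x)) turns singular
# vectors back into functions); these form the learned representation.
encoder = U[:, :d] / np.sqrt(px)[:, None]
print(np.round(s[:d], 3))  # s[0] == 1.0: the constant function
```

In this picture, "learning the contexture" means recovering the span of the columns of `encoder`; once that span is captured, a larger model adds nothing, which is the diminishing-returns claim in the abstract.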
Paperid:3147
Authors:Tomohiro Shiraishi · Tatsuya Matsukawa · Shuichi Nishino · Ichiro Takeuchi
Abstract:
A data analysis pipeline is a structured sequence of steps that transforms raw data into meaningful insights by integrating various analysis algorithms. In this paper, we propose a novel statistical test to assess the significance of data analysis pipelines. Our approach enables the systematic development of valid statistical tests applicable to any feature selection pipeline composed of predefined components. We develop this framework based on selective inference, a statistical technique that has recently gained attention for data-driven hypotheses. As a proof of concept, we focus on feature selection pipelines for linear models, composed of three missing value imputation algorithms, three outlier detection algorithms, and three feature selection algorithms. We theoretically prove that our statistical test can control the probability of false positive feature selection at any desired level, and demonstrate its validity and effectiveness through experiments on synthetic and real data. Additionally, we present an implementation framework that facilitates testing across any configuration of these feature selection pipelines without extra implementation costs.
Paperid:3148
Authors:Francis Bach · Saeed Saremi
Abstract:
Gaussian smoothing combined with a probabilistic framework for denoising via the empirical Bayes formalism, i.e., the Tweedie-Miyasawa formula (TMF), are the two key ingredients in the success of score-based generative models in Euclidean spaces. Smoothing holds the key for easing the problem of learning and sampling in high dimensions, denoising is needed for recovering the original signal, and TMF ties these together via the score function of noisy data. In this work, we extend this paradigm to the problem of learning and sampling the distribution of binary data on the Boolean hypercube by adopting Bernoulli noise, instead of Gaussian noise, as a smoothing device. We first derive a TMF-like expression for the optimal denoiser for the Hamming loss, where a score function naturally appears. Sampling noisy binary data is then achieved using a Langevin-like sampler which we theoretically analyze for different noise levels. At high Bernoulli noise levels sampling becomes easy, akin to log-concave sampling in Euclidean spaces. In addition, we extend the sequential multi-measurement sampling of Saremi et al. (2024) to the binary setting where we can bring the "effective noise" down by sampling multiple noisy measurements at a fixed noise level, without the need for continuous-time stochastic processes. We validate our formalism and theoretical findings by experiments on synthetic data and binarized images.
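The Bernoulli-smoothing-plus-Hamming-optimal-denoising mechanics can be shown in a few lines for the simplest possible prior. Assumptions here: bits are i.i.d. with a known prior $q = P(x{=}1)$, so Bayes' rule gives the per-bit posterior in closed form and the Hamming-optimal denoiser just thresholds it at 1/2. This degenerate i.i.d. case is a toy, not the paper's method; for coupled bits the posterior is what the learned score function supplies.

```python
import random

random.seed(0)
p, q, n = 0.3, 0.9, 2000  # flip probability, prior P(bit = 1), number of bits

x = [int(random.random() < q) for _ in range(n)]  # clean binary data
y = [b ^ (random.random() < p) for b in x]        # Bernoulli smoothing: flip w.p. p

def posterior_one(yi, p, q):
    """P(x=1 | y) by Bayes' rule for a bit sent through flip-prob-p noise."""
    like1 = (1 - p) if yi else p        # P(y | x = 1)
    like0 = p if yi else (1 - p)        # P(y | x = 0)
    return like1 * q / (like1 * q + like0 * (1 - q))

# Hamming-optimal denoiser: threshold the per-bit posterior at 1/2.
xhat = [int(posterior_one(yi, p, q) > 0.5) for yi in y]

acc_noisy = sum(a == b for a, b in zip(x, y)) / n
acc_denoised = sum(a == b for a, b in zip(x, xhat)) / n
print(f"noisy: {acc_noisy:.2f}  denoised: {acc_denoised:.2f}")
```

With this strong prior the posterior exceeds 1/2 for either observation, so the denoiser outputs all ones, yet that already lifts Hamming accuracy from roughly the channel's $1-p$ up to roughly $q$; the interesting cases, where bits are dependent, require the score function the abstract derives.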
Paperid:3149
Authors:Zeyu Jia · Alexander Rakhlin · Tengyang Xie
Abstract:
Process and outcome supervision represent two fundamental approaches to reinforcement learning, especially for complex reasoning tasks in large language models. While process supervision offers intuitive advantages for long-term credit assignment, the precise relationship between these paradigms has remained an open question. Conventional wisdom suggests that outcome supervision is fundamentally more challenging due to the trajectory-level coverage problem, leading to significant investment in collecting fine-grained process supervision data. In this paper, we provide a possible theoretical resolution to this debate. Perhaps surprisingly, our main theorem shows that: under standard data coverage assumptions, reinforcement learning through outcome supervision is no more statistically difficult than through process supervision. At the core of this result lies the novel Change of Trajectory Measure Lemma --- a powerful technical tool that bridges return-based trajectory measure and step-level distribution shift. Furthermore, for settings with access to a verifier or a rollout capability, we prove that any policy's advantage function can serve as an optimal process reward model, providing a simple yet powerful connection between outcome and process supervision. These findings suggest that the empirically observed performance gap between outcome and process supervision likely stems from algorithmic limitations rather than inherent statistical difficulties, potentially transforming how we approach data and algorithm design for reinforcement learning.
Paperid:3150
Authors:Yutong He · Pengrui Li · Yipeng Hu · Chuyan Chen · Kun Yuan
Abstract:
Subspace optimization algorithms, with GaLore \citep{zhao2024galore} as a representative method, have gained popularity for pre-training or fine-tuning large language models (LLMs) due to their memory efficiency. However, their convergence guarantees remain unclear, particularly in stochastic settings. In this paper, we unexpectedly discover that GaLore does not always converge to the optimal solution and substantiate this finding with an explicit counter-example. We then investigate the conditions under which GaLore can achieve convergence, demonstrating that it does so either in deterministic scenarios or when using a sufficiently large mini-batch size. More significantly, we introduce GoLore (Gradient random Low-rank projection), a novel variant of GaLore that provably converges in stochastic settings, even with standard batch sizes. Our convergence analysis can be readily extended to other sparse subspace optimization algorithms. Finally, we conduct numerical experiments to validate our theoretical results and empirically explore the proposed mechanisms.
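The one mechanical difference between the two methods is how the low-rank projector is chosen: GaLore takes the top-$r$ left singular vectors of the current gradient, while GoLore draws a random $r$-dimensional subspace. The sketch below shows only that projection step on a synthetic gradient (an assumption; this is not either paper's full optimizer). On a single noiseless gradient the SVD projector keeps more energy, but the abstract's point is that under stochastic noise SVD directions can chase the noise, whereas random projections stay unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 32, 4
G = rng.standard_normal((m, n))  # stand-in for a weight matrix's gradient

# GaLore-style projector: top-r left singular vectors of the gradient.
U, _, _ = np.linalg.svd(G, full_matrices=False)
P_svd = U[:, :r]

# GoLore-style projector: a random r-dimensional orthonormal subspace.
P_rand, _ = np.linalg.qr(rng.standard_normal((m, r)))

kept = {}
for name, P in [("svd", P_svd), ("random", P_rand)]:
    g_small = P.T @ G            # r x n: all the optimizer state stores
    G_back = P @ g_small         # lifted back to m x n for the update
    kept[name] = np.linalg.norm(G_back) ** 2 / np.linalg.norm(G) ** 2
print({k: round(v, 2) for k, v in kept.items()})
```

Either way, the optimizer's moment statistics live in the $r \times n$ matrix `g_small` rather than the full $m \times n$ gradient, which is the source of the memory savings.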
Paperid:3151
Authors:Se Jin Park · Julian Salazar · Aren Jansen · Keisuke Kinoshita · Yong Man Ro · RJ Skerry-Ryan
Abstract:
We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, current spoken language models struggle to generate plausible speech past tens of seconds, from high temporal resolution of speech tokens causing loss of coherence, to architectural issues with long-sequence training or extrapolation, to memory costs at inference time. With these considerations we propose SpeechSSM, the first speech language model to learn from and sample long-form spoken audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates, based on recent advances in linear-time sequence modeling. Furthermore, to address growing challenges in spoken language evaluation, especially in this new long-form setting, we propose: new embedding-based and LLM-judged metrics; quality measurements over length and time; and a new benchmark for long-form speech processing and generation, LibriSpeech-Long. Speech samples and the dataset are included.
Paperid:3152
Authors:Weihuang Zheng · Jiashuo Liu · Jiaxing Li · Jiayun Wu · Peng Cui · Youyong Kong
Abstract:
Graph Neural Networks (GNNs) are widely used for node classification tasks but often fail to generalize when training and test nodes come from different distributions, limiting their practicality. To address this challenge, recent approaches have adopted invariant learning and sample reweighting techniques from the out-of-distribution (OOD) generalization field. However, invariant learning-based methods face difficulties when applied to graph data, as they rely on the impractical assumption of obtaining real environment labels and strict invariance, which may not hold in real-world graph structures. Moreover, current sample reweighting methods tend to overlook topological information, potentially leading to suboptimal results. In this work, we introduce the Topology-Aware Dynamic Reweighting (TAR) framework to address distribution shifts by leveraging the inherent graph structure. TAR dynamically adjusts sample weights through gradient flow on the graph edges during training. Instead of relying on strict invariance assumptions, we theoretically prove that our method is able to provide distributional robustness, thereby enhancing the out-of-distribution generalization performance on graph data. Our framework's superiority is demonstrated through standard testing on extensive node classification OOD datasets, exhibiting marked improvements over existing methods.
Paperid:3153
Authors:Yucheng Yang · Tianyi Zhou · Meng Fang · Mykola Pechenizkiy
Abstract:
Practical reinforcement learning (RL) usually requires agents to be optimized for multiple potentially conflicting criteria, e.g. speed vs. safety. Although Multi-Objective RL (MORL) algorithms have been studied in previous works, their trained agents often cover limited Pareto optimal solutions and they lack precise controllability of the delicate trade-off among multiple objectives. Hence, the resulting agent is not versatile in aligning with customized requests from different users. To bridge the gap, we develop the "Preference Controllable (PC) RL" framework, which trains a preference-conditioned meta-policy that takes user preference as input, controlling the generated trajectories within the preference region on the Pareto frontier. The PCRL framework is compatible with advanced Multi-Objective Optimization (MOO) algorithms that are rarely seen in previous MORL approaches. We also propose a novel preference-regularized MOO algorithm specifically for PCRL. We provide a comprehensive theoretical analysis to justify its convergence and preference controllability. We evaluate PCRL with different MOO algorithms against state-of-the-art MORL baselines in various challenging environments with up to six objectives. In these experiments, our proposed method exhibits significantly better controllability than existing approaches and can generate Pareto solutions with better diversity and utilities.
Paperid:3154
Authors:Yifan Hu · Guibin Zhang · Peiyuan Liu · Disen Lan · Naiqi Li · Dawei Cheng · Tao Dai · Shutao Xia · Shirui Pan
Abstract:
Time series forecasting methods generally fall into two main categories: Channel Independent (CI) and Channel Dependent (CD) strategies. While CI overlooks important covariate relationships, CD captures all dependencies without distinction, introducing noise and reducing generalization. Recent advances in Channel Clustering (CC) aim to refine dependency modeling by grouping channels with similar characteristics and applying tailored modeling techniques. However, coarse-grained clustering struggles to capture complex, time-varying interactions effectively. To address these challenges, we propose TimeFilter, a GNN-based framework for adaptive and fine-grained dependency modeling. After constructing the graph from the input sequence, TimeFilter refines the learned spatial-temporal dependencies by filtering out irrelevant correlations while preserving the most critical ones in a patch-specific manner. Extensive experiments on 13 real-world datasets from diverse application domains demonstrate the state-of-the-art performance of TimeFilter. The code is available at \url{https://anonymous.4open.science/r/TimeFilter-C08C}.
Paperid:3155
Authors:Max van Spengler · Pascal Mettes
Abstract:
Embedding tree-like data, from hierarchies to ontologies and taxonomies, forms a well-studied problem for representing knowledge across many domains. Hyperbolic geometry provides a natural solution for embedding trees, with vastly superior performance over Euclidean embeddings. Recent literature has shown that hyperbolic tree embeddings can even be placed on top of neural networks for hierarchical knowledge integration in deep learning settings. For all applications, a faithful embedding of trees is needed, with combinatorial constructions emerging as the most effective direction. This paper identifies and solves two key limitations of existing works. First, the combinatorial construction hinges on finding highly separated points on a hypersphere, a notoriously difficult problem. Current approaches achieve poor separation, degrading the quality of the corresponding hyperbolic embedding. We propose highly separated Delaunay tree embeddings (HS-DTE), which integrates angular separation in a generalized formulation of Delaunay embeddings, leading to lower embedding distortion. Second, low distortion requires additional precision. The current approach for increasing precision is to use multiple precision arithmetic, which renders the embeddings useless on GPUs in deep learning settings. We reformulate the combinatorial construction using floating point expansion arithmetic, leading to superior embedding quality while retaining utility on accelerated hardware.
Paperid:3156
Authors:Etienne Gauthier · Francis Bach · Michael Jordan
Abstract:
As platforms increasingly rely on learning algorithms, collectives may form and seek ways to influence these platforms to align with their own interests. This can be achieved by coordinated submission of altered data. To evaluate the potential impact of such behavior, it is essential to understand the computations that collectives must perform to impact platforms in this way. In particular, collectives need to make a priori assessments of the effect of the collective before taking action, as they may face potential risks when modifying their data. Moreover, they need to develop implementable coordination algorithms based on quantities that can be inferred from observed data. We develop a framework that provides a theoretical and algorithmic treatment of these issues and present experimental results in a product evaluation domain.
Paperid:3157
Authors:Wenhao Wang · Yifan Sun · Zongxin Yang · Zhentao Tan · Zhengdong Hu · Yi Yang
Abstract:
Text-guided image-to-image diffusion models excel in translating images based on textual prompts, allowing for precise and creative visual modifications. However, such a powerful technique can be misused for *spreading misinformation*, *infringing on copyrights*, and *evading content tracing*. This motivates us to introduce the task of origin **ID**entification for text-guided **I**mage-to-image **D**iffusion models (**ID$\mathbf{^2}$**), aiming to retrieve the original image of a given translated query. A straightforward solution to ID$^2$ involves training a specialized deep embedding model to extract and compare features from both query and reference images. However, due to *visual discrepancy* across generations produced by different diffusion models, this similarity-based approach fails when training on images from one model and testing on those from another, limiting its effectiveness in real-world applications. To solve this challenge of the proposed ID$^2$ task, we contribute the first dataset and a theoretically guaranteed method, both emphasizing generalizability. The curated dataset, **OriPID**, contains abundant **Ori**gins and guided **P**rompts, which can be used to train and test potential **ID**entification models across various diffusion models. In the method section, we first prove the *existence* of a linear transformation that minimizes the distance between the pre-trained Variational Autoencoder embeddings of generated samples and their origins. Subsequently, it is demonstrated that such a simple linear transformation can be *generalized* across different diffusion models. Experimental results show that the proposed method achieves satisfying generalization performance, significantly surpassing similarity-based methods (+31.6% mAP), even those with generalization designs.
Paperid:3158
Authors:Runquan Gui · Zhihai Wang · Jie Wang · Chi Ma · Huiling Zhen · Mingxuan Yuan · Jianye Hao · Defu Lian · Enhong Chen · Feng Wu
Abstract:
Recent advancements have significantly enhanced the performance of large language models (LLMs) in tackling complex reasoning tasks, achieving notable success in domains like mathematical and logical reasoning. However, these methods encounter challenges with complex planning tasks, primarily due to extended reasoning steps, diverse constraints, and the challenge of handling multiple distinct sub-tasks. To address these challenges, we propose HyperTree Planning (HTP), a novel reasoning paradigm that constructs hypertree-structured planning outlines for effective planning. The hypertree structure enables LLMs to engage in hierarchical thinking by flexibly employing the divide-and-conquer strategy, effectively breaking down intricate reasoning steps, accommodating diverse constraints, and managing multiple distinct sub-tasks in a well-organized manner. We further introduce an autonomous planning framework that completes the planning process by iteratively refining and expanding the hypertree-structured planning outlines. Experiments demonstrate the effectiveness of HTP, achieving state-of-the-art accuracy on the TravelPlanner benchmark with Gemini-1.5-Pro, resulting in a 3.6$\times$ performance improvement over o1-preview.
Paperid:3159
Authors:Xu Wang · Yan Hu · Wenyu Du · Reynold Cheng · Benyou Wang · Difan Zou
Abstract:
Fine-tuning significantly improves the performance of Large Language Models (LLMs), yet its underlying mechanisms remain poorly understood. This paper aims to provide an in-depth interpretation of the fine-tuning process through circuit analysis, a popular tool in Mechanistic Interpretability (MI). Unlike previous studies (Prakash et al. 2024, Chhabra et al. 2024) that focus on tasks where pre-trained models already perform well, we develop a set of mathematical tasks where fine-tuning yields substantial performance gains, bringing the setup closer to real-world scenarios. In our experiments, we identify circuits at various checkpoints during fine-tuning and examine the interplay between circuit analysis, fine-tuning methods, and task complexities. First, we find that while circuits maintain high node similarity before and after fine-tuning, their edges undergo significant changes, contrasting with previous work (Prakash et al. 2024, Chhabra et al. 2024) that reported only small circuit additions after fine-tuning. Based on these observations, we develop a circuit-aware Low-Rank Adaptation (LoRA) method that assigns ranks to layers according to edge changes in the circuits. Experimental results demonstrate that our circuit-based LoRA achieves an average improvement of 2.46% over standard LoRA with comparable parameter sizes. Furthermore, we explore how combining circuits from subtasks can enhance fine-tuning in compositional tasks, offering new insights into task design and deepening our understanding of circuit dynamics and fine-tuning mechanisms.
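The "assign ranks to layers according to edge changes" idea can be sketched as a simple budget allocation: layers whose circuit edges changed more during fine-tuning get a larger LoRA rank. The scores and proportional-rounding rule below are purely hypothetical illustrations, not the paper's actual scoring or allocation procedure.

```python
def allocate_ranks(edge_change, total_rank):
    """Distribute a LoRA rank budget across layers in proportion to how
    much each layer's circuit edges changed (toy heuristic; rounding can
    make the assigned ranks sum to slightly more or less than the budget)."""
    total = sum(edge_change.values())
    return {layer: max(1, round(total_rank * score / total))
            for layer, score in edge_change.items()}

# Hypothetical per-layer edge-change scores (not taken from the paper).
edge_change = {"layer0": 0.1, "layer1": 0.6, "layer2": 0.3}
print(allocate_ranks(edge_change, total_rank=16))
# → {'layer0': 2, 'layer1': 10, 'layer2': 5}
```

Each layer's LoRA adapter would then be instantiated with its assigned rank, concentrating trainable parameters where the circuit analysis says fine-tuning actually rewires the model.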
Paperid:3160
Authors:Biswadeep Chakraborty · Harshit Kumar · Saibal Mukhopadhyay
Abstract:
Graph Neural Networks (GNNs) face a critical limitation known as oversmoothing, where increasing network depth leads to homogenized node representations, severely compromising their expressiveness. We present a novel dynamical systems perspective on this challenge, revealing oversmoothing as an emergent property of GNNs' convergence to low-dimensional attractor states. Based on this insight, we introduce DYNAMO-GAT, which combines noise-driven covariance analysis with Anti-Hebbian learning to dynamically prune attention weights, effectively preserving distinct attractor states. We provide theoretical guarantees for DYNAMO-GAT's effectiveness and demonstrate its superior performance on benchmark datasets, consistently outperforming existing methods while requiring fewer computational resources. This work establishes a fundamental connection between dynamical systems theory and GNN behavior, providing both theoretical insights and practical solutions for deep graph learning.
Paperid:3161
Authors:Shuo Wang · Bokui Wang · Zhixiang Shen · Boyan Deng · zhao kang
Abstract:
Recent advances in CV and NLP have inspired researchers to develop general-purpose graph foundation models through pre-training across diverse domains. However, a fundamental challenge arises from the substantial differences in graph topologies across domains. Additionally, real-world graphs are often sparse and prone to noisy connections and adversarial attacks. To address these issues, we propose the Multi-Domain Graph Foundation Model (MDGFM), a unified framework that aligns and leverages cross-domain topological information to facilitate robust knowledge transfer. MDGFM bridges different domains by adaptively balancing features and topology while refining original graphs to eliminate noise and align topological structures. To further enhance knowledge transfer, we introduce an efficient prompt-tuning approach. By aligning topologies, MDGFM not only improves multi-domain pre-training but also enables robust knowledge transfer to unseen domains. Theoretical analyses provide guarantees of MDGFM's effectiveness and domain generalization capabilities. Extensive experiments on both homophilic and heterophilic graph datasets validate the robustness and efficacy of our method.
Paperid:3162
Authors:Arman Sharifi Kolarijani · Tolga Ok · Peyman Mohajerin Esfahani · Mohamad Amin Sharifi Kolarijani
Abstract:
In this paper, we provide a novel algorithm for solving planning and learning problems of Markov decision processes. The proposed algorithm follows a policy iteration-type update by using a rank-one approximation of the transition probability matrix in the policy evaluation step. This rank-one approximation is closely related to the stationary distribution of the corresponding transition probability matrix, which is approximated using the power method. We provide theoretical guarantees for the convergence of the proposed algorithm to the optimal (action-)value function with the same rate and computational complexity as the value iteration algorithm in the planning problem and as the Q-learning algorithm in the learning problem. Through our extensive numerical simulations, we show that the proposed algorithm consistently outperforms first-order algorithms and their accelerated versions for both planning and learning problems.
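The core idea lends itself to a toy sketch. Assuming (purely for illustration, not the paper's algorithm) that the transition matrix $P$ is replaced by the rank-one matrix $\mathbf{1}\mu^\top$ with $\mu$ its stationary distribution from the power method, the policy evaluation fixed point $V = r + \gamma \hat P V$ has a closed form:

```python
# Illustrative sketch: rank-one policy evaluation for a fixed policy.
# Approximate P by ones * mu^T, where mu is the stationary distribution of P
# found by the power method. Then V = r + gamma * P_hat @ V solves in closed
# form: V_i = r_i + gamma/(1-gamma) * (mu . r). Toy values throughout.

def power_method_stationary(P, iters=200):
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]  # mu <- mu P
        s = sum(mu)
        mu = [m / s for m in mu]
    return mu

def rank_one_policy_evaluation(P, r, gamma):
    mu = power_method_stationary(P)
    avg_r = sum(m * ri for m, ri in zip(mu, r))
    # with P_hat = ones * mu^T, the Bellman fixed point is available directly
    return [ri + gamma / (1.0 - gamma) * avg_r for ri in r]

P = [[0.9, 0.1], [0.2, 0.8]]  # toy 2-state transition matrix for a fixed policy
r = [1.0, 0.0]
V = rank_one_policy_evaluation(P, r, gamma=0.9)
```

For this chain the stationary distribution is $(2/3, 1/3)$, so the sketch returns $V = (7, 6)$; the per-iteration cost is dominated by the power-method matrix-vector products rather than a full linear solve.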
Paperid:3163
Authors:Mingde Zhao · Tristan Sylvain · Romain Laroche · Doina Precup · Yoshua Bengio
Abstract:
Generative models can be used in planning to propose targets corresponding to states or observations that agents deem either likely or advantageous to experience. However, agents can struggle with hallucinated, infeasible targets proposed by the models, leading to delusional planning behaviors, which raises safety concerns. Drawing inspiration from the human brain, we propose to reject these hallucinated targets with an add-on target evaluator. Without proper training, however, the evaluator can produce delusional estimates, rendering it futile. We propose to address this via a combination of learning rule, architecture, and two novel hindsight relabeling strategies, which leads to correct evaluations of infeasible targets. Our experiments confirm that our approach significantly reduces delusional behaviors and enhances the performance of planning agents.
Paperid:3164
Authors:Jiachang Liu · Soroosh Shafiee · Andrea Lodi
Abstract:
This paper investigates the problem of certifying optimality for sparse generalized linear models (GLMs), where sparsity is enforced through an $\ell_0$ cardinality constraint. While branch-and-bound (BnB) frameworks can certify optimality by pruning nodes using dual bounds, existing methods for computing these bounds are either computationally intensive or exhibit slow convergence, limiting their scalability to large-scale problems. To address this challenge, we propose a first-order proximal gradient algorithm designed to solve the perspective relaxation of the problem within a BnB framework. Specifically, we formulate the relaxed problem as a composite optimization problem and demonstrate that the proximal operator of the non-smooth component can be computed in log-linear time complexity, eliminating the need to solve a computationally expensive second-order cone program. Furthermore, we introduce a simple restart strategy that enhances convergence speed while maintaining low per-iteration complexity. Extensive experiments on synthetic and real-world datasets show that our approach significantly accelerates dual bound computations and is highly effective in providing optimality certificates for large-scale problems.
Paperid:3165
Authors:Koen Minartz · Tim d'Hondt · Leon Hillmann · Jörn Starruß · Lutz Brusch · Vlado Menkovski
Abstract:
The cellular Potts model (CPM) is a powerful computational method for simulating collective spatiotemporal dynamics of biological cells. To drive the dynamics, CPMs rely on physics-inspired Hamiltonians. However, as first principles remain elusive in biology, these Hamiltonians only approximate the full complexity of real multicellular systems. To address this limitation, we propose NeuralCPM, a more expressive cellular Potts model that can be trained directly on observational data. At the core of NeuralCPM lies the Neural Hamiltonian, a neural network architecture that respects universal symmetries in collective cellular dynamics. Moreover, this approach enables seamless integration of domain knowledge by combining known biological mechanisms and the expressive Neural Hamiltonian into a hybrid model. Our evaluation with synthetic and real-world multicellular systems demonstrates that NeuralCPM is able to model cellular dynamics that cannot be accounted for by traditional analytical Hamiltonians.
Paperid:3166
Authors:Sida Li · Nikolaos Ignatiadis
Abstract:
Prediction-Powered Inference (PPI) is a powerful framework for enhancing statistical estimates by combining limited gold-standard data with machine learning (ML) predictions. While prior work has demonstrated PPI’s benefits for individual statistical tasks, modern applications require answering numerous parallel statistical questions. We introduce Prediction-Powered Adaptive Shrinkage ($\texttt{PAS}$), a method that bridges PPI with empirical Bayes shrinkage to improve estimation of multiple means. $\texttt{PAS}$ debiases noisy ML predictions $\textit{within}$ each task and then borrows strength $\textit{across}$ tasks by using those same predictions as a reference point for shrinkage. The amount of shrinkage is determined by minimizing an unbiased estimate of risk, and we prove that this tuning strategy is asymptotically optimal. Experiments on both synthetic and real-world datasets show that $\texttt{PAS}$ adapts to the reliability of the ML predictions and outperforms traditional and modern baselines in large-scale applications.
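The shrink-toward-predictions idea with risk-based tuning can be illustrated by a simplified Gaussian sketch. This assumes known, equal variances and uses a plain SURE-tuned linear shrinker; the paper's actual estimator and tuning are more general:

```python
# Toy sketch (not the PAS estimator itself): shrink noisy per-task estimates y
# toward ML predictions f, with the shrinkage amount chosen by minimizing
# Stein's unbiased risk estimate. Assumes Gaussian noise with known common
# variance sigma2 -- a deliberate simplification.

def shrink_toward_predictions(y, f, sigma2):
    n = len(y)
    resid2 = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    # theta_c = f + c*(y - f); SURE(c) = (1-c)^2 ||y-f||^2 + 2 n sigma2 c - n sigma2
    # is minimized at c = 1 - n*sigma2 / ||y-f||^2, clipped to [0, 1]
    c = max(0.0, min(1.0, 1.0 - n * sigma2 / resid2))
    return [fi + c * (yi - fi) for yi, fi in zip(y, f)]

y = [1.0, 2.0, 3.0, 4.0]   # noisy gold-standard estimates, one per task
f = [0.0, 0.0, 0.0, 0.0]   # ML predictions used as the shrinkage target
theta = shrink_toward_predictions(y, f, sigma2=1.0)
```

When the predictions track the true means well, the residuals shrink, $c$ drops toward 0, and the estimates lean on the ML predictions; when the predictions are unreliable, $c$ approaches 1 and the gold-standard estimates dominate.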
Paperid:3167
Authors:Peter Eckmann · Dongxia Wu · Germano Heinzelmann · Michael Gilson · Rose Yu
Abstract:
Current generative models for drug discovery primarily use molecular docking as an oracle to guide the generation of active compounds. However, such models are often not useful in practice because even compounds with high docking scores do not consistently show experimental activity. More accurate methods for activity prediction exist, such as molecular dynamics-based binding free energy calculations, but they are too computationally expensive to use in a generative model. To address this challenge, we propose Multi-Fidelity Latent space Active Learning (MF-LAL), a generative modeling framework that integrates a set of oracles with varying cost-accuracy tradeoffs. We train a surrogate model for each oracle and use these surrogates to guide generation of compounds with high predicted activity. Unlike previous approaches that separately learn the surrogate model and generative model, MF-LAL combines the generative and multi-fidelity surrogate models into a single framework, allowing for more accurate activity prediction and higher quality samples. We train MF-LAL with a novel active learning algorithm to further reduce computational cost. Our experiments on two disease-relevant proteins show that MF-LAL produces compounds with significantly better binding free energy scores than other single and multi-fidelity approaches (~50% improvement in mean binding free energy score).
Paperid:3168
Authors:Kosuke Nakanishi · Akihiro Kubo · Yuji Yasui · Shin Ishii
Abstract:
Recently, robust reinforcement learning (RL) methods designed to handle adversarial input observations have received significant attention, motivated by RL's inherent vulnerabilities. While existing approaches have demonstrated reasonable success, addressing worst-case scenarios over long time horizons requires both minimizing the agent's cumulative rewards for adversaries and training agents to counteract them through alternating learning. However, this process introduces mutual dependencies between the agent and the adversary, making interactions with the environment inefficient and hindering the development of off-policy methods. In this work, we propose a novel off-policy method that eliminates the need for additional environmental interactions by reformulating adversarial learning as a soft-constrained optimization problem. Our approach is theoretically supported by the symmetric property of policy evaluation between the agent and adversary.
Paperid:3169
Authors:Zhoufan Zhu · Ke Zhu
Abstract:
For researchers and practitioners in finance, finding synergistic formulaic alphas is very important but challenging. In this paper, we reconsider the discovery of synergistic formulaic alphas from the viewpoint of sequential decision-making, and conceptualize the entire alpha discovery process as a non-stationary and reward-sparse Markov decision process. To overcome the challenges of non-stationarity and reward-sparsity, we propose the AlphaQCM method, a novel distributional reinforcement learning method designed to search for synergistic formulaic alphas efficiently. The AlphaQCM method first learns the Q function and quantiles via a Q network and a quantile network, respectively. Then, the AlphaQCM method applies the quantiled conditional moment method to learn unbiased variance from the potentially biased quantiles. Guided by the learned Q function and variance, the AlphaQCM method navigates the non-stationarity and reward-sparsity to explore the vast search space of formulaic alphas with high efficacy. Empirical applications to real-world datasets demonstrate that our AlphaQCM method significantly outperforms its competitors, particularly when dealing with large datasets comprising numerous stocks.
Paperid:3170
Authors:Anastasiia Koloskova · Youssef Allouah · Animesh Jha · Rachid Guerraoui · Sanmi Koyejo
Abstract:
We address the problem of machine unlearning, where the goal is to remove the influence of specific training data from a model upon request, motivated by privacy concerns and regulatory requirements such as the “right to be forgotten.” Unfortunately, existing methods rely on restrictive assumptions or lack formal guarantees. To this end, we propose a novel method for certified machine unlearning, leveraging the connection between unlearning and privacy amplification by stochastic post-processing. Our method uses noisy fine-tuning on the retain data, i.e., data that does not need to be removed, to ensure provable unlearning guarantees. This approach requires no assumptions about the underlying loss function, making it broadly applicable across diverse settings. We analyze the theoretical trade-offs in efficiency and accuracy and demonstrate empirically that our method not only achieves formal unlearning guarantees but also performs effectively in practice, outperforming existing baselines.
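The noisy fine-tuning recipe is easy to sketch on a toy model. The example below is purely illustrative: the model is a one-dimensional least-squares fit, and the noise scale is not calibrated to any formal unlearning guarantee, which in the paper requires specific noise calibration.

```python
import random

# Toy sketch of unlearning by noisy fine-tuning on retain data only.
# Model: a scalar w minimizing mean squared distance to the data points.
# The per-step Gaussian noise is what enables privacy-amplification-style
# arguments; its scale here is illustrative, not certified.

random.seed(0)

full_data   = [1.0, 2.0, 3.0, 10.0]   # 10.0 is the point to be forgotten
retain_data = [1.0, 2.0, 3.0]

w = sum(full_data) / len(full_data)   # model "trained" on all data (its minimizer)

lr, sigma, steps = 0.1, 0.01, 500
for _ in range(steps):
    grad = 2.0 * sum(w - x for x in retain_data) / len(retain_data)
    w = w - lr * grad + random.gauss(0.0, sigma)   # noisy gradient step

retain_minimizer = sum(retain_data) / len(retain_data)  # 2.0
```

After noisy fine-tuning, the parameter sits near the retain-only minimizer, and the injected noise ensures the final model does not deterministically encode the forgotten point.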
Paperid:3171
Authors:Hjalmar Wijk · Tao Lin · Joel Becker · Sami Jawhar · Neev Parikh · Lawrence Chan · Thomas Broadley · Michael Chen · Joshua Clymer · Jai Dhyani · Elena Ericheva · Katharyn Garcia · Brian Goodrich · Nikola Jurkovic · Megan Kinniment · Aron Lajko · Seraphina Nix · Lucas Jun Koba Sato · William Saunders · Maksym Taran · Ben West · Elizabeth Barnes
Abstract:
Frontier AI safety policies highlight automation of AI research and development (R&D) by AI agents as an important capability to anticipate. However, there exist few evaluations for AI R&D capabilities, and none that are highly realistic and have a direct comparison to human performance. We introduce RE-Bench (Research Engineering Benchmark, V1), which consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 distinct human experts. We confirm that our experts make progress in the environments given 8 hours, with 82% of expert attempts achieving a non-zero score and 24% matching or exceeding our strong reference solutions. We compare humans to several public frontier models through best-of-$k$ with varying time budgets and agent designs, and find that the best AI agents achieve a score 4× higher than human experts when both are given a total time budget of 2 hours per environment. However, humans currently display better returns to increasing time budgets, narrowly exceeding the top AI agent scores given an 8-hour budget, and achieving 2× the score of the top AI agent when both are given 32 total hours (across different attempts).
Paperid:3172
Authors:Quan Xiao · Hui Yuan · A F M Saif · Gaowen Liu · Ramana Kompella · Mengdi Wang · Tianyi Chen
Abstract:
Diffusion models, which iteratively denoise data samples to synthesize high-quality outputs, have achieved empirical success across domains. However, optimizing these models for downstream tasks often involves nested bilevel structures—such as tuning hyperparameters for fine-tuning tasks or noise schedules in training dynamics—where traditional bilevel methods fail due to the infinite-dimensional probability space and prohibitive sampling costs. We formalize this challenge as a generative bilevel optimization problem and address two key scenarios: (1) fine-tuning pre-trained models via an inference-only lower-level solver paired with a sample-efficient gradient estimator for the upper level, and (2) training a diffusion model from scratch with noise schedule optimization by reparameterizing the lower-level problem and designing a computationally tractable gradient estimator. Our first-order bilevel framework overcomes the incompatibility of conventional bilevel methods with diffusion processes, offering theoretical grounding and computational practicality. Experiments demonstrate that our method outperforms existing fine-tuning and hyperparameter search baselines.
Paperid:3173
Authors:Thomas Paniagua · Chinmay Savadikar · Tianfu Wu
Abstract:
White-box targeted adversarial attacks expose fundamental vulnerabilities of Deep Neural Networks (DNNs) across computer vision and beyond. However, two critical challenges remain unresolved: (i) How many target classes can be attacked simultaneously in a specified order, known as the \textit{ordered top-$K$ attack} problem ($K \geq 1$)? (ii) What are the resulting ordered top-$K$ adversarial perturbations for a given benign image? In this paper, we provide a simple yet elegant solution to both challenges by establishing that **ordered top-$K$ adversarial perturbations can be expressed as linear combinations of the right singular vectors (corresponding to non-zero singular values) of the attack-targets-ranking constrained logit-to-image Jacobian matrix**. These right singular vectors form an orthogonal basis in the image space, spanning the most informative subspace for learning adversarial perturbations. We propose \textbf{RisingAttacK}, a novel method within the Sequential Quadratic Programming framework, which efficiently exploits this structure to optimize adversarial perturbations. We propose a holistic figure-of-merit (FoM) evaluation metric based on attack success rates (ASRs) and $\ell_p$-norms of perturbations. Our RisingAttacK significantly outperforms the previous state-of-the-art approach, QuadAttacK, consistently across all top-$K$ (1, 5, 10, 20, 25) and four models (ResNet-50, DenseNet-121, ViT-B and DEiT-S) on ImageNet-1k in experiments.
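The singular-vector characterization can be seen concretely on a toy linear "model", where the Jacobian is just the weight matrix. This is an illustrative sketch of the subspace structure only, not the RisingAttacK optimization itself; the model, dimensions, and combination weights are all assumed.

```python
import numpy as np

# Illustrative sketch (not RisingAttacK): for a tiny linear model f(x) = W x,
# the logit-to-input Jacobian is W itself. Perturbations built from the right
# singular vectors of W (non-zero singular values) span the only directions
# that change the logits to first order.

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 8))        # 3 logits, 8 input dims; Jacobian of f
U, s, Vt = np.linalg.svd(W, full_matrices=False)
V = Vt.T                               # columns: orthonormal basis of the row space

coeffs = np.array([0.5, -0.2, 0.1])    # hypothetical combination weights
delta = V @ coeffs                     # perturbation in the informative subspace

# a direction from the null space leaves the logits unchanged to first order
Vt_full = np.linalg.svd(W, full_matrices=True)[2]
null_dir = Vt_full[3:].T @ rng.standard_normal(5)
```

The contrast between `delta` (which moves the logits) and `null_dir` (which does not) is exactly why restricting the search to the right-singular-vector subspace loses nothing while shrinking the optimization problem.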
Paperid:3174
Authors:Yuxuan Zhu · Dylan Bowman · Akul Gupta · Adarsh Danda · Antony Kellermann · Avi Dhir · Conner Jensen · Eric Ihli · Jason Benn · Jet Geronimo · Kaicheng Yu · Philip Li · Richard Fang · Sudhit Rao · Twm Stone · Daniel Kang
Abstract:
Large language model (LLM) agents are increasingly capable of autonomously conducting cyberattacks, posing significant threats to existing applications. This growing risk highlights the urgent need for a real-world benchmark to evaluate the ability of LLM agents to exploit web application vulnerabilities. However, existing benchmarks fall short as they are limited to abstracted Capture the Flag competitions or lack comprehensive coverage. Building a benchmark for real-world vulnerabilities involves both specialized expertise to reproduce exploits and a systematic approach to evaluating unpredictable threats. To address this challenge, we introduce CVE-Bench, a real-world cybersecurity benchmark based on critical-severity Common Vulnerabilities and Exposures. In CVE-Bench, we design a sandbox framework that enables LLM agents to exploit vulnerable web applications in scenarios that mimic real-world conditions, while also providing effective evaluation of their exploits. Our evaluation shows that the state-of-the-art agent framework can resolve up to 13% or 25% of vulnerabilities, with or without vulnerability descriptions.
Paperid:3175
Authors:Rui Yang · Lin Song · Yicheng Xiao · Runhui Huang · Yixiao Ge · Ying Shan · Hengshuang Zhao
Abstract:
Recent advancements in large language models (LLMs) have significantly propelled the development of large multimodal models (LMMs), highlighting the potential for general and intelligent assistants. However, most LMMs model visual and textual modalities separately, leading to recent efforts to develop native LMMs using a single transformer. Despite the promise, these native models are resource-intensive and often exhibit performance gaps compared to their compositional counterparts. To alleviate this issue, we propose a simple yet efficient method to construct a baseline for the native and end-to-end large multi-modal model in a single transformer. First, we propose a new early-fusion LMM that can fuse multi-modal inputs in the early stage and respond to visual instructions in an auto-regressive manner. Second, we devise an efficient training recipe for the proposed model, which harnesses the prior knowledge of the pre-trained models, addressing both the performance limitations and the challenge of resource consumption. The proposed model demonstrates superior performance compared to other LMMs using one transformer and significantly narrows the performance gap with compositional LMMs.
Paperid:3176
Authors:Saman Forouzandeh · Parham Moradi · Mahdi Jalili
Abstract:
This paper proposes SHARP-Distill (\textbf{S}peedy \textbf{H}ypergraph \textbf{A}nd \textbf{R}eview-based \textbf{P}ersonalised \textbf{Distill}ation), a novel knowledge distillation approach based on the teacher-student framework that combines Hypergraph Neural Networks (HGNNs) with language models to enhance recommendation quality while significantly improving inference time. The teacher model leverages HGNNs to generate user and item embeddings from interaction data, capturing high-order and group relationships, and employs a pre-trained language model to extract rich semantic features from textual reviews. We utilize a contrastive learning mechanism to ensure structural consistency between various representations. The student model includes a shallow and lightweight GCN called CompactGCN, designed to inherit high-order relationships while reducing computational complexity. Extensive experiments on real-world datasets demonstrate that SHARP-Distill achieves up to 68× faster inference time compared to HGNN and 40× faster than LightGCN while maintaining competitive recommendation accuracy.
Paperid:3177
Authors:Anjiang Wei · Allen Nie · Thiago Teixeira · Rohan Yadav · Wonchan Lee · Ke Wang · Alex Aiken
Abstract:
Modern scientific discovery increasingly relies on high-performance computing for complex modeling and simulation. A key challenge in improving parallel program performance is efficiently mapping tasks to processors and data to memory, a process dictated by intricate, low-level system code known as mappers. Developing high-performance mappers demands days of manual tuning, posing a significant barrier for domain scientists without systems expertise. We introduce a framework that automates mapper development with generative optimization, leveraging richer feedback beyond scalar performance metrics. Our approach features the Agent-System Interface, which includes a Domain-Specific Language (DSL) to abstract away low-level complexity of system code and define a structured search space, as well as AutoGuide, a mechanism that interprets raw execution output into actionable feedback. Unlike traditional reinforcement learning methods such as OpenTuner, which rely solely on scalar feedback, our method finds superior mappers in far fewer iterations. With just 10 iterations, it outperforms OpenTuner even after 1000 iterations, achieving $3.8\times$ faster performance. Our approach finds mappers that surpass expert-written mappers by up to $1.34\times$ speedup across nine benchmarks while reducing tuning time from days to minutes.
Paperid:3178
Authors:Loek van Rossem · Andrew Saxe
Abstract:
Even when massively overparameterized, deep neural networks show a remarkable ability to generalize. Research on this phenomenon has focused on generalization within distribution, via smooth interpolation. Yet in some settings neural networks also learn to extrapolate to data far beyond the bounds of the original training set, sometimes even allowing for infinite generalization, implying that an algorithm capable of solving the task has been learned. Here we undertake a case study of the learning dynamics of recurrent neural networks trained on the streaming parity task in order to develop an effective theory of algorithm development. The streaming parity task is a simple but nonlinear task defined on sequences up to arbitrary length. We show that, with sufficient finite training experience, RNNs exhibit a phase transition to perfect infinite generalization. Using an effective theory for the representational dynamics, we find an implicit representational merger effect which can be interpreted as the construction of a finite automaton that reproduces the task. Overall, our results disclose one mechanism by which neural networks can generalize infinitely from finite training experience.
Paperid:3179
Authors:Shiqing Gao · Yihang Zhou · Shuai Shao · Haoyu Luo · Yiheng Bing · Jiaxin Ding · Luoyi Fu · Xinbing Wang
Abstract:
Ensuring safety is a critical challenge that limits the application of Reinforcement Learning (RL) in real-world scenarios. Constrained Reinforcement Learning (CRL) addresses this by maximizing expected returns while satisfying predefined constraints, typically formulated as the expected cumulative cost. However, expectation-based constraints overlook rare but high-impact extreme value events in the tail distribution, such as black swan incidents, which can lead to severe constraint violations. To address this issue, we propose the Extreme Value policy Optimization (EVO) algorithm, leveraging Extreme Value Theory (EVT) to model and exploit extreme reward and cost samples, reducing constraint violations. EVO introduces an extreme quantile optimization objective to explicitly capture extreme samples in the cost tail distribution. Additionally, we propose an extreme prioritization mechanism during replay, amplifying the learning signal from rare but high-impact extreme samples. Theoretically, we establish upper bounds on expected constraint violations during policy updates, guaranteeing strict constraint satisfaction at a zero-violation quantile level. Further, we demonstrate that EVO achieves a lower probability of constraint violations than expectation-based methods and exhibits lower variance than quantile regression methods. Extensive experiments show that EVO significantly reduces constraint violations during training while maintaining competitive policy performance compared to baselines.
Paperid:3180
Authors:Shashwat Goel · Joschka Strüber · Ilze Amanda Auzina · Karuna Chandra · Ponnurangam Kumaraguru · Douwe Kiela · Ameya Pandurang Prabhu · Matthias Bethge · Jonas Geiping
Abstract:
As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both these tasks, which we refer to as AI Oversight. We study how model similarity affects both aspects of AI oversight by proposing a probabilistic metric for LM similarity based on overlap in model mistakes. Using this metric, we first show that LLM-as-a-judge scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations, and find complementary knowledge between the weak supervisor and strong student model plays a crucial role in gains from "weak-to-strong generalization". As model capabilities increase, it becomes harder to find their mistakes, and we might defer more to AI oversight. However, we observe a concerning trend -- model mistakes are becoming more similar with increasing capabilities, pointing to risks from correlated failures. Our work underscores the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.
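One simple way to operationalize "overlap in model mistakes" is a chance-adjusted joint-error rate. The score below is an illustrative simplification, not necessarily the paper's exact probabilistic metric:

```python
# Illustrative chance-adjusted error-overlap score between two models, given
# per-example correctness flags (True = correct). It compares the observed
# joint-error rate to the rate expected if the two models erred independently,
# normalized by the maximum possible overlap. A simplification for exposition,
# not necessarily the paper's exact metric.

def error_overlap_score(correct_a, correct_b):
    n = len(correct_a)
    err_a = sum(not c for c in correct_a) / n
    err_b = sum(not c for c in correct_b) / n
    both = sum((not a) and (not b) for a, b in zip(correct_a, correct_b)) / n
    expected = err_a * err_b          # joint-error rate under independence
    max_joint = min(err_a, err_b)     # ceiling on the joint-error rate
    if max_joint == expected:         # degenerate case: no room above chance
        return 0.0
    return (both - expected) / (max_joint - expected)

a = [True, True, False, False, True, False]
b = [True, False, False, False, True, True]
score = error_overlap_score(a, b)
```

Two models with identical mistakes score 1, while models whose errors co-occur no more often than chance score 0, so rising scores across model generations would signal the correlated-failure risk the abstract warns about.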
Paperid:3181
Authors:Su Jia · Peter Frazier · Nathan Kallus
Abstract:
Experimentation with interference poses a significant challenge in contemporary online platforms. Prior research on experimentation with interference has concentrated on the final output of a policy. Cumulative performance, while equally important, is less well understood. To address this gap, we introduce the problem of Multi-armed Bandits with Interference (MABI), where the learner assigns an arm to each of $N$ experimental units over $T$ rounds. The reward of each unit in each round depends on the treatments of all units, where the interference between two units decays in their distance. The reward functions are chosen by an adversary and may vary arbitrarily over time and across different units. We first show that the optimal expected regret (against the best fixed-arm policy) is $\tilde O(\sqrt T)$, and can be achieved by a switchback policy. However, the regret (as a random variable) for any switchback policy suffers a high variance, since it does not account for $N$. We propose a policy based on a novel clustered randomization scheme, whose regret (i) is optimal in expectation and (ii) admits a high probability bound that vanishes in $N$.
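The switchback design referenced above is simple to sketch: time is cut into blocks, and within each block every unit receives the same randomly drawn arm, holding cross-unit interference constant within a block. Block length and the arm set below are illustrative choices, not values from the paper.

```python
import random

# Sketch of a switchback policy for N units over T rounds: a new arm is drawn
# at the start of each block and shared by all units for the block's duration.
# Block length and number of arms are illustrative.

def switchback_assignments(n_units, n_rounds, n_arms, block_len, seed=0):
    rng = random.Random(seed)
    schedule = []
    for t in range(n_rounds):
        if t % block_len == 0:               # new block: redraw the shared arm
            arm = rng.randrange(n_arms)
        schedule.append([arm] * n_units)     # every unit gets the block's arm
    return schedule

sched = switchback_assignments(n_units=5, n_rounds=12, n_arms=3, block_len=4)
```

Because all units always share one arm, the design never exploits the population size $N$, which is the variance drawback that motivates the clustered randomization scheme.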
Paperid:3182
Authors:Rickard Gabrielsson · Jiacheng Zhu · Onkar Bhardwaj · Leshem Choshen · Kristjan Greenewald · Mikhail Yurochkin · Justin Solomon
Abstract:
Fine-tuning large language models (LLMs) with low-rank adaptations (LoRAs) has become common practice, often yielding numerous copies of the same LLM differing only in their LoRA updates. This paradigm presents challenges for systems that serve real-time responses to queries that each involve a different LoRA. Prior works optimize the design of such systems but still require continuous loading and offloading of LoRAs, as it is infeasible to store thousands of LoRAs in GPU memory. To mitigate this issue, we investigate the efficacy of compression when serving LoRAs. We propose a method for the joint compression of LoRAs into a shared basis paired with LoRA-specific scaling matrices. We extend our algorithm to learn clusters of LoRAs that are amenable to joint compression, allowing it to scale gracefully to large LoRA collections. Our experiments with up to 1000 LoRAs demonstrate that compressed LoRAs preserve performance while offering major throughput gains in realistic serving scenarios with over a thousand LoRAs, maintaining 80\% of the throughput of serving a single LoRA.
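The shared-basis idea can be sketched with a tiny numpy example. Dimensions and the factorization below are illustrative simplifications; the paper's method additionally learns which LoRAs to cluster together:

```python
import numpy as np

# Sketch of jointly compressing several LoRA updates into one shared basis
# plus per-LoRA coefficient matrices. Here the updates share a common column
# space by construction, so a rank-r shared basis reconstructs them exactly;
# real LoRA collections only approximately share structure.

rng = np.random.default_rng(0)
d, k, r, n_loras = 32, 16, 4, 10

U_true = rng.standard_normal((d, r))
deltas = [U_true @ rng.standard_normal((r, k)) for _ in range(n_loras)]

stacked = np.hstack(deltas)                 # d x (k * n_loras)
U_shared = np.linalg.svd(stacked, full_matrices=False)[0][:, :r]  # shared basis

# per-LoRA coefficients: project each update onto the shared basis
coeffs = [U_shared.T @ dW for dW in deltas]
recons = [U_shared @ C for C in coeffs]
```

Serving then only needs the single $d \times r$ basis resident in memory plus small per-LoRA coefficient matrices, instead of one full update matrix per LoRA.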
Paperid:3183
Authors:Magzhan Gabidolla · Miguel Carreira-Perpinan
Abstract:
We explore ensembles of axis-aligned decision stumps, which can be viewed as a generalized additive model (GAM). In this model, stumps utilizing the same feature are grouped to form a shape function for that feature. Instead of relying on boosting or bagging, we employ alternating optimization to learn a fixed-size stump forest. We optimize the parameters of each stump exactly through enumeration, given the other stumps are fixed. For fixed stump splits, the leaf values are optimized jointly by solving a convex problem. To address the overfitting issue inherent in naive optimization of stump forests, we propose effective regularization techniques. Our regularized stump forests achieve accuracy comparable to state-of-the-art GAM methods while using fewer parameters. This work is the first to successfully learn stump forests without employing traditional ensembling techniques like bagging or boosting.
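The "optimize each stump exactly through enumeration" step reduces, for squared error, to scanning candidate thresholds, since the optimal leaf values at a fixed split are the target means on each side. A minimal single-feature sketch (data and names illustrative, not the paper's full alternating procedure):

```python
# Minimal sketch of fitting one regression stump exactly by enumeration:
# for each candidate threshold on a feature, the optimal leaf values are the
# target means on each side, so the best stump is found by a full scan.

def fit_stump(x, y):
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs, ys = [x[i] for i in order], [y[i] for i in order]
    best = None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                         # no valid threshold between ties
        thr = 0.5 * (xs[i] + xs[i - 1])
        left, right = ys[:i], ys[i:]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((v - lmean) ** 2 for v in left)
               + sum((v - rmean) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    return best  # (squared error, threshold, left leaf value, right leaf value)

sse, thr, left_val, right_val = fit_stump([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 1.0, 1.0])
```

In the alternating scheme described above, this exact per-stump solve would be applied to residual targets with the other stumps held fixed.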
Paperid:3184
Authors:Jiahao Zhu · Kang You · Dandan Ding · Zhan Ma
Abstract:
Reflectance attributes in LiDAR point clouds provide essential information for downstream tasks but remain underexplored in neural compression methods. To address this, we introduce SerLiC, a serialization-based neural compression framework to fully exploit the intrinsic characteristics of LiDAR reflectance. SerLiC first transforms 3D LiDAR point clouds into 1D sequences via scan-order serialization, offering a device-centric perspective for reflectance analysis. Each point is then tokenized into a contextual representation comprising its sensor scanning index, radial distance, and prior reflectance, for effective dependency exploration. For efficient sequential modeling, Mamba is incorporated with a dual parallelization scheme, enabling simultaneous autoregressive dependency capture and fast processing. Extensive experiments demonstrate that SerLiC attains over 2$\times$ volume reduction against the original reflectance data, outperforming the state-of-the-art method by up to 22\% reduction of compressed bits while using only 2\% of its parameters. Moreover, a lightweight version of SerLiC achieves $\geq 10$ fps (frames per second) with just 111K parameters, which is attractive for real applications.
Paperid:3185
Authors:Jianliang He · Xintian Pan · Siyu Chen · Zhuoran Yang
Abstract:
We study how multi-head softmax attention networks are trained to perform in-context learning with linear data. Through a blend of theoretical insights and empirical investigations, we reveal the emergence of elegant attention patterns, such as diagonal query-key structures and last-entry value-output formations, which collectively emulate preconditioned gradient descent. By coupling positive and negative attention heads, these models achieve remarkable expressiveness, debiasing capabilities, and optimization efficiency. Furthermore, we analyze the training dynamics, showcasing how transformers converge toward near-optimal solutions in high-dimensional regimes, outperforming single-head models. Extensions to non-isotropic covariates and multi-task regression demonstrate the flexibility and robustness of this approach. Our findings illuminate the mathematical elegance and expressive potential of transformers, paving the way for deeper understanding and broader applications of in-context learning.
Paperid:3186
Authors:Tao Li · Zhengbao He · Yujun Li · Yasheng Wang · Lifeng Shang · Xiaolin Huang
Abstract:
Fine-tuning large-scale pre-trained models is prohibitively expensive in terms of computation and memory costs. Low-Rank Adaptation (LoRA), a popular Parameter-Efficient Fine-Tuning (PEFT) method, offers an efficient solution by optimizing only low-rank matrices. Despite recent progress in improving LoRA's performance, the relationship between the LoRA optimization space and the full parameter space is often overlooked. A solution that appears flat in the loss landscape of the LoRA space may still exhibit sharp directions in the full parameter space, potentially undermining generalization performance. In this paper, we introduce Flat-LoRA, which aims to identify a low-rank adaptation situated in a flat region of the full parameter space. Instead of adopting the well-established sharpness-aware minimization approach, which incurs significant computation and memory overheads, we employ a Bayesian expectation loss objective to preserve training efficiency. Further, we design a refined random perturbation generation strategy for improved performance and carefully manage memory overhead using random seeds. Experiments across diverse tasks—including mathematical reasoning, coding abilities, dialogue generation, instruction following, and text-to-image generation—demonstrate that Flat-LoRA improves both in-domain and out-of-domain generalization.
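For readers unfamiliar with the LoRA parameterization this abstract builds on, a minimal sketch of the low-rank update is below; the layer shapes, rank `r`, and scaling `alpha` are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Minimal sketch of the standard LoRA update (background for the abstract above).
# Shapes, rank r, and scaling alpha are illustrative, not taken from the paper.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 32, 4, 8

W = rng.standard_normal((d_out, d_in))      # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                    # zero-initialized so training starts at W

def lora_forward(x):
    """Forward pass with the merged weight W + (alpha / r) * B @ A."""
    return (W + (alpha / r) * B @ A) @ x

x = rng.standard_normal(d_in)
# With B = 0 the adapted model reproduces the frozen model exactly.
assert np.allclose(lora_forward(x), W @ x)
```

Only the small factors `A` and `B` (here `r * (d_in + d_out)` entries) are trained, which is the source of LoRA's memory savings.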
Paperid:3187
Authors:Yuhui Zhang · Yuchang Su · Chenyu Wang · Tianhong Li · Zoe Wefers · Jeffrey J. Nirschl · James Burgess · Daisy Yi Ding · Alejandro Lozano · Emma Lundberg · Serena Yeung
Abstract:
Building a virtual cell capable of accurately simulating cellular behaviors in silico has long been a dream in computational biology. We introduce CellFlow, an image-generative model that simulates cellular morphology changes induced by chemical and genetic perturbations using flow matching. Unlike prior methods, CellFlow models distribution-wise transformations from unperturbed to perturbed cell states, effectively distinguishing actual perturbation effects from experimental artifacts such as batch effects—a major challenge in biological data. Evaluated on chemical (BBBC021), genetic (RxRx1), and combined perturbation (JUMP) datasets, CellFlow generates biologically meaningful cell images that faithfully capture perturbation-specific morphological changes, achieving a 35% improvement in FID scores and a 12% increase in mode-of-action prediction accuracy over existing methods. Additionally, CellFlow enables continuous interpolation between cellular states, providing a potential tool for studying perturbation dynamics. These capabilities mark a significant step toward realizing virtual cell modeling for biomedical research.
Paperid:3188
Authors:Zhikang Chen · Abudukelimu Wuerkaixi · Sen Cui · Haoxuan Li · Ding Li · Jingfeng ZHANG · Bo Han · Gang Niu · Houfang Liu · Yi Yang · Sifan YANG · Changshui Zhang · Tianling Ren
Abstract:
Deep networks are prone to catastrophic forgetting during sequential task learning, i.e., losing the knowledge about old tasks upon learning new tasks. To this end, continual learning (CL) has emerged, whose existing methods focus mostly on regulating or protecting the parameters associated with the previous tasks. However, parameter protection is often impractical: the number of parameters needed to store old-task knowledge grows linearly with the number of tasks, and without such growth it is hard to preserve the parameters related to the old-task knowledge. In this work, we bring a neuroscience opinion to CL: in neural networks, the pathways matter more than the parameters when concerning the knowledge acquired from the old tasks. Following this opinion, we propose a novel CL framework, learning without isolation (LwI), where model fusion is formulated as graph matching and the pathways occupied by the old tasks are protected without being isolated. Thanks to the sparsity of activation channels in a deep network, LwI can adaptively allocate available pathways for a new task, realizing pathway protection and addressing catastrophic forgetting in a parameter-efficient manner. Experiments on popular benchmark datasets demonstrate the superiority of the proposed LwI.
Paperid:3189
Authors:Andy Dong · Wei-Ning Chen · Ayfer Ozgur
Abstract:
We study how inherent randomness in the training process—where each sample (or client in federated learning) contributes only to a randomly selected portion of training—can be leveraged for privacy amplification. This includes (1) data partitioning, where a sample participates in only a subset of training iterations, and (2) model partitioning, where a sample updates only a subset of the model parameters. We apply our framework to model parallelism in federated learning, where each client updates a randomly selected subnetwork to reduce memory and computational overhead, and show that existing methods, e.g. model splitting or dropout, provide a significant privacy amplification gain not captured by previous privacy analysis techniques. Additionally, we introduce balanced iteration subsampling, a new data partitioning method where each sample (or client) participates in a fixed number of training iterations. We show that in certain regimes, this method yields stronger privacy amplification than Poisson (i.i.d.) sampling of data (or clients). Our results demonstrate that randomness in the training process, which is structured rather than i.i.d. and interacts with data in complex ways, can be systematically leveraged for nontrivial privacy amplification.
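As background for the amplification results discussed above, the textbook amplification-by-subsampling bound for pure $\epsilon$-DP can be computed directly; this is the classical i.i.d.-sampling baseline, not the paper's analysis of structured participation.

```python
import math

def amplified_epsilon(eps, q):
    """Classical privacy amplification by subsampling for pure eps-DP:
    running an eps-DP mechanism on a q-subsampled dataset satisfies
    eps'-DP with eps' = log(1 + q * (exp(eps) - 1)). This is the
    textbook i.i.d. baseline; the paper's structured (non-i.i.d.)
    participation analysis goes beyond it."""
    return math.log(1.0 + q * math.expm1(eps))

# Smaller sampling rates q yield stronger amplification (smaller eps').
for q in (1.0, 0.5, 0.1):
    print(q, amplified_epsilon(1.0, q))
```

At `q = 1` the bound reduces to the original `eps`, and as `q` shrinks the effective privacy loss scales roughly linearly in `q`.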
Paperid:3190
Authors:Yash Kheshwani · Avishek Ghosh · Nikhil Karamchandani
Abstract:
This work investigates the problem of best arm identification for multi-agent multi-armed bandits. We consider $N$ agents grouped into $M$ clusters, where each cluster solves a stochastic bandit problem. The mapping between agents and bandits is \textit{a priori} unknown. Each bandit is associated with $K$ arms, and the goal is to identify the best arm for each agent under a $\delta$-probably correct ($\delta$-PC) framework, while minimizing sample complexity and communication overhead. We propose two novel algorithms: \emph{Clustering then Best Arm Identification} (\texttt{Cl-BAI}) and \emph{Best Arm Identification then Clustering} (\texttt{BAI-Cl}). \texttt{Cl-BAI} employs a two-phase approach that first clusters agents based on the bandit problems they are learning, followed by identifying the best arm for each cluster. \texttt{BAI-Cl} reverses the sequence by identifying the best arms first and then clustering agents accordingly. Both algorithms exploit the successive elimination framework to ensure computational efficiency and high accuracy. Theoretical analysis establishes $\delta$-PC guarantees for both methods, derives bounds on their sample complexity, and provides a lower bound for the problem class. Moreover, when $M$ is small (a constant), we show that the sample complexity of (a variant of) \texttt{BAI-Cl} is (order-wise) minimax optimal. Experiments on synthetic and real-world (MovieLens, Yelp) data demonstrate the superior performance of the proposed algorithms in terms of sample and communication efficiency, particularly in settings where $M \ll N$.
Paperid:3191
Authors:Alekh Agarwal · Christoph Dann · Teodor Vanislavov Marinov
Abstract:
Offline algorithms for Reinforcement Learning from Human Preferences (RLHF), which use only a fixed dataset of sampled responses given an input, and preference feedback among these responses, have gained increasing prominence in the literature on aligning language models. In this paper, we study how the different design choices made in methods such as DPO, IPO, SLiC and many variants influence the quality of the learned policy, from a theoretical perspective. Our treatment yields insights into the choices of loss function, the policy which is used to normalize log-likelihoods, and also the role of the data sampling policy. Notably, our results do not rely on the standard reparameterization-style arguments used to motivate some of the algorithms in this family, which allows us to give a unified treatment to a broad class of methods. We also conduct a small empirical study to verify some of the theoretical findings on a standard summarization benchmark.
Paperid:3192
Authors:Yixuan Li · Changli Tang · Jimin Zhuang · Yudong Yang · Guangzhi Sun · Wei Li · Zejun MA · Chao Zhang
Abstract:
Human vision is dynamic and continuous. However, in video understanding with multimodal large language models (LLMs), existing methods primarily rely on static features extracted from images sampled at a fixed low frame rate of frames-per-second (FPS) $\leqslant$2, leading to critical visual information loss. In this paper, we introduce F-16, the first multimodal LLM designed for high-frame-rate video understanding. By increasing the frame rate to 16 FPS and compressing visual tokens within each 1-second clip, F-16 efficiently captures dynamic visual features while preserving key semantic information. Experimental results demonstrate that higher frame rates considerably enhance video understanding across multiple benchmarks, providing a new approach to improving video LLMs beyond scaling model size or training data. F-16 achieves state-of-the-art performance among 7-billion-parameter video LLMs on both general and fine-grained video understanding benchmarks, such as Video-MME and TemporalBench. Furthermore, F-16 excels in complex spatiotemporal tasks, including high-speed sports analysis (*e.g.*, basketball, football, gymnastics, and diving), outperforming SOTA proprietary visual models like GPT-4o and Gemini-1.5-pro. Additionally, we introduce a novel decoding method for F-16 that enables highly efficient low-frame-rate inference without requiring model retraining. Upon acceptance, we will release the source code, model checkpoints, and data.
Paperid:3193
Authors:Zhengzheng Lou · Hang Xue · Chaoyang Zhang · Shizhe Hu
Abstract:
Despite the superior capability in complementary information exploration and consistent clustering structure learning, most current weight-based multi-modal clustering methods still have three limitations: 1) lack of trustworthiness in learned weights; 2) isolated view weight learning; 3) extra weight parameters. Motivated by the peer-review mechanism in academia, we take a new peer-review perspective on the multi-modal clustering problem and propose to iteratively treat one modality as ``author'' and the remaining modalities as ``reviewers'' so as to reach a peer-review score for each modality. This essentially explores the underlying relationships among modalities. To improve trustworthiness, we further design a new trustworthy score with a self-supervised working mechanism. Following that, we propose a novel Peer-review Trustworthy Information Bottleneck (PTIB) method for weighted multi-modal clustering, where both of the above scores are simultaneously taken into account for accurate and parameter-free modality weight learning. Extensive experiments on eight multi-modal datasets suggest that PTIB outperforms state-of-the-art non-deep and deep multi-modal clustering methods.
Paperid:3194
Authors:Yu-Wei Chen · Raghu Pasupathy · Jordan A Awan
Abstract:
This work identifies the first privacy-aware stratified sampling scheme that minimizes the variance for general private mean estimation under the Laplace, Discrete Laplace (DLap) and Truncated-Uniform-Laplace (TuLap) mechanisms within the framework of differential privacy (DP). We view stratified sampling as a subsampling operation, which amplifies the privacy guarantee; however, to have the same final privacy guarantee for each group, different nominal privacy budgets need to be used depending on the subsampling rate. Ignoring the effect of DP, traditional stratified sampling strategies risk significant variance inflation. We phrase our optimal survey design as an optimization problem, where we determine the optimal subsampling sizes for each group with the goal of minimizing the variance of the resulting estimator. We establish strong convexity of the variance objective, propose an efficient algorithm to identify the integer-optimal design, and offer insights on the structure of the optimal design.
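For context, the classical non-private baseline that a DP-aware stratified design generalizes is Neyman allocation, where the variance-minimizing sample size for each stratum is proportional to the stratum's size times its standard deviation; the stratum sizes and standard deviations below are illustrative assumptions, not from the paper.

```python
import numpy as np

# Classical (non-private) Neyman allocation for stratified mean estimation:
# n_h proportional to N_h * sigma_h minimizes the estimator's variance.
# The numbers below are illustrative, not taken from the paper.
N = np.array([1000, 500, 200])      # stratum sizes
sigma = np.array([2.0, 5.0, 1.0])   # within-stratum standard deviations
n_total = 100                       # total sampling budget

n_h = n_total * (N * sigma) / np.sum(N * sigma)
print(np.round(n_h, 1))  # high-variance strata get proportionally more samples
```

The paper's point is that once each group's noise mechanism and subsampling-dependent privacy budget enter the variance, this proportional rule no longer applies and the optimal integer design must be found by convex optimization.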
Paperid:3195
Authors:Rui Ye · shuo tang · Rui Ge · Yaxin Du · Zhenfei Yin · Jing Shao · Siheng Chen
Abstract:
LLM-based multi-agent systems (MAS) have shown significant potential in tackling diverse tasks. However, to design effective MAS, existing approaches heavily rely on manual configurations or multiple calls of advanced LLMs, resulting in inadaptability and high inference costs. In this paper, we simplify the process of building an MAS by reframing it as a generative language task, where the input is a user query and the output is a corresponding MAS. To address this novel task, we unify the representation of MAS as executable code and propose a consistency-oriented data construction pipeline to create a high-quality dataset comprising coherent and consistent query-MAS pairs. Using this dataset, we train MAS-GPT, an open-source medium-sized LLM that is capable of generating query-adaptive MAS within a single LLM inference. The generated MAS can be seamlessly applied to process user queries and deliver high-quality responses. Extensive experiments on 9 benchmarks and 4 LLMs show that the proposed MAS-GPT consistently outperforms 10+ baseline MAS methods in diverse settings, indicating MAS-GPT's high effectiveness, efficiency and strong generalization ability. All code, models, and data will be open-sourced.
Paperid:3196
Authors:Yang Yu · Kai Han · Hang Zhou · Yehui Tang · Kaiqi Huang · Yunhe Wang · Dacheng Tao
Abstract:
While large-scale training data is fundamental for developing capable large language models (LLMs), strategically selecting high-quality data has emerged as a critical approach to enhance training efficiency and reduce computational costs. Current data selection methodologies predominantly rely on static, training-agnostic criteria, failing to account for dynamic data interactions and model learning trajectories. In this paper, we propose a new Data Weighting Model (DWM) that adjusts the weight of selected data within each batch to achieve dynamic data utilization during LLM training. Specifically, to better capture the dynamic data preferences of the trained model, a bi-level optimization framework is implemented to update the weighting model. Our experiments demonstrate that DWM enhances the performance of models trained with randomly selected data, and the learned weighting model can be transferred to enhance other data selection methods and models of different sizes. Moreover, we further analyze how a model's data preferences evolve throughout training, providing new insights into data preference during training.
Paperid:3197
Authors:Cyrille Kone · Emilie Kaufmann · Laura Richert
Abstract:
In this paper, we address the problem of identifying the Pareto Set under feasibility constraints in a multivariate bandit setting. Specifically, given a $K$-armed bandit with unknown means $\mu_1, \dots, \mu_K \in \mathbb{R}^d$, the goal is to identify the set of arms whose mean is not uniformly worse than that of another arm (i.e., not smaller for all objectives), while satisfying some known set of linear constraints, expressing, for example, some minimal performance on each objective. Our focus lies in fixed-confidence identification, for which we introduce an algorithm that significantly outperforms racing-like algorithms and the intuitive two-stage approach that first identifies feasible arms and then their Pareto Set. We further prove an information-theoretic lower bound on the sample complexity of any algorithm for constrained Pareto Set identification, showing that the sample complexity of our approach is near-optimal. Our theoretical results are supported by an extensive empirical evaluation on a series of benchmarks.
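Several abstracts in this section (this one and the multi-agent BAI paper above) build on elimination-style fixed-confidence identification. A generic single-bandit successive-elimination sketch is below; the confidence radius and Bernoulli rewards are illustrative choices, not any paper's exact procedure.

```python
import math
import random

def successive_elimination(means, delta=0.05, seed=0):
    """Generic successive-elimination sketch for single-objective best-arm
    identification. Each round samples every active arm once and drops arms
    whose empirical mean falls below the best by more than twice a
    confidence radius. The radius and Bernoulli rewards are illustrative."""
    rng = random.Random(seed)
    K = len(means)
    active = list(range(K))
    counts = [0] * K
    sums = [0.0] * K
    t = 0
    while len(active) > 1:
        t += 1
        for a in active:
            sums[a] += 1.0 if rng.random() < means[a] else 0.0
            counts[a] += 1
        rad = math.sqrt(math.log(4 * K * t * t / delta) / (2 * t))
        best = max(sums[a] / counts[a] for a in active)
        active = [a for a in active if sums[a] / counts[a] >= best - 2 * rad]
    return active[0]

print(successive_elimination([0.2, 0.5, 0.9]))
```

The constrained Pareto-set setting generalizes this: elimination must account for vector-valued means and feasibility with respect to the linear constraints, which is where the two-stage baseline and the proposed algorithm differ.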
Paperid:3198
Authors:Benjamin Holzschuh · Qiang Liu · Georg Kohl · Nils Thuerey
Abstract:
We introduce PDE-Transformer, an improved transformer-based architecture for surrogate modeling of physics simulations on regular grids. We combine recent architectural improvements of diffusion transformers with adjustments specific to large-scale simulations to yield a more scalable and versatile general-purpose transformer architecture, which can be used as the backbone for building large-scale foundation models in the physical sciences. We demonstrate that our proposed architecture outperforms state-of-the-art transformer architectures for computer vision on a large dataset of 16 different types of PDEs. We identify the expansion rate of token embeddings as a key indicator of performance. To keep the expansion rate constant across different tasks and PDEs, we propose to embed different physical channels individually as spatio-temporal tokens, which interact via channel-wise self-attention. We demonstrate that our pretrained models achieve improved performance on several challenging downstream tasks compared to training from scratch and also beat other foundation model architectures for physics simulations. We will publicly release our code and model weights upon acceptance.
Paperid:3199
Authors:Simin Chen · Pranav Pusarla · Baishakhi Ray
Abstract:
The rapid evolution of code large language models (LLMs) underscores the need for effective and transparent benchmarking of their reasoning capabilities. However, the current benchmarking approach heavily depends on publicly available, human-created datasets. The widespread use of these fixed benchmark datasets makes the benchmarking process \textit{static} and thus particularly susceptible to data contamination—an unavoidable consequence of the extensive data collection processes used to train Code LLMs. Existing approaches that address data contamination often suffer from human effort limitations and imbalanced problem complexity. To tackle these challenges, we propose DyCodeEval, a novel benchmarking suite for evaluating Code LLMs under potential data contamination. Given a seed programming problem, DyCodeEval employs multiple agents to extract and modify the context without altering the core logic, generating semantically equivalent variations. We introduce a new evaluation metric and conduct empirical studies on two seed datasets across eight Code LLMs. Results show that DyCodeEval effectively benchmarks reasoning capabilities under contamination risks while generating diverse problem sets to ensure consistent and reliable evaluations.
Paperid:3200
Authors:Tushar Aggarwal · Swayam Singh · Abhijeet Awasthi · Aditya Kanade · Nagarajan Natarajan
Abstract:
Software engineering activities frequently involve edits to existing code. However, contemporary code language models (LMs) lack the ability to handle diverse types of code-edit requirements. In this work, we attempt to overcome this shortcoming through (1) a novel synthetic data generation pipeline and (2) a robust model adaptation algorithm. Starting with seed code examples and diverse editing criteria, our pipeline generates high-quality samples comprising original and modified code, along with natural language instructions in different styles and verbosity. Today's code LMs come bundled with strong abilities, such as code generation and instruction following, which should not be lost due to fine-tuning. To ensure this, we propose a novel adaptation algorithm, SeleKT, that (a) leverages a dense gradient-based step to identify the weights that are most important for code editing, and (b) does a sparse projection onto the base model to avoid overfitting. Using our approach, we obtain a new model NextCoder (adapted from Qwen2.5-Coder-7B) that achieves strong results on four code-editing benchmarks, outperforming comparable size models and even several larger ones. We show the generality of our approach by improving DeepSeekCoder-6.7B and Qwen2.5-Coder-7B, compare against other fine-tuning approaches, and demonstrate robustness by showing retention of code generation abilities post adaptation.
Paperid:3201
Authors:Daniele Tramontano · Yaroslav Kivva · Saber Salehkaleybar · Negar Kiyavash · Mathias Drton
Abstract:
This paper investigates causal effect identification in latent variable Linear Non-Gaussian Acyclic Models (lvLiNGAM) using higher-order cumulants, addressing two prominent setups that are challenging in the presence of latent confounding: (1) a single proxy variable that may causally influence the treatment and (2) underspecified instrumental variable cases where fewer instruments exist than treatments. We prove that causal effects are identifiable with a single proxy or instrument and provide corresponding estimation methods. Experimental results demonstrate the accuracy and robustness of our approaches compared to existing methods, advancing the theoretical and practical understanding of causal inference in linear systems with latent confounders.
Paperid:3202
Authors:Jingxin Liu · Renda Han · Wenxuan Tu · Haotian Wang · Junlong Wu · Jieren Cheng
Abstract:
Subgraphs of a complete graph are usually distributed across multiple devices and only locally accessible due to privacy restrictions. However, existing node-level federated graph learning suffers from at least one of the following issues: 1) heavy reliance on labeled graph samples that are difficult to obtain in real-world applications, and 2) partitioning a complete graph into several subgraphs inevitably causes missing links, leading to sub-optimal sample representations. To solve these issues, we propose a novel $\underline{\text{Fed}}$erated $\underline{\text{N}}$ode-level $\underline{\text{C}}$lustering $\underline{\text{N}}$etwork (FedNCN), which mends the destroyed cross-subgraph links using clustering prior knowledge without sharing private data. Specifically, within each client, to ensure data privacy, we first design an MLP-based projector to implicitly preserve key clustering properties of a subgraph in a denoising learning-like manner, and then upload the resultant clustering signals, which are hard to reconstruct, for subsequent cross-subgraph link restoration. On the server, we maximize the potential affinity between subgraphs stemming from clustering signals by graph similarity estimation and minimize redundant links via the N-Cut criterion. Moreover, we employ a GNN-based generator to learn consensus prototypes from this mended graph, enabling the MLP-GNN joint-optimized learner to enhance data privacy protection and promote the local model for better clustering. Extensive experiments demonstrate the superiority of FedNCN.
Paperid:3203
Authors:Jingwen Fu · Nanning Zheng
Abstract:
Distributional Compositional Generalization (DCG) refers to the ability to tackle tasks from new distributions by leveraging the knowledge of concepts learned from supporting distributions. In this work, we aim to explore the statistical mechanisms of DCG, which have been largely overlooked in previous studies. By statistically formulating the problem, this paper seeks to address two key research questions: 1) Can a method for one DCG problem be applied to another? 2) What statistical properties can indicate a learning algorithm's capacity for knowledge composition in DCG tasks? \textbf{To address the first question}, an invariant measure is proposed to provide a dimension where all different methods converge. This measure underscores the critical role of data in enabling improvements without trade-offs. \textbf{As for the second question}, we reveal that by decoupling the impacts of insufficient data and knowledge composition, the ability of the learning algorithm to compose knowledge relies on the compatibility and sensitivity between the learning algorithm and the composition rule. In summary, the statistical analysis of the generalization mechanisms provided in this paper deepens our understanding of compositional generalization, offering a complementary perspective to prior studies.
Paperid:3204
Authors:Doron Cohen · Aryeh Kontorovich · Roi Weiss
Abstract:
We revisit the recently introduced Local Glivenko-Cantelli setting, which studies distribution-dependent uniform convergence rates of the Empirical Mean Estimator (EME). In this work, we investigate generalizations of this setting where arbitrary estimators are allowed rather than just the EME. Can a strictly larger class of measures be learned? Can better risk decay rates be obtained? We provide exhaustive answers to these questions—which are both negative, provided the learner is barred from exploiting some infinite-dimensional pathologies. On the other hand, allowing such exploits does lead to a strictly larger class of learnable measures.
Paperid:3205
Authors:Neil Mallinar · Daniel Beaglehole · Libin Zhu · Adityanarayanan Radhakrishnan · Parthe Pandit · Misha Belkin
Abstract:
Neural networks trained to solve modular arithmetic tasks exhibit grokking, a phenomenon where the test accuracy starts improving long after the model achieves 100% training accuracy in the training process. It is often taken as an example of "emergence", where model ability manifests sharply through a phase transition. In this work, we show that the phenomenon of grokking is not specific to neural networks nor to gradient descent-based optimization. Specifically, we show that this phenomenon occurs when learning modular arithmetic with Recursive Feature Machines (RFM), an iterative algorithm that uses the Average Gradient Outer Product (AGOP) to enable task-specific feature learning with general machine learning models. When used in conjunction with kernel machines, iterating RFM results in a fast transition from random, near zero, test accuracy to perfect test accuracy. This transition cannot be predicted from the training loss, which is identically zero, nor from the test loss, which remains constant in initial iterations. Instead, as we show, the transition is completely determined by feature learning: RFM gradually learns block-circulant features to solve modular arithmetic. Paralleling the results for RFM, we show that neural networks that solve modular arithmetic also learn block-circulant features. Furthermore, we present theoretical evidence that RFM uses such block-circulant features to implement the Fourier Multiplication Algorithm, which prior work posited as the generalizing solution neural networks learn on these tasks. Our results demonstrate that emergence can result purely from learning task-relevant features and is not specific to neural architectures nor gradient descent-based optimization methods. Furthermore, our work provides more evidence for AGOP as a key mechanism for feature learning in neural networks.
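As background for the Fourier Multiplication Algorithm the abstract refers to, the key identity is that the character map $a \mapsto e^{2\pi i k a / p}$ turns addition mod $p$ into ordinary multiplication; the modulus `p` and frequency `k` below are illustrative choices.

```python
import numpy as np

# The Fourier view of modular addition: mapping residues to p-th roots of
# unity turns (a + b) mod p into a product of complex numbers. This identity
# underlies the "Fourier Multiplication Algorithm" named in the abstract.
# The modulus p and frequency k are illustrative.
p, k = 7, 3
omega = np.exp(2j * np.pi * k / p)  # primitive character at frequency k

def char(a):
    """Character map: residue a -> omega**a, a point on the unit circle."""
    return omega ** (a % p)

a, b = 5, 6
# Addition mod p becomes multiplication of characters, exactly.
assert np.isclose(char(a) * char(b), char((a + b) % p))
```

A model whose features encode such characters (block-circulant structure in weight space) can therefore compute modular sums via elementwise products, which is the generalizing solution the paper studies.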
Paperid:3206
Authors:Nicolás Astorga · Tennison Liu · Yuanzhang Xiao · M van der Schaar
Abstract:
Mathematical optimization is fundamental to decision-making across diverse domains, from operations research to healthcare. Yet, translating real-world problems into optimization models remains a difficult task, often demanding specialized expertise. This paper approaches the problem of autoformulation---the automated creation of solver-ready optimization models from natural language problem descriptions. We identify three core challenges of autoformulation: (1) the vast, problem-dependent hypothesis space, (2) efficient and diverse exploration of this space under uncertainty, and (3) evaluation of formulation correctness against the problem description. To address these challenges, we present a novel method leveraging Large Language Models (LLMs) with Monte-Carlo Tree Search, exploiting the hierarchical nature of optimization modeling to generate and systematically explore possible formulations. To enhance search efficiency, we introduce symbolic pruning to eliminate trivially equivalent search paths (branches), and employ LLM-based evaluation of partial formulations to guide the search. Empirical analysis on linear and mixed-integer programming benchmarks demonstrates our method's effectiveness, with significant performance gains from both LLM-based value estimation and symbolic pruning techniques.
Paperid:3207
Authors:Alessio Russo · Aldo Pacchiano
Abstract:
We study the policy evaluation problem in an online multi-reward multi-policy discounted setting, where multiple reward functions must be evaluated simultaneously for different policies. We adopt an $(\epsilon,\delta)$-PAC perspective to achieve $\epsilon$-accurate estimates with high confidence across finite or convex sets of rewards, a setting that has not been investigated in the literature. Building on prior work on Multi-Reward Best Policy Identification, we adapt the MR-NaS exploration scheme to jointly minimize the sample complexity for evaluating different policies across different reward sets. Our approach leverages an instance-specific lower bound revealing how the sample complexity scales with a measure of value deviation, guiding the design of an efficient exploration policy. Although computing this bound entails a hard non-convex optimization, we propose an efficient convex approximation that holds for both finite and convex reward sets. Experiments in tabular domains demonstrate the effectiveness of this adaptive exploration scheme.
Paperid:3208
Authors:Whiyoung Jung · Sunghoon Hong · Deunsol Yoon · Kanghoon Lee · Woohyung Lim
Abstract:
Multi-Agent Reinforcement Learning (MARL) faces challenges in sparse reward environments, where agents struggle to learn effective strategies. Macro-actions—sequences of actions executed as single decisions—facilitate long-term planning but introduce asynchrony due to varying durations across agents, complicating Centralized Training with Decentralized Execution (CTDE). Existing methods handle this with padding, which misaligns asynchronous experiences, leading to spurious correlations. We propose an Agent-Centric Actor-Critic (ACAC) algorithm that addresses asynchrony without relying on padding. ACAC employs agent-centric history encoders for independent trajectory processing and an attention-based centralized critic to integrate agent information. Experiments on macro-action benchmarks show that ACAC accelerates convergence and enhances overall performance compared to baseline methods, demonstrating its effectiveness in complex MARL tasks.
Paperid:3209
Authors:Peter Pao-Huang · Mitchell Black · Xiaojie Qiu
Abstract:
Generative modeling of graphs with spatial structure is essential across many applications, from computer graphics to spatial genomics. Recent flow-based generative models have achieved impressive results by gradually adding and then learning to remove noise from these graphs. Existing models, however, use graph neural network architectures that are independent of the noise level, limiting their expressiveness. To address this issue, we introduce *Noise Conditioned Graph Networks* (NCGNs), a class of graph neural networks that dynamically modify their architecture according to the noise level during generation. Our theoretical and empirical analysis reveals that as noise increases, (1) graphs require information from increasingly distant neighbors and (2) graphs can be effectively represented at lower resolutions. Based on these insights, we develop Dynamic Message Passing (DMP), a specific instantiation of NCGNs that adapts both the range and resolution of message passing to the noise level. DMP consistently outperforms noise-independent architectures on a variety of domains including $3$D point clouds, spatiotemporal transcriptomics, and images.
Paperid:3210
Authors:Ryotaro Kawata · Kohsei Matsutani · Yuri Kinoshita · Naoki Nishikawa · Taiji Suzuki
Abstract:
Mixture of Experts (MoE), an ensemble of specialized models equipped with a router that dynamically distributes each input to appropriate experts, has achieved successful results in the field of machine learning. However, theoretical understanding of this architecture lags behind due to its inherent complexity. In this paper, we theoretically study the sample and runtime complexity of MoE trained with stochastic gradient descent when learning a regression task with an underlying cluster structure of single index models. On the one hand, we show that a vanilla neural network fails to detect such a latent organization, as it can only process the problem as a whole. This is intrinsically related to the concept of information exponent, which is low for each cluster but increases when we consider the entire task. On the other hand, with a MoE, we show that it succeeds in dividing the problem into easier subproblems by leveraging the ability of each expert to weakly recover the simpler function corresponding to an individual cluster. To the best of our knowledge, this work is among the first to explore the benefits of the MoE framework by examining its SGD dynamics in the context of nonlinear regression.
Paperid:3211
Authors:Jialong Guo · Xinghao Chen · Yehui Tang · Yunhe Wang
Abstract:
Large language models (LLMs) have garnered significant attention and demonstrated impressive capabilities in a wide range of applications. However, due to their enormous computational costs, the deployment and application of LLMs are often severely limited. To address this issue, structured pruning is an effective solution to compress the parameters of LLMs. Determining the importance of each sub-module in LLMs and minimizing performance loss are critical issues that need to be carefully addressed in structured pruning. In this paper, we propose an effective and fast structured pruning method named SlimLLM for large language models. For channel and attention head pruning, we evaluate importance based on the entire channel or head, rather than merely aggregating the importance of individual elements within a sub-module. This approach enables a more holistic consideration of the interdependencies among elements within the sub-module. In addition, we design a simple linear regression strategy for the output matrix to quickly recover performance. We also propose a layer-based importance ratio to determine the pruning ratio for each layer. Based on the LLaMA benchmark results, our SlimLLM outperforms other methods and achieves state-of-the-art performance.
Paperid:3212
Authors:Jingtan Wang · Xiaoqiang Lin · Rui Qiao · Pang Wei Koh · Chuan-Sheng Foo · Bryan Kian Hsiang Low
Abstract:
Curating data for instruction tuning is crucial for enhancing the performance of large language models (LLMs). This work aims to select training data for instruction tuning to improve LLM performance on specific tasks. Existing methods often rely on next-token prediction (NTP) loss as a proxy for target task performance due to the non-differentiable nature of performance evaluation metrics. They select training data points that are most helpful in reducing validation loss. However, there is a discrepancy between minimizing NTP loss and maximizing performance (e.g., code pass rate in code generation). To remedy this, we introduce a novel Non-differentiable evaluation metric-based InfluenCe Estimation (NICE), which leverages the policy gradient to select the training data that improves performance. Moreover, NICE can perform data selection in the absence of labels (ground-truth responses) when the evaluation metrics do not require them (e.g., a reward model can output reward scores without supervision from labels). Experimental results show that our approach outperforms existing data selection baselines that use NTP loss in diverse and realistic scenarios. Notably, subsets selected by NICE often produce models that outperform those trained on the full dataset.
Paperid:3213
Authors:Fanmeng Wang · Minjie Cheng · Hongteng Xu
Abstract:
Predicting molecular ground-state conformation (i.e., energy-minimized conformation) is crucial for many chemical applications such as molecular docking and property prediction. Classic energy-based simulation is time-consuming when solving this problem, while existing learning-based methods have advantages in computational efficiency but sacrifice accuracy and interpretability. In this work, we propose a novel and effective method to bridge the energy-based simulation and the learning-based strategy, which designs and learns a Wasserstein gradient flow-driven SE(3)-Transformer, called WGFormer, for molecular ground-state conformation prediction. Specifically, our method tackles this task within an auto-encoding framework, which encodes low-quality conformations by the proposed WGFormer and decodes corresponding ground-state conformations by an MLP. The architecture of WGFormer corresponds to Wasserstein gradient flows --- it optimizes molecular conformations by minimizing an energy function defined on the latent mixture models of atoms, thereby significantly improving performance and interpretability. Extensive experiments show that our method consistently outperforms state-of-the-art competitors, providing a new and insightful paradigm to predict molecular ground-state conformation.
Paperid:3214
Authors:Weihan Li · Linyun Zhou · YangJian · Shengxuming Zhang · Xiangtong Du · Xiuming Zhang · Jing Zhang · Chaoqing Xu · Mingli Song · Zunlei Feng
Abstract:
Pathology image segmentation plays a pivotal role in artificial digital pathology diagnosis and treatment. Existing approaches to pathology image segmentation are hindered by labor-intensive annotation processes and limited accuracy in tail-class identification, primarily due to the long-tail distribution inherent in gigapixel pathology images. In this work, we introduce the Laplace Diffusion Model, referred to as L-Diffusion, an innovative framework tailored for efficient pathology image segmentation. L-Diffusion utilizes multiple Laplace distributions, as opposed to Gaussian distributions, to model distinct components --- a methodology supported by theoretical analysis that significantly enhances the decomposition of features within the feature space. A sequence of feature maps is initially generated through a series of diffusion steps. Following this, contrastive learning is employed to refine the pixel-wise vectors derived from the feature map sequence. By utilizing these highly discriminative pixel-wise vectors, the segmentation module achieves a harmonious balance of precision and robustness with remarkable efficiency. Extensive experimental evaluations demonstrate that L-Diffusion attains improvements of up to 7.16\%, 26.74\%, 16.52\%, and 3.55\% on tissue segmentation datasets, and 20.09\%, 10.67\%, 14.42\%, and 10.41\% on cell segmentation datasets, as quantified by DICE, MPA, mIoU, and FwIoU metrics. The source code is available at https://anonymous.4open.science/r/LDiffusion-F027/README.md
Paperid:3215
Authors:Zhenxing Mi · Kuan-Chieh Wang · Guocheng Qian · Hanrong Ye · Runtao Liu · Sergey Tulyakov · Kfir Aberman · Dan Xu
Abstract:
This paper presents ThinkDiff, a novel alignment paradigm that empowers text-to-image diffusion models with multimodal in-context understanding and reasoning capabilities by integrating the strengths of vision-language models (VLMs). Existing multimodal diffusion finetuning methods largely focus on pixel-level reconstruction rather than in-context reasoning, and are constrained by the complexity and limited availability of reasoning-based datasets. ThinkDiff addresses these challenges by leveraging vision-language training as a proxy task, aligning VLMs with the decoder of an encoder-decoder large language model (LLM) instead of a diffusion decoder. This proxy task builds on the observation that the LLM decoder shares the same input feature space with diffusion decoders that use the corresponding LLM encoder for prompt embedding. As a result, aligning VLMs with diffusion decoders can be simplified through alignment with the LLM decoder. Without complex training and datasets, ThinkDiff effectively unleashes understanding, reasoning, and composing capabilities in diffusion models. Experiments demonstrate that ThinkDiff significantly improves accuracy from 19.2% to 46.3% on the challenging CoBSAT benchmark for multimodal in-context reasoning generation, with only 5 hours of training on 4 A100 GPUs. Additionally, ThinkDiff demonstrates exceptional performance in composing multiple images and texts into logically coherent images.
Paperid:3216
Authors:Shuangfei Zhai · Ruixiang Zhang · Preetum Nakkiran · David Berthelot · Jiatao Gu · Huangjie Zheng · Tianrong Chen · Miguel Angel Bautista Martin · Navdeep Jaitly · Joshua M Susskind
Abstract:
Normalizing Flows (NFs) are likelihood-based models for continuous inputs. They have demonstrated promising results on both density estimation and generative modeling tasks, but have received relatively little attention in recent years. In this work, we demonstrate that NFs are more powerful than previously believed. We present TarFlow: a simple and scalable architecture that enables highly performant NF models. TarFlow can be thought of as a Transformer-based variant of Masked Autoregressive Flows (MAFs): it consists of a stack of autoregressive Transformer blocks on image patches, alternating the autoregression direction between layers. TarFlow is straightforward to train end-to-end, and capable of directly modeling and generating pixels. We also propose three key techniques to improve sample quality: Gaussian noise augmentation during training, a post-training denoising procedure, and an effective guidance method for both class-conditional and unconditional settings. Putting these together, TarFlow sets new state-of-the-art results on likelihood estimation for images, beating the previous best methods by a large margin, and generates samples with quality and diversity comparable to diffusion models, for the first time with a stand-alone NF model.
Paperid:3217
Authors:Jaejun Lee · Joyce Whang
Abstract:
Hyper-relational knowledge graphs (HKGs) enrich knowledge graphs by extending a triplet to a hyper-relational fact, where a set of qualifiers adds auxiliary information to a triplet. While many HKG representation learning methods have been proposed, they often fail to effectively utilize the HKG's structure. This paper demonstrates that thoroughly leveraging the structure of an HKG is crucial for reasoning on HKGs, and that a purely structure-based representation learning method can achieve state-of-the-art performance on various link prediction tasks. We propose MAYPL, which learns to initialize representation vectors based on the structure of an HKG and employs attentive neural message passing consisting of fact-level message computation and entity-centric and relation-centric aggregations, thereby computing the representations based solely on the structure. Due to its structure-driven learning, MAYPL can conduct inductive inferences on new entities and relations. MAYPL outperforms 40 knowledge graph completion methods across 10 datasets, with different baseline methods compared on different datasets so that it is tested from diverse perspectives.
Paperid:3218
Authors:Raphael Meyer · William Swartworth · David Woodruff
Abstract:
We study the computational model where we can access a matrix $\mathbf{A}$ only by computing matrix-vector products $\mathbf{A}\mathrm{x}$ for vectors of the form $\mathrm{x} = \mathrm{x}_1 \otimes \cdots \otimes \mathrm{x}_q$. We prove exponential lower bounds on the number of queries needed to estimate various properties, including the trace and the top eigenvalue of $\mathbf{A}$. Our proofs hold for all adaptive algorithms, modulo a mild conditioning assumption on the algorithm's queries. We further prove that algorithms whose queries come from a small alphabet (e.g., $\mathrm{x}_i \in \\{\pm1\\}^n$) cannot test if $\mathbf{A}$ is identically zero with polynomial complexity, despite the fact that a single query using Gaussian vectors solves the problem with probability 1. In sharp contrast to the non-Kronecker case, this shows that sketching $\mathbf{A}$ with different distributions of the same subgaussian norm can yield exponentially different query complexities. Our proofs follow from the observation that random vectors with Kronecker structure have exponentially smaller inner products than their non-Kronecker counterparts.
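As a minimal illustration of this access model (dimensions, the specific matrix, and the helper name `kron_query` are hypothetical, chosen only for the demo), the sketch below forms a Kronecker-structured query $\mathbf{A}(\mathrm{x}_1 \otimes \mathrm{x}_2)$ and shows the Gaussian single-query zero test: for a nonzero $\mathbf{A}$, a query with Gaussian factor vectors returns a nonzero vector almost surely.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 4, 2  # per-mode dimension and Kronecker order (hypothetical sizes)

# A nonzero matrix that we may only observe through queries.
A = np.zeros((n**q, n**q))
A[0, 0] = 1.0

def kron_query(A, factors):
    """One query in the model: A (x_1 ⊗ ... ⊗ x_q)."""
    x = factors[0]
    for f in factors[1:]:
        x = np.kron(x, f)
    return A @ x

# A single query with Gaussian factor vectors detects A != 0
# with probability 1: the response is almost surely nonzero.
g = [rng.standard_normal(n) for _ in range(q)]
assert np.linalg.norm(kron_query(A, g)) > 0
```

The lower bound in the abstract concerns the reverse situation: restricting the factors $\mathrm{x}_i$ to a small alphabet such as $\{\pm 1\}^n$ makes this zero-testing problem exponentially hard, even though the Gaussian version above needs one query.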
Paperid:3219
Authors:Mikhail Mironov · Liudmila Prokhorenkova
Abstract:
The concept of diversity is widely used in various applications: from image or molecule generation to recommender systems. Thus, being able to properly measure diversity is important. This paper addresses the problem of evaluating diversity for a set of objects. First, we make a systematic review of existing diversity measures and explore their undesirable behavior in some cases. Based on this review, we formulate three desirable properties (axioms) of a reliable diversity measure: monotonicity, uniqueness, and continuity. We show that none of the existing diversity measures has all three properties, and thus these measures are not suitable for quantifying diversity. Then, we construct two examples of measures that have all the desirable properties, thus proving that the list of axioms is not self-contradictory. Unfortunately, the constructed examples are too computationally complex for practical use, thus we pose an open problem of constructing a diversity measure that has all the listed properties and can be computed in practice.
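To make the "undesirable behavior" concrete, here is a small sketch (not taken from the paper; the measure and the example set are illustrative assumptions) showing that a common ad-hoc measure, average pairwise distance, can *decrease* when a genuinely new, distinct object is added, which is the kind of counterintuitive behavior an axiomatic treatment aims to rule out.

```python
import itertools
import numpy as np

def avg_pairwise_distance(points):
    """A common ad-hoc diversity measure: mean distance over all pairs."""
    pairs = list(itertools.combinations(points, 2))
    return float(np.mean([abs(a - b) for a, b in pairs]))

base = [0.0, 1.0]
extended = [0.0, 0.5, 1.0]  # add a distinct new object between the others

# The measure drops from 1.0 to 2/3 even though a new object was added,
# an arguably undesirable response for a diversity measure.
assert avg_pairwise_distance(extended) < avg_pairwise_distance(base)
```

Whether this particular behavior violates the paper's exact monotonicity axiom depends on its formal statement; the snippet is only meant to show why naive measures warrant scrutiny.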
Paperid:3220
Authors:Mateusz Olko · Mateusz Gajewski · Joanna Wojciechowska · Mikołaj Morzy · Piotr Sankowski · Piotr Milos
Abstract:
Neural causal discovery methods have recently improved in terms of scalability and computational efficiency. However, our systematic evaluation highlights significant room for improvement in their accuracy when uncovering causal structures. We identify a fundamental limitation: \textit{neural networks cannot reliably distinguish between existing and non-existent causal relationships in the finite sample regime}. Our experiments reveal that neural networks, as used in contemporary causal discovery approaches, lack the precision needed to recover ground-truth graphs, even for small graphs and relatively large sample sizes. Furthermore, we identify the faithfulness property as a critical bottleneck: (i) it is likely to be violated across any reasonable dataset size range, and (ii) its violation directly undermines the performance of neural discovery methods. These findings lead us to conclude that progress within the current paradigm is fundamentally constrained, necessitating a paradigm shift in this domain.
Paperid:3221
Authors:Zhen Xiang · Linzhi Zheng · Yanjie Li · Junyuan Hong · Qinbin Li · Han Xie · Jiawei Zhang · Zidi Xiong · Chulin Xie · Carl Yang · Dawn Song · Bo Li
Abstract:
The rapid advancement of large language model (LLM) agents has raised new concerns regarding their safety and security, which cannot be addressed by traditional LLM guardrails focused on textual harms. We propose GuardAgent, the first guardrail agent to protect other agents by checking whether the agent actions satisfy safety guard requests. Specifically, GuardAgent first analyzes the safety guard requests to generate a task plan, and then converts this plan into guardrail code for execution. In both steps, an LLM is utilized as the reasoning component, supplemented by in-context demonstrations retrieved from a memory module storing information from previous tasks. GuardAgent can understand different safety guard requests and provide reliable code-based guardrails with high flexibility and low operational overhead. In addition, we propose two novel benchmarks: the EICU-AC benchmark to assess access control for healthcare agents and the Mind2Web-SC benchmark to evaluate safety regulations for web agents. We show that GuardAgent effectively moderates the violation actions for two types of agents on these two benchmarks, with over 98% and 83% guardrail accuracies, respectively.
Paperid:3222
Authors:Zhangyi Hu · Jiemin Wu · Hua XU · Mingqian Liao · Ninghui Feng · Bo Gao · Songning Lai · Yutao Yue
Abstract:
Irregular Multivariate Time Series (IMTS) prediction, which suffers from unaligned multi-channel signals and massive missing values, is crucial in various areas such as healthcare and biomechanics. Current IMTS-specific methods model temporal and channel patterns in isolation, which, combined with significant missing values, leads to challenging pattern capturing and poor few-shot capability. While foundation models can leverage similar patterns learned from pre-training to enhance representation learning, they are currently limited to regular Univariate Time Series (UTS). Motivated by the visual Masked AutoEncoder's (MAE) success in RGB image modeling and UTS prediction, we propose \textbf{VIMTS}, a framework that adapts \textbf{V}isual MAE for \textbf{IMTS} prediction. To jointly model temporal-channel patterns, VIMTS first divides the timeline into equal intervals and unaligned sparse signals into image-like patches, enriching channel representations through cross-channel dependencies. The model then leverages MAE's pre-trained capability in handling sparse multichannel data for patch reconstruction, followed by a coarse-to-fine technique for precise predictions. Additionally, we combine supervised fine-tuning with self-supervised learning to enhance IMTS modeling. Extensive experiments demonstrate VIMTS's superior performance and strong few-shot capability, paving the way for applying visual foundation models to more general time series prediction tasks.
Paperid:3223
Authors:Xiaoling Wu · Junpeng Zhu · Zeng Li
Abstract:
Weight matrix compression has been demonstrated to effectively reduce overfitting and improve the generalization performance of deep neural networks. Compression is primarily achieved by filtering out noisy eigenvalues of the weight matrix. In this work, a novel Population Double Bulk (PDB) model is proposed to characterize the eigenvalue behavior of the weight matrix, which is more general than the existing Population Unit Bulk (PUB) model. Based on the PDB model and Random Matrix Theory (RMT), we have discovered a new PDBLS algorithm for determining the boundary between noisy eigenvalues and information. A PDB Noise-Filtering algorithm is further introduced to reduce the rank of the weight matrix for compression. Experiments show that our PDB model fits the empirical distribution of eigenvalues of the weight matrix better than the PUB model, and our compressed weight matrices have lower rank at the same level of test accuracy. In some cases, our compression method can even improve generalization performance when labels contain noise. The code is available at https://anonymous.4open.science/r/PDBLS.
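The rank-reduction step described here can be sketched generically: the snippet below truncates a weight matrix's spectrum at a threshold. The threshold used (the median singular value) is a placeholder assumption, not the paper's RMT-derived PDBLS boundary; only the overall keep-above-the-boundary structure is illustrated.

```python
import numpy as np

def filter_spectrum(W, threshold):
    """Keep only spectral components above a noise/information boundary.
    Generic SVD truncation; the paper derives the boundary from RMT."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    keep = s > threshold
    return (U[:, keep] * s[keep]) @ Vt[keep]

rng = np.random.default_rng(2)
W = rng.standard_normal((64, 32))  # stand-in for a trained weight matrix

# Placeholder boundary: the median singular value (NOT the PDBLS rule).
boundary = np.median(np.linalg.svd(W, compute_uv=False))
W_low = filter_spectrum(W, boundary)

# The filtered matrix has strictly lower rank than the original.
assert np.linalg.matrix_rank(W_low) < np.linalg.matrix_rank(W)
```

In the paper, the interesting part is precisely how to place `boundary` between the noise bulk and the informative outliers; this sketch only shows what happens once that boundary is known.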
Paperid:3224
Authors:Ivan Skorokhodov · Sharath Girish · Benran Hu · Willi Menapace · Yanyu Li · Rameen Abdal · Sergey Tulyakov · Aliaksandr Siarohin
Abstract:
Latent Diffusion Models (LDMs) have emerged as the leading approach for generating high-quality images and videos, utilizing compressed latent representations to reduce the computational cost of diffusion processes. While recent advancements have primarily focused on scaling diffusion backbones and improving autoencoder reconstruction quality, the interaction between these components has received comparatively less attention. In this work, we conduct a spectral analysis of latent spaces in widely used autoencoders and identify a prominent high-frequency component that deviates from natural RGB signals --- an effect that becomes even more pronounced in recent autoencoders with a large number of channels. We hypothesize that this high-frequency component reduces the efficiency of the diffusion process and increases generation complexity. To address this issue, we propose a simple yet effective spectral regularization strategy that aligns latent space representations with RGB signals across frequencies by enforcing scale equivariance in the decoder. Our method requires minimal modifications yet significantly improves generation quality, reducing FID by 19\% for image generation on ImageNet-1K $256^2$ and FVD by at least $44$\% for video generation on Kinetics-700 $17 \times 256^2$ after up to $20,000$ autoencoder fine-tuning steps.
Paperid:3225
Authors:Krzysztof Kacprzyk · Julianna Piskorz · M van der Schaar
Abstract:
While black-box approaches are commonly used for data-driven modeling of dynamical systems, they often obscure a system's underlying behavior and properties, limiting adoption in areas such as medicine and pharmacology. A two-step process of discovering ordinary differential equations (ODEs) and their subsequent mathematical analysis can yield insights into the system's dynamics. However, this analysis may be infeasible for complex equations, and refining the ODE to meet certain behavioral requirements can be challenging. Direct semantic modeling has recently been proposed to address these issues by predicting the system's behavior, such as the trajectory's shape, directly from data, bypassing post-hoc mathematical analysis. In this work, we extend the original instantiation, limited to one-dimensional trajectories and inputs, to accommodate multi-dimensional trajectories with additional personalization, allowing evolution to depend on auxiliary static features (e.g., patient covariates). In a series of experiments, we show how our approach enables practitioners to integrate prior knowledge, understand the dynamics, ensure desired behaviors, and revise the model when necessary.
Paperid:3226
Authors:Alexander Nikulin · Ilya Zisman · Denis Tarasov · Nikita Lyubaykin · Andrei Polubarov · Igor Kiselev · Vladislav Kurenkov
Abstract:
Recently, latent action learning, pioneered by Latent Action Policies (LAPO), has shown remarkable pre-training efficiency on observation-only data, offering potential for leveraging the vast amounts of video available on the web for embodied AI. However, prior work has focused on distractor-free data, where changes between observations are primarily explained by ground-truth actions. Unfortunately, real-world videos contain action-correlated distractors that may hinder latent action learning. Using the Distracting Control Suite (DCS), we empirically investigate the effect of distractors on latent action learning and demonstrate that LAPO struggles in such scenarios. We propose LAOM, a simple LAPO modification that improves the quality of latent actions by 8x, as measured by linear probing. Importantly, we show that providing supervision with ground-truth actions, on as few as 2.5% of the full dataset, during latent action learning improves downstream performance by 4.2x on average. Our findings suggest that integrating supervision during Latent Action Model (LAM) training is critical in the presence of distractors, challenging the conventional pipeline of first learning a LAM and only then decoding from latent to ground-truth actions.
Paperid:3227
Authors:Jacob Bamberger · Benjamin Gutteridge · Scott le Roux · Michael Bronstein · Xiaowen Dong
Abstract:
Long-range graph tasks --- those dependent on interactions between 'distant nodes' --- are an open problem in graph neural network research. Real-world benchmark tasks, especially the Long Range Graph Benchmark (LRGB), have become popular for validating the long-range capability of proposed architectures. However, this is an empirical approach that lacks both robustness and theoretical underpinning; a more principled characterization of the long-range problem is required. To bridge this gap, we introduce a formal range measure for operators on graphs, and validate it with synthetic experiments dependent on both long and short ranges. We then leverage our measure to examine commonly used tasks and architectures, and discuss to what extent they are, in fact, long-range. We believe that our work will aid ongoing efforts at defining and tackling the long-range problem on graphs, and that our range measure will prove a useful tool for assessing new datasets and architectures.
Paperid:3228
Authors:Chunhui Zhang · Sean Dae Houlihan · Kwonjoon Lee · Nakul Agarwal · Zhongyu Ouyang · Soroush Vosoughi · Shao-Yuan Lo
Abstract:
Theory-of-mind (ToM) enables humans to infer mental states --- such as beliefs, desires, and intentions --- forming the foundation of social cognition. Existing computational ToM methods rely on structured workflows with ToM-specific priors or deep model fine-tuning, but struggle with scalability in multimodal environments. They remain trapped within the gravitational pull of multi-step planning complexity, failing to generalize as task demands increase. To overcome these limitations, we propose a scalable Bayesian ToM planner. It breaks down ToM complexity into stepwise Bayesian updates. Meanwhile, weak-to-strong control specializes smaller LMs to refine ToM-specific likelihood estimation, transferring their ToM reasoning behavior to larger LMs (7B to 405B) for social and world knowledge integration. This synergistic approach enables scalability, aligning large-model inference of human mental states with Bayesian principles. Extensive experiments demonstrate a 4.6% improvement in accuracy over state-of-the-art methods on multimodal ToM benchmarks, including unseen scenarios, establishing a new standard for modeling human mental states in complex environments. Our code and datasets are available at: https://anonymous.4open.science/r/scale-bayesian-tom-248B
Paperid:3229
Authors:Mingyang Sun · Pengxiang Ding · Weinan Zhang · Donglin Wang
Abstract:
Diffusion policies have shown promise in learning complex behaviors from demonstrations, particularly for tasks requiring precise control and long-term planning. However, they face challenges in robustness when encountering distribution shifts. This paper explores improving diffusion-based imitation learning models through online interactions with the environment. We propose OTPR (Optimal Transport-guided score-based diffusion Policy for Reinforcement learning fine-tuning), a novel method that integrates diffusion policies with RL using optimal transport theory. OTPR leverages the Q-function as a transport cost and views the policy as an optimal transport map, enabling efficient and stable fine-tuning. Moreover, we introduce masked optimal transport to guide state-action matching using expert keypoints and a compatibility-based resampling strategy to enhance training stability. Experiments on three simulation tasks demonstrate OTPR's superior performance and robustness compared to existing methods, especially in complex and sparse-reward environments. In sum, OTPR provides an effective framework for combining IL and RL, achieving versatile and reliable policy learning.
Paperid:3230
Authors:Urszula Julia Komorowska · Francisco Vargas · Alessandro Rondina · Pietro Lió · Mateja Jamnik
Abstract:
A protein's backbone flexibility is a crucial property that heavily influences its functionality. Recent work in the field of protein diffusion probabilistic modelling has leveraged Normal Mode Analysis (NMA) and, for the first time, introduced information about large-scale protein motion into the generative process. However, obtaining molecules with both the desired dynamics and designable quality has proven challenging. In this work, we present NMA-tune, a new method that introduces the dynamics information at the protein design stage. NMA-tune uses a trainable component to condition the backbone generation on the lowest normal mode of oscillation. We implement NMA-tune as a plug-and-play extension to RFdiffusion, show that the proportion of samples with high-quality structure and the desired dynamics is improved as compared to other methods without the trainable component, and show the presence of the targeted modes in Molecular Dynamics simulations.
Paperid:3231
Authors:Longzhu He · Chaozhuo Li · Peng Tang · Sen Su
Abstract:
Graph Neural Networks (GNNs) have demonstrated superior performance in a variety of graph mining and learning tasks. However, when node representations involve sensitive personal information or variables related to individuals, learning from graph data can raise significant privacy concerns. Although recent studies have explored local differential privacy (LDP) to address these concerns, they often introduce significant distortions to graph data, severely degrading learning utility (e.g., classification accuracy). In this paper, we present UPGNet, an LDP-based privacy-preserving graph learning framework that enhances utility while safeguarding user privacy. Specifically, we propose a three-stage pipeline that generalizes the LDP protocols for node features, targeting privacy-sensitive scenarios. Our analysis identifies two key factors that affect the utility of privacy-preserving graph learning: feature dimension and neighborhood size. Based on this analysis, UPGNet enhances utility by introducing two core layers: the High-Order Aggregator (HOA) layer and the Node Feature Regularization (NFR) layer. Extensive experiments on real-world datasets indicate that UPGNet significantly outperforms existing methods in terms of both privacy protection and learning utility.
Paperid:3232
Authors:Paul Mangold · Alain Oliviero Durmus · Aymeric Dieuleveut · Eric Moulines
Abstract:
This paper proposes a novel analysis for the Scaffold algorithm, a popular method for dealing with data heterogeneity in federated learning. While its convergence in deterministic settings --- where local control variates mitigate client drift --- is well established, the impact of stochastic gradient updates on its performance is less understood. To address this problem, we first show that its global parameters and control variates define a Markov chain that converges to a stationary distribution in the Wasserstein distance. Leveraging this result, we prove that Scaffold achieves linear speed-up in the number of clients up to higher-order terms in the step size. Nevertheless, our analysis reveals that Scaffold retains a higher-order bias, similar to FedAvg, that does not decrease as the number of clients increases. This highlights opportunities for developing improved stochastic federated learning algorithms.
Paperid:3233
Authors:Hui Zeng · Wenke Huang · Tongqing Zhou · Xinyi Wu · Guancheng Wan · Yingwen Chen · CAI ZHIPING
Abstract:
Turning multi-round vanilla Federated Learning into one-shot FL (OFL) significantly reduces the communication burden and makes a big leap toward practical deployment. However, as this work empirically and theoretically unravels, existing OFL falls into a garbage-in (inconsistent one-shot local models), garbage-out (degraded global model) pitfall. The inconsistency manifests as divergent feature representations and sample predictions. This work presents a novel OFL framework, FAFI, that enhances one-shot training on the client side to essentially overcome inferior local uploading. Specifically, unsupervised feature alignment and category-wise prototype learning are adopted for clients' local training to be consistent in representing local samples. On this basis, FAFI uses informativeness-aware feature fusion and prototype aggregation for global inference. Extensive experiments on three datasets demonstrate the effectiveness of FAFI, which achieves superior performance compared with 11 OFL baselines (+10.86% accuracy).
Paperid:3234
Authors:Jakob Burkhardt · Hannah Keller · Claudio Orlandi · Chris Schwiegelshohn
Abstract:
We introduce the linear-transformation model, a distributed model of differentially private data analysis. Clients have access to a trusted platform capable of applying a public matrix to their inputs. Such computations can be securely distributed across multiple servers using simple and efficient secure multiparty computation techniques. The linear-transformation model serves as an intermediate model between the highly expressive central model and the minimal local model. In the central model, clients have access to a trusted platform capable of applying any function to their inputs. However, this expressiveness comes at a cost, as it is often expensive to distribute such computations, leading to the central model typically being implemented by a single trusted server. In contrast, the local model assumes no trusted platform, which forces clients to add significant noise to their data. The linear-transformation model avoids the single point of failure for privacy present in the central model, while also mitigating the high noise required in the local model. We demonstrate that linear transformations are very useful for differential privacy, allowing for the computation of linear sketches of input data. These sketches largely preserve utility for tasks such as private low-rank approximation and private ridge regression, while introducing only minimal error, critically independent of the number of clients.
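A minimal sketch of the kind of computation this model enables (all dimensions, norm bounds, and the use of the standard Gaussian mechanism here are illustrative assumptions, not the paper's constructions): the platform applies a public matrix $S$ to the sum of client inputs and adds noise calibrated to how much one client can shift the sketch, so the noise scale does not grow with the number of clients.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: n clients each hold a d-dimensional vector with
# L2 norm at most 1; the trusted platform may apply a *public* matrix S.
n_clients, d, k = 1000, 50, 10
X = rng.standard_normal((n_clients, d))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))

S = rng.standard_normal((k, d)) / np.sqrt(k)  # public sketching matrix

# Gaussian mechanism on the linear sketch S @ sum_i x_i.  Replacing one
# client's vector moves the sketch by at most ||S||_2 * 2 in L2, so the
# noise scale depends on S and (eps, delta), but NOT on n_clients.
eps, delta = 1.0, 1e-6
sensitivity = 2.0 * np.linalg.norm(S, 2)  # spectral-norm bound
sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / eps

sketch = S @ X.sum(axis=0)
private_sketch = sketch + rng.normal(scale=sigma, size=k)
```

Because the added error is fixed while the un-noised sketch typically grows with the aggregate signal, utility improves as more clients participate, which is the intuition behind "error independent of the number of clients".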
Paperid:3235
Authors:Ziyan Wang · Zhicheng Zhang · Fei Fang · Yali Du
Abstract:
Designing effective reward functions in multi-agent reinforcement learning (MARL) is a significant challenge, often leading to suboptimal or misaligned behaviors in complex, coordinated environments. We introduce Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality ($\text{M}^3\text{HF}$), a novel framework that integrates multi-phase human feedback of mixed quality into the MARL training process. By involving humans with diverse expertise levels to provide iterative guidance, $\text{M}^3\text{HF}$ leverages both expert and non-expert feedback to continuously refine agents' policies. During training, we strategically pause agent learning for human evaluation, parse feedback using large language models to assign it appropriately, and update reward functions through predefined templates and adaptive weights using weight decay and performance-based adjustments. Our approach enables the integration of nuanced human insights across various levels of quality, enhancing the interpretability and robustness of multi-agent cooperation. Empirical results in challenging environments demonstrate that $\text{M}^3\text{HF}$ significantly outperforms state-of-the-art methods, effectively addressing the complexities of reward design in MARL and enabling broader human participation in the training process.
Paperid:3236
Authors:Narayan Srinivasan · Matthew Sutton · Christopher Drovandi · Leah South
Abstract:
We propose a novel method for measuring the discrepancy between a set of samples and a desired posterior distribution for Bayesian inference. Classical methods for assessing sample quality, like the effective sample size, are not appropriate for scalable Bayesian sampling algorithms, such as stochastic gradient Langevin dynamics, that are asymptotically biased. Instead, the gold standard is to use the kernel Stein discrepancy (KSD), which is itself not scalable given its quadratic cost in the number of samples. The KSD and its faster extensions also typically suffer from the curse of dimensionality and can require extensive tuning. To address these limitations, we develop the polynomial Stein discrepancy (PSD) and an associated goodness-of-fit test. While the new test is not fully convergence-determining, we prove that it detects differences in the first $r$ moments in the Bernstein-von Mises limit. We empirically show that the test has higher power than its competitors in several examples, and at a lower computational cost. Finally, we demonstrate that the PSD can assist practitioners in selecting hyper-parameters of Bayesian sampling algorithms more efficiently than competitors.
Paperid:3237
Authors:Matteo Sesia · Vladimir Svetnik
Abstract:
We present a conformal inference method for constructing lower prediction bounds for survival times from right-censored data, extending recent approaches designed for more restrictive type-I censoring scenarios. The proposed method imputes unobserved censoring times using a machine learning model, and then analyzes the imputed data using a survival model calibrated via weighted conformal inference. This approach is theoretically supported by an asymptotic double robustness property. Empirical studies on simulated and real data demonstrate that our method leads to relatively informative predictive inferences and is especially robust in challenging settings where the survival model may be inaccurate.
Paperid:3238
Authors:Thomas Pethick · Wanyun Xie · Kimon Antonakopoulos · Zhenyu Zhu · Antonio Silveti-Falls · Volkan Cevher
Abstract:
In this work, we study optimization methods that leverage the linear minimization oracle (LMO) over a norm ball. We propose a new stochastic family of algorithms that uses the LMO to adapt to the geometry of the problem and, perhaps surprisingly, show that they can be applied to unconstrained problems. The resulting update rule unifies several existing optimization methods under a single framework. Furthermore, we propose an explicit choice of norm for deep architectures, which, as a side benefit, leads to the transferability of hyperparameters across model sizes. Experimentally, we demonstrate significant speedups on nanoGPT training without any reliance on Adam. The proposed method is memory-efficient, requiring only one set of model weights and one set of gradients, which can be stored in half-precision.
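To make the LMO idea concrete, here is a minimal hedged sketch (not the paper's algorithm): over the $\ell_\infty$ ball, the linear minimization oracle applied to the gradient reduces to a sign-descent step, which can be used directly on an unconstrained problem:

```python
import numpy as np

def lmo_linf(g, radius=1.0):
    # LMO over the l-inf ball: argmin_{||s||_inf <= radius} <g, s> = -radius * sign(g)
    return -radius * np.sign(g)

def lmo_step(x, grad, lr=0.05, radius=1.0):
    # Unconstrained update that replaces the raw gradient with the LMO direction.
    return x + lr * lmo_linf(grad, radius)

# Minimize f(x) = ||x||^2 / 2, whose gradient is x itself.
x = np.array([3.0, -2.0, 0.5])
for _ in range(100):
    x = lmo_step(x, x)
```

Choosing a different norm ball changes the geometry of the step; the $\ell_2$ ball, for instance, would recover normalized gradient descent.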
Paperid:3239
Authors:Ricardo Buitrago Ruiz · Albert Gu
Abstract:
Recently, recurrent models such as state space models and linear attention have become popular due to their linear complexity in the sequence length. Thanks to their recurrent nature, in principle they can process arbitrarily long sequences, but their performance sometimes drops considerably beyond their training context lengths, i.e., they fail to length generalize. In this work, we provide comprehensive empirical and theoretical analysis to support the _unexplored states hypothesis_, which posits that models fail to length generalize when during training they are only exposed to a limited subset of the distribution of all _attainable_ states (i.e., states that would be attained if the recurrence was applied to long sequences). Furthermore, we investigate simple training interventions that aim to increase the coverage of the states that the model is trained on, e.g., by initializing the state with Gaussian noise or with the final state of a different input sequence. With only 500 post-training steps, these interventions enable length generalization for sequences that are orders of magnitude longer than the training context (e.g., $2k\longrightarrow 128k$), thus presenting a simple and efficient way to enable robust length generalization in general recurrent models.
Paperid:3240
Authors:İlker Işık · Ramazan Gokberk Cinbis · Ebru Gol
Abstract:
Language models lack the notion of interchangeable tokens—symbols that are semantically equivalent yet distinct, such as bound variables in formal logic. This limitation prevents generalization to larger vocabularies and hinders the model's ability to recognize alpha-equivalence, where renaming bound variables preserves meaning. In this work, we formalize the corresponding machine learning problem, and introduce alpha-covariance, a metric for evaluating robustness to such transformations. To tackle this task, we propose a dual-part token embedding strategy: a shared component ensures semantic consistency, while a randomized component maintains token distinguishability. Compared to a baseline that relies on alpha-renaming for data augmentation, our approach demonstrates improved generalization to unseen tokens in linear temporal logic solving, propositional logic assignment prediction, and copying with an extendable vocabulary, while introducing a favorable inductive bias for alpha-equivalence. Our findings establish a foundation for designing language models that can learn interchangeable token representations, a crucial step toward more flexible and systematic reasoning in formal domains.
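A minimal sketch of the dual-part embedding idea (dimensions and construction are illustrative assumptions, not the paper's exact design): interchangeable tokens share one semantic component, while a freshly sampled random component keeps them distinguishable:

```python
import numpy as np

rng = np.random.default_rng(0)

d_shared, d_rand = 8, 8
# One shared component per token class (here: "bound variable").
shared_var_component = rng.standard_normal(d_shared)

def embed_variable():
    # Shared part encodes the class semantics; random part distinguishes tokens.
    return np.concatenate([shared_var_component, rng.standard_normal(d_rand)])

x, y = embed_variable(), embed_variable()
```

Two fresh variables then agree on their semantic half but differ on their random half, so renaming one variable to another leaves the shared semantics intact.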
Paperid:3241
Authors:Anqi Mao · Mehryar Mohri · Yutao Zhong
Abstract:
The problem of learning to defer with multiple experts consists of optimally assigning input instances to experts, balancing the tradeoff between their accuracy and computational cost. This is a critical challenge in natural language generation, but also in other fields such as image processing and medical diagnostics. Recent studies have proposed surrogate loss functions to optimize deferral, but challenges remain in ensuring their consistency properties. This paper introduces novel surrogate loss functions and efficient algorithms with strong theoretical learning guarantees. We address open questions regarding realizable $H$-consistency, $H$-consistency bounds, and Bayes-consistency for both single-stage (jointly learning predictor and deferral function) and two-stage (learning only the deferral function with a fixed expert) learning scenarios. For single-stage deferral, we introduce a family of new realizable $H$-consistent surrogate losses and further prove $H$-consistency for a selected member. For two-stage deferral, we derive new surrogate losses that achieve realizable $H$-consistency, $H$-consistency bounds, and Bayes-consistency for the two-expert scenario and, under natural assumptions, the multiple-expert scenario. Additionally, we provide enhanced theoretical guarantees under low-noise assumptions for both scenarios. Finally, we report the results of experiments using our proposed surrogate losses, comparing their performance against existing baselines.
Paperid:3242
Authors:Shang Liu · Yu Pan · Guanting Chen · Xiaocheng Li
Abstract:
The canonical setup of learning a reward model (RM) from human preferences with binary feedback discards potentially useful samples (such as "tied" between the two responses) and loses fine-grained information (such as "slightly better"). This paper proposes a framework for learning RMs under ordinal feedback, generalizing binary feedback to arbitrary granularity. We first identify a marginal unbiasedness condition, which generalizes the existing assumption of the binary feedback. The condition is validated via the sociological concept called the "wisdom of the crowd". Under this condition, we develop a natural probability model and prove the benefits of fine-grained feedback in terms of reducing the Rademacher complexity, which may be of independent interest to another problem: the bias-variance trade-off in knowledge distillation. The framework also sheds light on designing guidelines for human annotators. Our numerical experiments validate that: (1) fine-grained feedback leads to better RM learning for both in- and out-of-distribution settings; (2) incorporating a certain proportion of tied samples boosts RM learning.
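One simple way to picture ordinal feedback is a Bradley-Terry-style loss with soft targets; the sketch below is a hedged illustration (the label-to-target mapping and the loss form are assumptions for exposition, not the paper's probability model):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Binary feedback uses targets y in {0, 1}; ordinal feedback maps finer
# labels to soft targets in [0, 1] (illustrative values).
ORDINAL_TARGETS = {"much worse": 0.0, "worse": 0.25, "tie": 0.5,
                   "better": 0.75, "much better": 1.0}

def ordinal_rm_loss(r_a, r_b, label):
    # r_a, r_b: scalar reward-model scores for responses A and B.
    y = ORDINAL_TARGETS[label]
    p = sigmoid(r_a - r_b)   # model's probability that A is preferred
    eps = 1e-12
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```

Under this sketch a "tie" pulls the two scores together, so tied samples carry training signal instead of being discarded.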
Paperid:3243
Authors:Juno Kim · Denny Wu · Jason Lee · Taiji Suzuki
Abstract:
A key paradigm to improve the reasoning capabilities of large language models (LLMs) is to allocate more inference-time compute to search against a verifier or reward model. This process can then be utilized to refine the pretrained model or distill its reasoning patterns into more efficient models. In this paper, we study inference-time computation by viewing chain-of-thought (CoT) generation as a metastable Markov process: easy reasoning steps (e.g., algebraic manipulations) form densely connected clusters, while hard reasoning steps (e.g., applying a relevant theorem) create sparse, low-probability edges between clusters, leading to phase transitions at longer timescales. Under this framework, we prove that implementing a search protocol that rewards sparse edges improves CoT by decreasing the expected number of steps to reach different clusters. In contrast, we establish a limit on reasoning capability when the model is restricted to local information of the pretrained graph. We also show that the information gained by search can be utilized to obtain a better reasoning model: (1) the pretrained model can be directly finetuned to favor sparse edges via policy gradient methods, and moreover (2) a compressed \emph{metastable representation} of the reasoning dynamics can be distilled into a smaller, more efficient model.
Paperid:3244
Authors:Santhosh Karnik · Anna Veselovska · Mark Iwen · Felix Krahmer
Abstract:
We provide a rigorous analysis of implicit regularization in an overparametrized tensor factorization problem beyond the lazy training regime. For matrix factorization problems, this phenomenon has been studied in a number of works. A particular challenge has been to design universal initialization strategies which provably lead to implicit regularization in gradient-descent methods. At the same time, it has been argued by Cohen et al. (2016) that more general classes of neural networks can be captured by considering tensor factorizations. However, in the tensor case, implicit regularization has only been rigorously established for gradient flow or in the lazy training regime. In this paper, we prove the first tensor result of its kind for gradient descent rather than gradient flow. We focus on the tubal tensor product and the associated notion of low tubal rank, encouraged by the relevance of this model for image data. We establish that gradient descent in an overparametrized tensor factorization model with a small random initialization exhibits an implicit bias towards solutions of low tubal rank. Our theoretical findings are illustrated in an extensive set of numerical simulations showcasing the dynamics predicted by our theory as well as the crucial role of using a small random initialization.
Paperid:3245
Authors:Huanzhi Mao · Fanjia Yan · Charlie Ji · Vishnu Suresh · Shishir G. Patil · Ion Stoica · Joseph E Gonzalez
Abstract:
Function calling, also known as tool use, refers to an LLM's ability to invoke external functions, APIs, or predefined tools in response to user queries—an essential capability for agentic LLM applications. Despite the importance of function calling, there isn't a standard benchmark to evaluate function calling abilities in a wide range of real-world settings. We present the Function Calling Leaderboard (FCL), a comprehensive benchmark designed to evaluate function calling capabilities in a wide range of settings. The FCL benchmark evaluates serial and parallel function calls, across various programming languages, using a novel Abstract Syntax Tree (AST) evaluation method that can easily scale to thousands of functions. We construct the benchmark using a combination of manually curated and user-contributed tasks. Finally, the FCL benchmark evaluates the ability of models to abstain and reason in stateful multi-step agentic settings. Evaluating a large range of models, we find interesting trends in model function calling abilities. With FCL, we are able to assess the function-calling capabilities of different models across various categories, compare their performance in both prompting and native function-calling modes, and quantify potential training data contamination.
Paperid:3246
Authors:Joan Serrà · Recep Oguz Araz · Dmitry Bogdanov · Yuki Mitsufuji
Abstract:
Detecting musical versions (different renditions of the same piece) is a challenging task with important applications. Because of the nature of the available ground truth, existing approaches match musical versions at the track level (e.g., whole song). However, most applications require matching them at the segment level (e.g., 20s chunks). In addition, existing approaches resort to classification and triplet losses, disregarding more recent losses that could bring meaningful improvements. In this paper, we propose a method to learn from weakly annotated segments, together with a contrastive loss variant that outperforms well-studied alternatives. The former is based on pairwise segment distance reductions, while the latter modifies an existing loss following decoupling, hyper-parameter, and geometric considerations. With these two elements, we not only achieve state-of-the-art results in the standard track-level evaluation, but also obtain a breakthrough performance in a segment-level evaluation. We believe that, due to the generality of the challenges addressed here, the proposed methods may find utility in domains beyond audio or musical version matching.
Paperid:3247
Authors:Zhilong Zhang · Tian Xu · Xinghao Du · Xingchen Cao · Yihao Sun · Yang Yu
Abstract:
In sequential decision-making, the reward function serves as the primary supervision signal, guiding agents to acquire the desired behaviors. Traditional reward modeling methods rely heavily on human expertise, limiting their scalability. Automated preference generation from suboptimal demonstrations has emerged as a promising alternative to address this limitation. This approach first generates preference data from suboptimal demonstrations and then trains reward models based on these preferences. Despite its potential, existing methods often struggle to generate preference data with sufficient coverage, limiting the accuracy and generalizability of the resulting reward models. To overcome this limitation, we propose APEC (Automated Preference generation with Enhanced Coverage), a novel method that improves the coverage of preference data. APEC achieves this by selecting policy pairs with significantly different iteration indices from the whole adversarial imitation learning process. We provide a theoretical analysis to validate that the selected policy pairs provably hold preference relationships. Experimental results demonstrate that APEC consistently outperforms baseline methods in generating preferences with broader coverage across both vector-based and pixel-based control tasks. Consequently, the reward models trained with APEC align more closely with ground-truth rewards, yielding improved policy performance.
Paperid:3248
Authors:Dantong Niu · Yuvan Sharma · Haoru Xue · Giscard Biamby · Junyi Zhang · Ziteng Ji · Trevor Darrell · Roi Herzig
Abstract:
Foundation models pretrained on massive unlabeled datasets have revolutionized natural language and computer vision, exhibiting remarkable generalization capabilities, thus highlighting the importance of pre-training. Yet, efforts in robotics have struggled to achieve similar success, limited by either the need for costly robotic annotations or the lack of representations that effectively model the physical world. In this paper, we introduce ARM4R, an Auto-regressive Robotic Model that leverages low-level 4D Representations learned from human video data to yield a better pre-trained robotic model. Specifically, we focus on utilizing 3D point tracking representations from videos derived by lifting 2D representations into 3D space via monocular depth estimation across time. These 4D representations maintain a shared geometric structure between the points and robot state representations up to a linear transformation, enabling efficient transfer learning from human video data to low-level robotic control. Our experiments show that ARM4R can transfer efficiently from human video data to robotics and consistently improves performance on tasks across various robot environments and configurations.
Paperid:3249
Authors:Daria Lioubashevski · Tomer Schlank · Gabriel Stanovsky · Ariel Goldstein
Abstract:
Uncovering the inner mechanisms of Transformer models offers insights into how they process and represent information. In this work, we analyze the computation performed by Transformers in the layers after the top-1 prediction remains fixed, known as the “saturation event”. We expand this concept to top-k tokens, demonstrating that similar saturation events occur across language, vision, and speech models. We find that these events occur in order of the corresponding tokens’ ranking, i.e., the model first decides on the top-ranking token, then the second-highest-ranking token, and so on. This phenomenon seems intrinsic to the Transformer architecture, occurring across different variants, and even in untrained Transformers. We propose that these events reflect task transitions, where determining each token corresponds to a discrete task. We show that it is possible to predict the current task from hidden-layer embeddings, and demonstrate that we can cause the model to switch to the next task via intervention. Leveraging our findings, we introduce a token-level early-exit strategy, surpassing existing methods in balancing performance and efficiency, and show how to exploit saturation events for better language modeling.
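The early-exit idea can be sketched as follows (illustrative code, not the paper's implementation): given each layer's top-1 token id, the saturation event is the first layer after which the prediction never changes, and later layers could in principle be skipped:

```python
def saturation_layer(top1_per_layer):
    # First index i such that the top-1 prediction is already fixed at
    # the final answer for all layers >= i.
    final = top1_per_layer[-1]
    for i, tok in enumerate(top1_per_layer):
        if all(t == final for t in top1_per_layer[i:]):
            return i
    return len(top1_per_layer) - 1

# Toy per-layer top-1 token ids: the prediction saturates at layer 3.
exit_at = saturation_layer([7, 7, 3, 5, 5, 5, 5, 5])
```

In practice an exit rule cannot peek at later layers, so a deployed strategy would instead predict saturation from the current hidden state, as the abstract describes.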
Paperid:3250
Authors:Shimin Zhang · Ziyuan Ye · Yinsong Yan · Zeyang Song · Yujie Wu · Jibin Wu
Abstract:
Determining the similarity between dynamical systems remains a long-standing challenge in both machine learning and neuroscience. Recent works based on Koopman operator theory have proven effective in analyzing dynamical similarity by examining discrepancies in the Koopman spectrum. However, they are limited in analyzing the similarity between highly nonlinear systems with cross-scale coupled dynamics, primarily due to the absence of an effective mechanism for capturing multi-scale dynamics when approximating the Koopman spectrum. To overcome this limitation, we propose a novel framework named KoopSTD that approximates the Koopman Spectrum of dynamical systems with Timescale Decoupling. To achieve a precise description of the underlying dynamics, KoopSTD further filters out spurious modes, thereby ensuring a faithful metric of dynamical similarity. Our extensive experiments on physical and neural systems validate the effectiveness, scalability, and robustness of the proposed KoopSTD, demonstrating its superiority over existing similarity metrics. Finally, we apply KoopSTD to explore two open-ended research questions in neuroscience and large language models, highlighting its potential to facilitate future scientific and engineering discoveries.
Paperid:3251
Authors:Evripidis Bampis · Bruno Escoffier · Dimitris Fotakis · Panagiotis Patsilinakos · Michalis Xefteris
Abstract:
We consider a learning-augmented framework for NP-hard permutation problems. The algorithm has access to predictions telling, given a pair $u,v$ of elements, whether $u$ is before $v$ or not in an optimal solution. Building on the work of Braverman and Mossel (SODA 2008), we show that for a class of optimization problems including scheduling, network design and other graph permutation problems, these predictions allow solving them in polynomial time with high probability, provided that predictions are true with probability at least $1/2+\epsilon$. Moreover, this can be achieved with a parsimonious access to the predictions.
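To give a feel for such pairwise predictions (this is a naive win-counting heuristic for exposition, not the algorithm of the paper or of Braverman and Mossel): query the noisy pairwise oracle on every pair and rank elements by the number of comparisons they win:

```python
def rank_by_wins(items, oracle):
    # oracle(u, v) is a (possibly noisy) prediction of "u is before v".
    wins = {u: 0 for u in items}
    for i, u in enumerate(items):
        for v in items[i + 1:]:
            if oracle(u, v):
                wins[u] += 1
            else:
                wins[v] += 1
    # More wins = earlier in the recovered permutation.
    return sorted(items, key=lambda u: -wins[u])

# With a perfect oracle the true order is recovered exactly.
order = rank_by_wins([3, 1, 2], lambda u, v: u < v)
```

With an oracle that is correct only with probability $1/2+\epsilon$, aggregating many independent comparisons per element is what lets the ranking concentrate around the true order.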
Paperid:3252
Authors:Wei Zhuo · Han Yu · Guang Tan · Xiaoxiao Li
Abstract:
Graph Neural Networks (GNNs) have shown remarkable success in learning from graph-structured data. However, their application to directed graphs (digraphs) presents unique challenges, primarily due to the inherent asymmetry in node relationships. Traditional GNNs are adept at capturing unidirectional relations but fall short in encoding the mutual path dependencies between nodes, such as asymmetrical shortest paths typically found in digraphs. Recognizing this gap, we introduce Commute Graph Neural Networks (CGNN), an approach that seamlessly integrates node-wise commute time into the message passing scheme. The cornerstone of CGNN is an efficient method for computing commute time using a newly formulated digraph Laplacian. Commute time is then integrated into the neighborhood aggregation process, with neighbor contributions weighted according to their respective commute time to the central node in each layer. This enables CGNN to directly capture the mutual, asymmetric relationships in digraphs. Extensive experiments confirm the superior performance of CGNN. Source code of CGNN is anonymously available here.
Paperid:3253
Authors:Ismail Alkhouri · Shijun Liang · Cheng-Han Huang · Jimmy Dai · Qing Qu · Saiprasad Ravishankar · Rongrong Wang
Abstract:
Diffusion models (DMs) are a class of generative models that allow sampling from a distribution learned over a training set. When applied to solving inverse imaging problems (IPs), the reverse sampling steps of DMs are typically modified to approximately sample from a measurement-conditioned distribution in the image space. However, these modifications may be unsuitable for certain settings (such as in the presence of measurement noise) and non-linear tasks, as they often struggle to correct errors from earlier sampling steps and generally require a large number of optimization and/or sampling steps. To address these challenges, we state three conditions for achieving measurement-consistent diffusion trajectories. Building on these conditions, we propose a new optimization-based sampling method that not only enforces the standard data manifold measurement consistency and forward diffusion consistency, as seen in previous studies, but also incorporates step-wise and network-regularized backward diffusion consistency that maintains a diffusion trajectory by optimizing over the input of the pre-trained model at every sampling step. By enforcing these conditions, either implicitly or explicitly, our sampler requires significantly fewer reverse steps. Therefore, we refer to our accelerated method as \textbf{S}tep-w\textbf{i}se \textbf{T}riple-\textbf{Co}nsistent Sa\textbf{m}pling (\textbf{SITCOM}). Compared to existing state-of-the-art baseline methods, under different levels of measurement noise, our extensive experiments across six linear and three non-linear image restoration and reconstruction tasks demonstrate that SITCOM achieves competitive or superior results in terms of standard image similarity metrics and required run-time.
Paperid:3254
Authors:Yilin Ye · Junchao Huang · Xingchen ZENG · Jiazhi Xia · Wei Zeng
Abstract:
Cross-modal embeddings form the foundation for multi-modal models. However, visualization methods for interpreting cross-modal embeddings have been primarily confined to traditional dimensionality reduction (DR) techniques like PCA and t-SNE. These DR methods primarily focus on feature distributions within a single modality, whilst failing to incorporate metrics (e.g., CLIPScore) across multiple modalities. This paper introduces AKRMap, a new DR technique designed to visualize cross-modal embedding metrics with enhanced accuracy by learning kernel regression of the metric landscape in the projection space. Specifically, AKRMap constructs a supervised projection network guided by a post-projection kernel regression loss, and employs adaptive generalized kernels that can be jointly optimized with the projection. This approach enables AKRMap to efficiently generate visualizations that capture complex metric distributions, while also supporting interactive features such as zoom and overlay for deeper exploration. Quantitative experiments demonstrate that AKRMap outperforms existing DR methods in generating more accurate and trustworthy visualizations. We further showcase the effectiveness of AKRMap in visualizing and comparing cross-modal embeddings for text-to-image models.
Paperid:3255
Authors:Shib S Dasgupta · Michael Boratko · Andrew McCallum
Abstract:
Personalized item recommendation typically suffers from data sparsity, which is most often addressed by learning vector representations of users and items via low-rank matrix factorization. While this effectively densifies the matrix by assuming users and movies can be represented by linearly dependent latent features, it does not capture more complicated interactions. For example, vector representations struggle with set-theoretic relationships, such as negation and intersection, e.g. recommending a movie that is “comedy and action, but not romance”. In this work, we formulate the problem of personalized item recommendation as matrix completion where rows are set-theoretically dependent. To capture this set-theoretic dependence we represent each user and attribute by a hyperrectangle or box (i.e. a Cartesian product of intervals). Box embeddings can intuitively be understood as trainable Venn diagrams, and thus not only inherently represent similarity (via the Jaccard index), but also naturally and faithfully support arbitrary set-theoretic relationships. Queries involving set-theoretic constraints can be efficiently computed directly on the embedding space by performing geometric operations on the representations. We empirically demonstrate the superiority of box embeddings over vector-based neural methods on both simple and complex item recommendation queries by up to 30% overall.
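The geometric intuition can be sketched directly (a hedged toy example; real box embeddings are trained, and these coordinates are made up): each entity is an axis-aligned box, and similarity is the Jaccard index computed from intersection and union volumes:

```python
import numpy as np

def volume(lo, hi):
    # Volume of the box [lo, hi]; empty if any side is non-positive.
    return float(np.prod(np.clip(hi - lo, 0.0, None)))

def jaccard(box_a, box_b):
    (lo_a, hi_a), (lo_b, hi_b) = box_a, box_b
    # Intersection of axis-aligned boxes is itself an axis-aligned box.
    inter = volume(np.maximum(lo_a, lo_b), np.minimum(hi_a, hi_b))
    union = volume(lo_a, hi_a) + volume(lo_b, hi_b) - inter
    return inter / union if union > 0 else 0.0

# Illustrative 2D attribute boxes (coordinates are invented for the demo).
comedy = (np.array([0.0, 0.0]), np.array([2.0, 1.0]))
action = (np.array([1.0, 0.0]), np.array([3.0, 1.0]))
```

A query like "comedy and action" then corresponds to intersecting the two boxes, and "not romance" to excluding a third box, all computed with the same min/max operations.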
Paperid:3256
Authors:Enze Xie · Junsong Chen · Yuyang Zhao · Jincheng YU · Ligeng Zhu · Yujun Lin · Zhekai Zhang · Muyang Li · Junyu Chen · Han Cai · Bingchen Liu · Zhou Daquan · Song Han
Abstract:
This paper presents SANA-1.5, a linear Diffusion Transformer for efficient scaling in text-to-image generation. Building upon SANA-1.0, we introduce three key innovations: (1) Efficient Training Scaling: a depth-growth paradigm that enables scaling from 1.6B to 4.8B parameters with significantly reduced computational resources, combined with a memory-efficient 8-bit optimizer. (2) Model Depth Pruning: a block importance analysis technique for efficient model compression to arbitrary sizes with minimal quality loss. (3) Inference-time Scaling: a repeated sampling strategy that trades computation for model capacity, enabling smaller models to match larger model quality at inference time. Through these strategies, SANA-1.5 achieves a text-image alignment score of 0.72 on GenEval, which can be further improved to 0.80 through inference scaling, establishing a new SoTA on the GenEval benchmark. These innovations enable efficient model scaling across different compute budgets while maintaining high quality, making high-quality image generation more accessible.
Paperid:3257
Authors:Unai Fischer Abaigar · Christoph Kern · Juan Perdomo
Abstract:
Machine learning is increasingly used in government programs to identify and support the most vulnerable individuals, prioritizing assistance for those at greatest risk over optimizing aggregate outcomes. This paper examines the welfare impacts of prediction in equity-driven contexts, and how they compare to other policy levers, such as expanding bureaucratic capacity. Through mathematical models and a real-world case study on long-term unemployment amongst German residents, we develop a comprehensive understanding of the relative effectiveness of prediction in surfacing the worst-off. Our findings provide clear analytical frameworks and practical, data-driven tools that empower policymakers to make principled decisions when designing these systems.
Paperid:3258
Authors:Han-Byul Kim · Duc Hoang · Arnav Kundu · Mohammad Samragh · Minsik Cho
Abstract:
With the rapid expansion in the scale of large language models (LLMs), enabling efficient distributed inference across multiple computing units has become increasingly critical. However, communication overheads from popular distributed inference techniques such as Tensor Parallelism pose a significant challenge to achieving scalability and low latency. Therefore, we introduce a novel optimization technique, SyncPoint Drop (SPD), to reduce communication overheads in tensor parallelism by selectively dropping synchronization on attention outputs. In detail, we first propose a block design that allows execution to proceed without communication through SPD. Second, we apply different SPD strategies to attention blocks based on their sensitivity to the model accuracy. The proposed methods effectively alleviate communication bottlenecks while minimizing accuracy degradation during LLM inference, offering a scalable solution for diverse distributed environments: SPD offered about 20\% overall inference latency reduction with < 1\% accuracy regression for LLaMA2-70B inference over 8 GPUs.
Paperid:3259
Authors:Jihwan Oh · Minchan Jeong · Jongwoo Ko · Se-Young Yun
Abstract:
Large Language Models (LLMs) solve complex problems using training-free methods like prompt engineering and in-context learning, yet ensuring reasoning correctness remains challenging. While self-correction methods such as self-consistency and self-refinement aim to improve reliability, they often reinforce biases due to the lack of effective feedback mechanisms. Multi-Agent Debate (MAD) has emerged as an alternative, but we identify two key limitations: bias reinforcement, where debate amplifies model biases instead of correcting them, and lack of perspective diversity, as all agents share the same model and reasoning patterns, limiting true debate effectiveness. To systematically evaluate these issues, we introduce $\textit{MetaNIM Arena}$, a benchmark designed to assess LLMs in adversarial strategic decision-making, where dynamic interactions influence optimal decisions. To overcome MAD’s limitations, we propose $\texttt{\textbf{DReaMAD}}$ ($\textbf{D}$iverse $\textbf{Rea}$soning via $\textbf{M}$ulti-$\textbf{A}$gent $\textbf{D}$ebate with Refined Prompt), a novel framework that (1) refines LLMs’ strategic prior knowledge to improve reasoning quality and (2) promotes diverse viewpoints within a single model by systematically modifying prompts, reducing bias. Empirical results show that $\texttt{\textbf{DReaMAD}}$ significantly improves decision accuracy, reasoning diversity, and bias mitigation across multiple strategic tasks, establishing it as a more effective approach for LLM-based decision-making.
Paperid:3260
Authors:Ke Zhu · Jianing Chu · Ilya Lipkovich · Wenyu Ye · Shu Yang
Abstract:
Individualized treatment rules/recommendations (ITRs) aim to improve patient outcomes by tailoring treatments to the characteristics of each individual. However, in high-dimensional treatment settings, existing methods face significant challenges due to data sparsity within treatment groups and highly unbalanced covariate distributions across groups. To address these challenges, we propose a novel calibration-weighted treatment fusion procedure that robustly balances covariates across treatment groups and fuses similar treatments using a penalized working model. The fusion procedure ensures the recovery of latent treatment group structures when either the calibration model or the outcome model is correctly specified. In the fused treatment space, practitioners can seamlessly apply state-of-the-art ITR learning methods with the flexibility to utilize a subset of covariates, thereby achieving robustness while addressing practical concerns such as fairness. We establish theoretical guarantees, including consistency, the oracle property of treatment fusion, and regret bounds when integrated with multi-armed ITR learning methods such as policy trees. Simulation studies show superior group recovery and policy value compared to existing approaches. We illustrate the practical utility of our method using EHR-derived data from patients with Chronic Lymphocytic Leukemia and Small Lymphocytic Lymphoma.
Paperid:3261
Authors:Louis Serrano · Armand Kassaï Koupaï · Thomas Wang · Pierre ERBACHER · patrick gallinari
Abstract:
Solving time-dependent parametric partial differential equations (PDEs) is challenging for data-driven methods, as these models must adapt to variations in parameters such as coefficients, forcing terms, and initial conditions. State-of-the-art neural surrogates perform adaptation through gradient-based optimization and meta-learning to implicitly encode the variety of dynamics from observations. This often comes with increased inference complexity. Inspired by the in-context learning capabilities of large language models (LLMs), we introduce Zebra, a novel generative auto-regressive transformer designed to solve parametric PDEs without requiring gradient adaptation at inference. By leveraging in-context information during both pre-training and inference, Zebra dynamically adapts to new tasks by conditioning on input sequences that incorporate context example trajectories. As a generative model, Zebra can be used to generate new trajectories and allows quantifying the uncertainty of the predictions. We evaluate Zebra across a variety of challenging PDE scenarios, demonstrating its adaptability, robustness, and superior performance compared to existing approaches.
Paperid:3262
Authors:Tingchen Fu · Mrinank Sharma · Phil Torr · Shay Cohen · David Krueger · Fazl Barez
Abstract:
Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 22 widely used models. Our findings reveal concerning trends: (1) Scaling up parameter size does not always enhance resilience against poisoning attacks, and the influence on model resilience varies among different model suites. (2) There exists a log-linear relationship between the effects of the attack and the data poison ratio; (3) The effect of data poisoning can generalize to extrapolated triggers that are not included in the poisoned data. These results expose weaknesses in current preference learning techniques, highlighting the urgent need for more robust defenses against malicious models and data manipulation.
Paperid:3263
Authors:Yifan HAO · xingyuan pan · Hanning Zhang · Chenlu Ye · Rui Pan · Tong Zhang
Abstract:
Supervised fine-tuning (SFT) on domain-specific data is the dominant approach for adapting foundation models to specialized tasks. However, it has been observed that SFT models tend to forget knowledge acquired during pretraining. In vision models, ensembling a pretrained model with its fine-tuned counterpart has been shown to mitigate this issue. In this work, we demonstrate that the same holds for language models, and, more strikingly, we observe an overadaptation phenomenon: the ensemble model not only retains general knowledge from the foundation model but also outperforms the fine-tuned model even on the fine-tuning domain itself. Despite the empirical success of ensembling, a theoretical understanding of its benefits remains underexplored. We develop a formal theoretical analysis of the overadaptation phenomenon. Ensembling mitigates this by balancing two primary sources of error: bias, caused by insufficient fine-tuning, and variance, introduced by overfitting to fine-tuning data. While regularization techniques aim to address this trade-off, we show that ensembling provides a more effective solution. We analyze this phenomenon in over-parameterized linear settings and demonstrate that interpolating between pretrained and fine-tuned weights significantly improves performance. These findings offer theoretical justification for the observed advantages of model ensembling, supported by empirical experiments consistent with our analysis.
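The weight interpolation this abstract analyzes can be sketched in a few lines. The function name and toy parameter values below are illustrative, not the authors' implementation; real use would operate on model state dicts:

```python
def interpolate_weights(pretrained, finetuned, alpha=0.5):
    """Element-wise convex combination of two weight dictionaries:
    (1 - alpha) * pretrained + alpha * fine-tuned."""
    assert pretrained.keys() == finetuned.keys()
    return {
        name: [(1 - alpha) * p + alpha * f
               for p, f in zip(pretrained[name], finetuned[name])]
        for name in pretrained
    }

# Toy two-layer "model" with scalar parameters per layer.
pre = {"layer1": [0.0, 1.0], "layer2": [2.0, -1.0]}
fin = {"layer1": [1.0, 3.0], "layer2": [0.0, 1.0]}
ens = interpolate_weights(pre, fin, alpha=0.5)  # midpoint of each parameter
```

Sweeping `alpha` from 0 (pure pretrained) to 1 (pure fine-tuned) traces the bias-variance trade-off the analysis describes.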
Paperid:3264
Authors:Yogesh Verma · Amauri Souza · Vikas Garg
Abstract:
The local inductive bias of message-passing graph neural networks (GNNs) hampers their ability to exploit key structural information (e.g., connectivity and cycles). Positional encoding (PE) and Persistent Homology (PH) have emerged as two promising approaches to mitigate this issue. PE schemes endow GNNs with location-aware features, while PH methods enhance GNNs with multiresolution topological features. However, a rigorous theoretical characterization of the relative merits and shortcomings of PE and PH has remained elusive. We bridge this gap by establishing that neither paradigm is more expressive than the other, providing novel constructions where one approach fails but the other succeeds. Our insights inform the design of a novel learnable method, PiPE (short for ‘Persistence informed by PE’), that is provably more expressive than PH and PE. PiPE demonstrates strong performance across a variety of tasks (e.g., molecule property prediction, graph classification, and out-of-distribution generalization), thereby advancing the frontiers of graph representation learning.
Paperid:3265
Authors:David Fleischer · David A Stephens · Archer Yang
Abstract:
We propose a computationally efficient alternative to generalized random forests (GRFs) for estimating heterogeneous effects in large dimensions. While GRFs rely on a gradient-based splitting criterion, which in large dimensions is computationally expensive and unstable, our method introduces a fixed-point approximation that eliminates the need for Jacobian estimation. This gradient-free approach preserves GRF’s theoretical guarantees of consistency and asymptotic normality while significantly improving computational efficiency. We demonstrate that our method runs several times faster than standard GRFs without compromising statistical accuracy. Experiments on both simulated and real-world data validate our approach. Our findings suggest that the proposed method is a scalable alternative for localized effect estimation in machine learning and causal inference applications.
Paperid:3266
Authors:Yiting Liu · Zhi-Hong Deng
Abstract:
In-context learning has become a standard approach for utilizing language models. However, selecting and processing suitable demonstration examples can be challenging and time-consuming, especially when dealing with a large number of examples. We propose exploring the potential of activation space using Iterative Vectors (IVs), a technique designed to enhance in-context performance by simulating gradient updates during inference. IVs extract and iteratively refine the gradients within a language model, applying them during inference without requiring backpropagation at any stage. We evaluate IVs across various tasks using four popular models and observe significant improvements. Our findings suggest that in-context activation steering can be a promising direction, opening new avenues for future research.
Paperid:3267
Authors:Dan Busbridge · Amitis Shidani · Floris Weers · Jason Ramapuram · Etai Littwin · Russell Webb
Abstract:
We provide a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings reduce the risks associated with using distillation at scale; compute allocation for both the teacher and student models can now be done to maximize student performance. We provide compute-optimal distillation recipes for when 1) a teacher exists, or 2) a teacher needs training. If many students are to be distilled, or a teacher already exists, distillation outperforms supervised pretraining until a compute level which grows predictably with student size. If one student is to be distilled and a teacher also needs training, supervised learning should be done instead. Additionally, we provide insights across our large-scale study of distillation, which can improve our understanding of distillation and inform experimental design.
Paperid:3268
Authors:Kevin Tan · Wei Fan · Yuting Wei
Abstract:
Actor-critic algorithms have become a cornerstone in reinforcement learning (RL), leveraging the strengths of both policy-based and value-based methods. Despite recent progress in understanding their statistical efficiency, no existing work has successfully learned an $\epsilon$-optimal policy with a sample complexity of $O(1/\epsilon^2)$ trajectories with general function approximation when strategic exploration is necessary. We address this open problem by introducing a novel actor-critic algorithm that attains a sample complexity of $O(dH^5 \log|\mathcal{A}|/\epsilon^2 + d H^4 \log|\mathcal{F}|/ \epsilon^2)$ trajectories, and accompanying $\sqrt{T}$ regret when the Bellman eluder dimension $d$ does not increase with $T$ at more than a $\log T$ rate. Here, $\mathcal{F}$ is the critic function class, and $\mathcal{A}$ is the action space. Our algorithm integrates optimism, off-policy critic estimation targeting the optimal Q-function, and rare-switching policy resets. We extend this to the setting of Hybrid RL, showing that initializing the critic with offline data yields sample efficiency gains compared to purely offline or online RL.
Paperid:3269
Authors:Xiaole Zhang · Peiyu Zhang · Xiongye Xiao · Shixuan Li · Vasileios Tzoumas · Vijay Gupta · Paul Bogdan
Abstract:
Integer-order calculus often falls short in capturing the long-range dependencies and memory effects found in many real-world processes. Fractional calculus addresses these gaps via fractional-order integrals and derivatives, but fractional-order dynamical systems pose substantial challenges in system identification and optimal control due to the lack of standard control methodologies. In this paper, we theoretically derive the optimal control via linear quadratic regulator (LQR) for fractional-order linear time-invariant (FOLTI) systems and develop an end-to-end deep learning framework based on this theoretical foundation. Our approach establishes a rigorous mathematical model, derives analytical solutions, and incorporates deep learning to achieve data-driven optimal control of FOLTI systems. Our key contributions include: (i) proposing an innovative system identification and control strategy for FOLTI systems, (ii) developing the first end-to-end data-driven learning framework, Fractional-Order Learning for Optimal Control (FOLOC), that learns control policies from observed trajectories, and (iii) deriving a theoretical analysis of sample complexity to quantify the number of samples required for accurate optimal control in complex real-world problems. Experimental results indicate that our method accurately approximates fractional-order system behaviors without relying on Gaussian noise assumptions, pointing to promising avenues for advanced optimal control.
Paperid:3270
Authors:Zhiyong Wang · Chen Yang · John C. S. Lui · Dongruo Zhou
Abstract:
In this work, we study offline reinforcement learning (RL) with the zero-shot generalization property (ZSG), where the agent has access to an offline dataset including experiences from different environments, and the goal of the agent is to train a policy over the training environments which performs well on test environments without further interaction. Existing work showed that classical offline RL fails to generalize to new, unseen environments. We propose pessimistic empirical risk minimization (PERM) and pessimistic proximal policy optimization (PPPO), which leverage pessimistic policy evaluation to guide policy learning and enhance generalization. We show that both PERM and PPPO are capable of finding a near-optimal policy with ZSG. Our result serves as a first step in understanding the foundation of the generalization phenomenon in offline reinforcement learning.
Paperid:3271
Authors:Siqi Guo · Ilgee Hong · Vicente Balmaseda · Changlong Yu · Liang Qiu · Xin Liu · Haoming Jiang · Tuo Zhao · Tianbao Yang
Abstract:
Supervised fine-tuning (SFT) followed by preference optimization (PO), denoted by SFT$\to$PO, has become the standard for improving pretrained large language models (LLMs), with PO demonstrating significant performance gains. However, PO methods rely on either human-labeled preference data or a strong reward model to generate preference data. *Can we fine-tune LLMs without preference data or reward models while achieving competitive performance to SFT$\to$PO?* We address this question by introducing Discriminative Fine-Tuning (DFT), a novel approach that eliminates the need for preference data. Unlike SFT, which employs a generative approach and overlooks negative data, DFT adopts a discriminative paradigm that pushes up the likelihood of positive answers while suppressing negative ones. Our contributions include: (i) a discriminative probabilistic framework for fine-tuning LLMs by explicitly modeling the discriminative likelihood of an answer among all possible outputs given an input; (ii) efficient algorithms to optimize this discriminative likelihood; and (iii) extensive experiments demonstrating DFT’s effectiveness, achieving performance comparable to, if not better than, SFT$\to$PO.
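The discriminative likelihood in contribution (i) amounts to a softmax over candidate answers, pushing up the positive while suppressing the rest. A minimal numerically stable sketch; the function name and toy scores are hypothetical, and a real implementation would score full token sequences:

```python
import math

def discriminative_log_likelihood(scores, positive_idx):
    """Log-probability of the positive answer under a softmax over all
    candidate answers' scores, computed with the log-sum-exp trick."""
    m = max(scores)  # shift for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[positive_idx] - log_z

# Toy scores for three candidate answers; index 0 is the positive one.
ll = discriminative_log_likelihood([2.0, 0.0, -1.0], 0)
```

Maximizing this objective raises the positive score relative to the negatives, which is the discriminative (rather than purely generative) behavior the abstract contrasts with SFT.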
Paperid:3272
Authors:Yabo Liu · Jinghua Wang · Waikeung Wong · Xiaoling Luo · Chengliang Liu · Yong Xu
Abstract:
Segment Anything Model (SAM) has demonstrated remarkable zero-shot segmentation capabilities across various visual tasks. However, its performance degrades significantly when deployed in new target domains with substantial distribution shifts. While existing self-training methods based on fixed teacher-student architectures have shown improvements, they struggle to ensure that the teacher network consistently outperforms the student under severe domain shifts. To address this limitation, we propose a novel Collaborative Mutual Learning Framework for source-free SAM adaptation, leveraging dual networks in a dynamic and cooperative manner. Unlike fixed teacher-student paradigms, our method dynamically assigns the "teacher" and "student" roles by evaluating the reliability of each collaborative network in each training iteration. Our framework incorporates a dynamic mutual learning mechanism with three key components: a direct alignment loss for knowledge transfer, a reverse distillation loss to encourage diversity, and a triplet relationship loss to refine feature representations. These components enhance the adaptation capabilities of the collaborative networks, enabling them to generalize effectively to target domains while preserving their pre-trained knowledge. Extensive experiments on diverse target domains demonstrate that our proposed framework achieves state-of-the-art adaptation performance. Our code is in the appendix and will be publicly available.
Paperid:3273
Authors:Yuwei Niu · Shuo He · Qi Wei · Zongyu Wu · Feng Liu · Lei Feng
Abstract:
While multimodal contrastive learning methods (e.g., CLIP) can achieve impressive zero-shot classification performance, recent research has revealed that these methods are vulnerable to backdoor attacks. To defend against backdoor attacks on CLIP, existing defense methods focus on either the pre-training stage or the fine-tuning stage, which would unfortunately cause high computational costs due to numerous parameter updates and are not applicable in black-box settings. In this paper, we provide the first attempt at a computationally efficient backdoor detection method to defend against backdoored CLIP in the inference stage. We empirically find that the visual representations of backdoored images are insensitive to both benign and malignant changes in class description texts. Motivated by this observation, we propose BDetCLIP, a novel test-time backdoor detection method based on contrastive prompting. Specifically, we first prompt a language model (e.g., GPT-4) to produce class-related description texts (benign) and class-perturbed random texts (malignant) by specially designed instructions. Then, the distribution difference in cosine similarity between images and the two types of class description texts can be used as the criterion to detect backdoor samples. Extensive experiments validate that our proposed BDetCLIP is superior to state-of-the-art backdoor detection methods, in terms of both effectiveness and efficiency.
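The detection criterion can be sketched as a gap in mean cosine similarity between the two prompt types. Everything below is a toy illustration with hand-picked 2-D "embeddings"; real BDetCLIP would use CLIP image/text encoders and GPT-generated prompts:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def detection_score(image_emb, benign_text_embs, malignant_text_embs):
    """Mean cosine similarity to class-related (benign) texts minus mean
    similarity to class-perturbed (malignant) texts. A small gap suggests
    the image representation is insensitive to the text change, i.e. a
    potential backdoor sample."""
    benign = sum(cosine(image_emb, t) for t in benign_text_embs) / len(benign_text_embs)
    malignant = sum(cosine(image_emb, t) for t in malignant_text_embs) / len(malignant_text_embs)
    return benign - malignant

# Toy case: a "clean" image aligned with the benign text shows a large gap;
# a "backdoored" image between the two text directions shows a small one.
clean_gap = detection_score([1.0, 0.0], [[1.0, 0.1]], [[0.0, 1.0]])
trigger_gap = detection_score([0.5, 0.5], [[1.0, 0.1]], [[0.0, 1.0]])
```

Thresholding this gap then separates clean from backdoored inputs at test time.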
Paperid:3274
Authors:Zherui Li · Houcheng Jiang · Hao Chen · Baolong Bi · Zhenhong Zhou · Fei Sun · Junfeng Fang · Xiang Wang
Abstract:
Large language models (LLMs) acquire information from pretraining corpora, but their stored knowledge can become inaccurate or outdated over time. Model editing addresses this challenge by modifying model parameters without retraining, and prevalent approaches leverage hypernetworks to generate these parameter updates. However, they face significant challenges in lifelong editing due to their incompatibility with LLM parameters that dynamically change during the editing process. To address this, we observed that hypernetwork-based lifelong editing aligns with reinforcement learning modeling and proposed RLEdit, an RL-based editing method. By treating editing losses as rewards and optimizing hypernetwork parameters at the full knowledge sequence level, we enable it to precisely capture LLM changes and generate appropriate parameter updates. Our extensive empirical evaluation across several LLMs demonstrates that RLEdit outperforms existing methods in lifelong editing with superior effectiveness and efficiency, achieving a 59.24% improvement while requiring only 2.11% of the time compared to most approaches.
Paperid:3275
Authors:Zhiyang Chen · Hailong Yao · Xia Yin
Abstract:
Multi-objective optimization problems arise widely in various fields. In practice, multi-objective optimization is generally solved by heuristics with tunable parameters that are highly application-specific. Tuning parameters based on real-world instances (a.k.a. algorithm configuration) is generally empirical, without theoretical guarantees. In this work, we establish the theoretical foundation of data-driven multi-objective optimization through the lens of machine learning theory. We provide generalization guarantees on selecting parameters for multi-objective optimization algorithms based on sampled problem instances. Moreover, if the performance metric of the algorithm is the Pareto volume, we can PAC-learn the approximately optimal configuration in polynomial time. We apply our framework to various algorithms, including approximation algorithms, local search, and linear programming. Experiments on multiple problems verify our theoretical findings.
Paperid:3276
Authors:Zeyu Fang · Ming Gu · Sheng Zhou · Jiawei Chen · Qiaoyu Tan · Haishuai Wang · Jiajun Bu
Abstract:
Unsupervised Anomaly Detection (UAD) plays a crucial role in identifying abnormal patterns within data without labeled examples, holding significant practical implications across various domains. Although the individual contributions of representation learning and clustering to anomaly detection are well-established, their interdependencies remain under-explored due to the absence of a unified theoretical framework. Consequently, their collective potential to enhance anomaly detection performance remains largely untapped. To bridge this gap, in this paper, we propose a novel probabilistic mixture model for anomaly detection to establish a theoretical connection among representation learning, clustering, and anomaly detection. By maximizing a novel anomaly-aware data likelihood, representation learning and clustering can effectively reduce the adverse impact of anomalous data and collaboratively benefit anomaly detection. Meanwhile, a theoretically substantiated anomaly score is naturally derived from this framework. Lastly, drawing inspiration from gravitational analysis in physics, we have devised an improved anomaly score that more effectively harnesses the combined power of representation learning and clustering. Extensive experiments, involving 17 baseline methods across 30 diverse datasets, validate the effectiveness and generalization capability of the proposed method, surpassing state-of-the-art methods.
Paperid:3277
Authors:Xingyu Miao · Haoran Duan · Yang Long · Jungong Han
Abstract:
Score Distillation Sampling (SDS) has emerged as a prominent method for text-to-3D generation by leveraging the strengths of 2D diffusion models. However, SDS is limited to generation tasks and lacks the capability to edit existing 3D assets. Conversely, variants of SDS that introduce editing capabilities often cannot generate new 3D assets effectively. In this work, we observe that the processes of generation and editing within SDS and its variants have unified underlying gradient terms. Building on this insight, we propose Unified Distillation Sampling (UDS), a method that seamlessly integrates both the generation and editing of 3D assets. Essentially, UDS refines the gradient terms used in vanilla SDS methods, unifying them to support both tasks. Extensive experiments demonstrate that UDS not only outperforms baseline methods in generating 3D assets with richer details but also excels in editing tasks, thereby bridging the gap between 3D generation and editing.
Paperid:3278
Authors:Leonard Papenmeier · Matthias Poloczek · Luigi Nardi
Abstract:
Recent work reported that simple Bayesian optimization methods perform well for high-dimensional real-world tasks, seemingly contradicting prior work and tribal knowledge. This paper investigates the ‘why’. We identify fundamental challenges that arise in high-dimensional Bayesian optimization and explain why recent methods succeed. Our analysis shows that vanishing gradients caused by Gaussian process initialization schemes play a major role in the failures of high-dimensional Bayesian optimization and that methods that promote local search behaviors are better suited for the task. We find that maximum likelihood estimation of Gaussian process length scales suffices for state-of-the-art performance. Based on this, we propose a simple variant of maximum likelihood estimation called MSR that takes these findings into account. We evaluate MSR on a comprehensive set of real-world applications and a series of targeted experiments that illustrate and confirm our findings.
Paperid:3279
Authors:Zhe Zhao · HaiBin Wen · Pengkun Wang · ShuangWang · Zhenkun Wang · Qingfu Zhang · Yang Wang
Abstract:
Long-tailed distribution datasets are prevalent in many machine learning tasks, yet existing neural network models still face significant challenges when handling such data. This paper proposes a novel adaptive pruning strategy, LTAP (Long-Tailed Adaptive Pruner), aimed at balancing model efficiency and performance to better address the challenges posed by long-tailed data distributions. LTAP introduces multi-dimensional importance scoring criteria and designs a dynamic weight adjustment mechanism to adaptively determine the pruning priority of parameters for different classes. By focusing on protecting parameters critical for tail classes, LTAP significantly enhances computational efficiency while maintaining model performance. This method combines the strengths of long-tailed learning and neural network pruning, overcoming the limitations of existing approaches in handling imbalanced data. Extensive experiments demonstrate that LTAP outperforms existing methods on various long-tailed datasets, achieving a good balance between model compression rate, computational efficiency, and classification accuracy. This research provides new insights into solving model optimization problems in long-tailed learning and is significant for improving the performance of neural networks on imbalanced datasets. The code is available at \url{https://anonymous.4open.science/r/AEFCDAISJ/README.md}.
Paperid:3280
Authors:Dixian Zhu · Tianbao Yang · Livnat Jerby
Abstract:
Regression is a fundamental task in machine learning that has garnered extensive attention over the past decades. The conventional approach for regression involves employing loss functions that primarily concentrate on aligning model prediction with the ground truth for each individual data sample. Recent research endeavors have introduced novel perspectives by incorporating label similarity into regression through the imposition of additional pairwise regularization or contrastive learning on the latent feature space, demonstrating their effectiveness. However, there are two drawbacks to these approaches: (i) their pairwise operations in the latent feature space are computationally more expensive than conventional regression losses; (ii) they lack theoretical insights behind these methods. In this work, we propose GAR (Gradient Aligned Regression) as a competitive alternative method in label space, which is constituted by a conventional regression loss and two pairwise label difference losses for gradient alignment, including magnitude and direction. GAR enjoys: i) the same level of efficiency as the conventional regression loss, because the quadratic complexity of the proposed pairwise losses can be reduced to linear complexity; ii) theoretical insights from learning the pairwise label difference to learning the gradient of the ground truth function. We limit our current scope to regression on clean data, without noise, outliers, or distributional shifts. We demonstrate the effectiveness of the proposed method practically on two synthetic datasets and on eight extensive real-world tasks from six benchmark datasets against eight competitive baselines. Running time experiments demonstrate the superior efficiency of the proposed GAR compared to existing methods with pairwise regularization or contrastive learning in the latent feature space. Additionally, ablation studies confirm the effectiveness of each component of GAR.
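The quadratic-to-linear reduction claimed in (i) can be illustrated on a squared pairwise difference loss. The identity used here is standard algebra applied to a generic pairwise loss, not necessarily the exact loss in the paper:

```python
def pairwise_diff_loss_quadratic(d):
    """O(n^2): sum over all ordered pairs (i, j) of (d_i - d_j)^2,
    where d could be per-sample residuals or label differences."""
    return sum((di - dj) ** 2 for di in d for dj in d)

def pairwise_diff_loss_linear(d):
    """Same quantity in O(n), via the expansion
    sum_{i,j} (d_i - d_j)^2 = 2n * sum_i d_i^2 - 2 * (sum_i d_i)^2."""
    n = len(d)
    return 2 * n * sum(x * x for x in d) - 2 * sum(d) ** 2
```

Both functions compute the same value, but only the second scales to large batches, which is the kind of reduction that lets a pairwise objective match the cost of a plain regression loss.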
Paperid:3281
Authors:Yuena Lin · Haichun Cai · Jun-Yi Hang · Haobo Wang · Zhen Yang · Gengyu Lyu
Abstract:
Graph contrastive learning (GCL) aims at narrowing positives while dispersing negatives, often causing a minority of samples with great similarities to gather as a small group. This results in two latent shortcomings in GCL: 1) local cohesion, where a class cluster contains numerous independent small groups, and 2) global sparseness, where these small groups (or isolated samples) are dispersedly distributed among all clusters. These shortcomings make the learned distribution focus only on local similarities among partial samples, which hinders the ability to capture the ideal global structural properties among real clusters, especially high intra-cluster compactness and inter-cluster separateness. Considering this, we design a novel fuzzy boundary by extending the original cluster boundary with fuzzy set theory, which involves fuzzy boundary construction and fuzzy boundary contraction to address these shortcomings. The fuzzy boundary construction dilates the original boundaries to bridge the local groups, and the fuzzy boundary contraction forces the dispersed samples or groups within the fuzzy boundary to gather tightly, jointly mitigating local cohesion and global sparseness while forming the ideal global structural distribution. Extensive experiments demonstrate that a graph auto-encoder with the fuzzy boundary significantly outperforms current state-of-the-art GCL models in both downstream tasks and quantitative analysis.
Paperid:3282
Authors:Kai Liu · bowen xu · Shaoyu Wu · Xin Chen · Hao Zhou · Yongliang Tao · lulu hu
Abstract:
Activation sparsity can reduce the computational overhead and memory transfers during the forward pass of Large Language Model (LLM) inference. Existing methods face limitations, either demanding time-consuming recovery training that hinders real-world adoption, or relying on empirical magnitude-based pruning, which causes fluctuating sparsity and unstable inference speed-up. This paper introduces LaRoSA (Layerwise Rotated Sparse Activation), a novel method for activation sparsification designed to improve LLM efficiency without requiring additional training or magnitude-based pruning. We leverage layerwise orthogonal rotations to transform input activations into rotated forms that are more suitable for sparsification. By employing a Top-K selection approach within the rotated activations, we achieve consistent model-level sparsity and reliable wall-clock time speed-up. LaRoSA is effective across various sizes and types of LLMs, demonstrating minimal performance degradation and robust inference acceleration. Specifically, for LLaMA2-7B at 40\% sparsity, LaRoSA achieves a mere 0.17 perplexity gap with a consistent 1.30× wall-clock time speed-up, and reduces the accuracy gap in zero-shot tasks compared to the dense model to just 0.54\%, while surpassing TEAL by 1.77\% and CATS by 17.14\%.
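The Top-K selection step can be sketched as follows; this toy function omits the orthogonal rotation (which would be applied to the activation vector first) and is an illustrative reading of the abstract, not the authors' code:

```python
def topk_sparsify(activations, k):
    """Keep the k largest-magnitude entries of an activation vector and
    zero the rest, yielding a fixed sparsity of 1 - k/len(activations)
    regardless of the activation magnitudes (unlike threshold pruning)."""
    if k >= len(activations):
        return list(activations)
    # k-th largest absolute value acts as the cutoff for this vector
    threshold = sorted((abs(a) for a in activations), reverse=True)[k - 1]
    out, kept = [], 0
    for a in activations:
        if abs(a) >= threshold and kept < k:  # kept counter breaks ties
            out.append(a)
            kept += 1
        else:
            out.append(0.0)
    return out

sparse = topk_sparsify([0.1, -2.0, 0.5, 3.0], k=2)  # 50% sparsity
```

Because K is fixed per layer, every forward pass zeroes the same fraction of activations, which is what makes the wall-clock speed-up consistent rather than input-dependent.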
Paperid:3283
Authors:Connor Lawless · Lily Weng · Berk Ustun · Madeleine Udell
Abstract:
Machine learning models are designed to predict outcomes using features about an individual, but fail to take into account how individuals can change those features. Consequently, models can assign fixed predictions that deny individuals recourse to change their outcome. This work develops a new paradigm to identify fixed predictions by finding confined regions in which all individuals receive fixed predictions. We introduce the first method, ReVer, for this task, using tools from mixed-integer quadratically constrained programming. Our approach certifies recourse for out-of-sample data, provides interpretable descriptions of confined regions, and runs in seconds on real-world datasets. We conduct a comprehensive empirical study of confined regions across diverse applications. Our results highlight that existing pointwise verification methods fail to discover confined regions, while ReVer provably succeeds.
Paperid:3284
Authors:Vivienne Huiling Wang · Tinghuai Wang · Joni Pajarinen
Abstract:
Hierarchical reinforcement learning (HRL) learns to make decisions on multiple levels of temporal abstraction. A key challenge in HRL is that the low-level policy changes over time, making it difficult for the high-level policy to generate effective subgoals. To address this issue, the high-level policy must capture a complex subgoal distribution while also accounting for uncertainty in its estimates. We propose an approach that trains a conditional diffusion model regularized by a Gaussian Process (GP) prior to generate a complex variety of subgoals while leveraging principled GP uncertainty quantification. Building on this framework, we develop a strategy that selects subgoals from both the diffusion policy and the GP's predictive mean. Our approach outperforms prior HRL methods in both sample efficiency and performance on challenging continuous control benchmarks.
Paperid:3285
Authors:Yihan Du · Anna Winnicki · Gal Dalal · Shie Mannor · R Srikant
Abstract:
Standard reinforcement learning (RL) assumes that an agent can observe a reward for each state-action pair. However, in practical applications, it is often difficult and costly to collect a reward for each state-action pair. While there have been several works considering RL with trajectory feedback, it is unclear if trajectory feedback is inefficient for learning when trajectories are long. In this work, we consider a model named RL with segment feedback, which offers a general paradigm filling the gap between per-state-action feedback and trajectory feedback. In this model, we consider an episodic Markov decision process (MDP), where each episode is divided into $m$ segments, and the agent observes reward feedback only at the end of each segment. Under this model, we study two popular feedback settings: binary feedback and sum feedback, where the agent observes a binary outcome and a reward sum according to the underlying reward function, respectively. To investigate the impact of the number of segments $m$ on learning performance, we design efficient algorithms and establish regret upper and lower bounds for both feedback settings. Our theoretical and experimental results show that: under binary feedback, increasing the number of segments $m$ decreases the regret at an exponential rate; in contrast, surprisingly, under sum feedback, increasing $m$ does not reduce the regret significantly.
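The segment-feedback observation model is simple to illustrate: per-step rewards stay hidden, and only per-segment reward sums are revealed. A minimal sketch with hypothetical rewards (the paper's MDP and its binary-feedback variant are not reproduced here):

```python
import numpy as np

def segment_sum_feedback(rewards, m):
    """Hide per-step rewards; reveal only the reward sum observed at the
    end of each of m (roughly equal-length) segments of the episode."""
    segments = np.array_split(np.asarray(rewards, dtype=float), m)
    return [float(seg.sum()) for seg in segments]

rewards = [1.0, 0.0, 2.0, 1.0, 0.5, 0.5]   # hypothetical per-step rewards
print(segment_sum_feedback(rewards, m=3))  # [1.0, 3.0, 1.0]
```

Setting `m = 1` recovers trajectory feedback, while `m` equal to the episode length recovers per-state-action feedback, which is exactly the interpolation the abstract describes.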
Paperid:3286
Authors:Yang Li · Yuchao Ma · Qi Qi
Abstract:
Online advertising is a vital revenue source for major internet platforms. Recently, joint advertising, which assigns a bundle of one store and one brand to an ad slot rather than allocating a single store, has emerged as an effective method for enhancing allocation efficiency and revenue. However, existing mechanisms for joint advertising fail to achieve optimality, as they tend to focus on individual advertisers and overlook bundle structures. This paper identifies an optimal mechanism for joint advertising in a single-slot setting. For multi-slot joint advertising, we propose BundleNet, a novel bundle-based neural network approach specifically designed for joint advertising. Our extensive experiments demonstrate that the mechanisms generated by BundleNet approximate the theoretical analysis results in the single-slot setting and achieve state-of-the-art performance in the multi-slot setting. This significantly increases platform revenue while ensuring approximate dominant strategy incentive compatibility and individual rationality.
Paperid:3287
Authors:Yuqin Dai · Zhouheng Yao · Chunfeng Song · Qihao Zheng · Weijian Mai · Kunyu Peng · Shuai Lu · Wanli Ouyang · Jian Yang · Jiamin Wu
Abstract:
Brain decoding aims to reconstruct the visual perception of a human subject from fMRI signals, which is crucial for understanding the brain's perception mechanisms. Existing methods are confined to the single-subject paradigm due to substantial brain variability, which leads to weak generalization across individuals and incurs high training costs, exacerbated by the limited availability of fMRI data. To address these challenges, we propose MindAligner, an explicit functional alignment framework for cross-subject brain decoding from limited fMRI data. The proposed MindAligner enjoys several merits. First, we learn a Brain Transfer Matrix (BTM) that projects the brain signals of an arbitrary new subject onto those of one of the known subjects, enabling seamless use of pre-trained decoding models. Second, to facilitate reliable BTM learning, a Brain Functional Alignment module is proposed to perform soft cross-subject brain alignment under different visual stimuli with a multi-level brain alignment loss, uncovering fine-grained functional correspondences with high interpretability. Experiments indicate that MindAligner not only outperforms existing methods in visual decoding under data-limited conditions, but also provides valuable neuroscience insights in cross-subject functional analysis. The code will be made publicly available.
Paperid:3288
Authors:Guopeng Lin · Ruisheng Zhou · Shuyu Chen · Weili Han · Jin Tan · Wenjing Fang · Lei Wang · Tao Wei
Abstract:
K-nearest neighbors (KNN) classification plays a significant role in various applications due to its interpretability. The accuracy of KNN classification relies heavily on large amounts of high-quality data, which are often distributed among different parties and contain sensitive information. Dozens of privacy-preserving frameworks have been proposed for performing KNN classification with data from different parties while preserving data privacy. However, existing privacy-preserving frameworks for KNN classification demonstrate communication inefficiency in the online phase due to two main issues: (1) They suffer from huge communication sizes for secure Euclidean square distance computations. (2) They require numerous communication rounds to select the $k$ nearest neighbors. These two issues significantly impede their practical deployment. In this paper, we present Kona, an efficient privacy-preserving framework for KNN classification. We resolve the above communication issues by (1) designing novel Euclidean triples, which eliminate the online communication for secure Euclidean square distance computations, and (2) proposing a divide-and-conquer bubble protocol, which significantly reduces communication rounds for selecting the $k$ nearest neighbors. Experimental results on eight widely used datasets demonstrate that Kona significantly outperforms the state-of-the-art framework by $1.1\times \sim 3121.2\times$ in communication size, $19.1\times \sim 5783.2\times$ in communication rounds, and $1.1\times \sim 232.6\times$ in runtime.
Paperid:3289
Authors:Mingyu Jin · Kai Mei · Wujiang Xu · Mingjie Sun · Ruixiang Tang · Mengnan Du · Zirui Liu · Yongfeng Zhang
Abstract:
Large language models (LLMs) have achieved remarkable success in contextual knowledge understanding. In this paper, we show for the first time that concentrated massive values consistently emerge in specific regions of attention queries (Q) and keys (K), while no such patterns appear in values (V), across various modern transformer-based LLMs. Through extensive experiments, we further demonstrate that these massive values play a critical role in interpreting contextual knowledge (i.e., knowledge obtained from the current context window) rather than in retrieving parametric knowledge stored within the model’s parameters. Our further investigation of quantization strategies reveals that ignoring these massive values leads to a pronounced drop in performance on tasks requiring rich contextual understanding, aligning with our analysis. Finally, we trace the emergence of concentrated massive values and find that such concentration is caused by Rotary Positional Encoding (RoPE) and that it appears from the very first layers. These findings shed new light on how Q and K operate in LLMs and offer practical insights for model design and optimization. The code is available at https://anonymous.4open.science/r/RopewithLLM-F102.
Paperid:3290
Authors:Ojash Neopane · Aarti Singh · Aaditya Ramdas
Abstract:
Estimation and inference for the Average Treatment Effect (ATE) is a cornerstone of causal inference and often serves as the foundation for more advanced procedures. While traditionally analyzed in a batch setting, recent advances in martingale theory have paved the way for adaptive methods that can enhance the power of downstream inference. Despite these advances, progress in understanding and developing adaptive algorithms remains in its early stages. Existing works either focus on asymptotic analyses that overlook exploration-exploitation trade-offs relevant in finite-sample regimes or rely on suboptimal estimation procedures. In this work, we address these limitations by studying adaptive sampling procedures leveraging the asymptotically optimal Augmented Inverse Probability Weighting (AIPW) estimator. Our analysis uncovers challenges obscured by asymptotic approaches and introduces a novel algorithmic design principle reminiscent of optimism in multi-armed bandits. This principled approach enables our algorithm to achieve significant theoretical and empirical gains compared to prior methods. Our findings mark a step forward in advancing adaptive causal inference methods in theory and practice.
Paperid:3291
Authors:Ziqian Zhang · Bohan Yang · Lihe Li · Yuqi Bian · Ruiqi Xue · Feng Chen · Yi-Chen Li · lei yuan · Yang Yu
Abstract:
The policy trained by reinforcement learning (RL) makes decisions based on state features collected from sensors. It is common for state features to evolve for reasons such as periodic sensor maintenance or the addition of new sensors for performance improvement. The deployed policy fails in the new state space when state features are unseen during training. Previous work tackles this challenge by training a sensor-invariant policy or generating multiple policies and selecting the appropriate one with limited samples. However, both directions struggle to guarantee performance when faced with unpredictable evolutions. In this paper, we formalize this problem as state evolvable reinforcement learning (SERL), where the agent is required to mitigate policy degradation after state evolutions without costly exploration. We propose **Lapse** by reusing policies learned from the old state space in two distinct aspects. On one hand, Lapse directly reuses the *robust* old policy by composing it with a learned state reconstruction model to handle vanishing sensors. On the other hand, the behavioral experience from the old policy is reused by Lapse to train a newly adaptive policy through offline learning, better utilizing new sensors. To leverage the advantages of both policies in different scenarios, we further propose *automatic ensemble weight adjustment* to effectively aggregate them. Theoretically, we justify that robust policy reuse helps mitigate uncertainty and error from both evolution and reconstruction. Empirically, Lapse achieves a significant performance improvement, outperforming the strongest baseline by about $2\times$ in benchmark environments.
Paperid:3292
Authors:Liming Shen · Liang Deng · Chongke Bi · Yu Wang · Xinhai Chen · Yueqing Wang · Jie Liu
Abstract:
Implicit neural representation (INR) has now been thrust into the limelight with its flexibility in high-fidelity flow field reconstruction tasks. However, the lack of standard benchmarking datasets and the grid independence assumption for INR-based methods hinder progress and adoption in real-world simulation scenarios. Moreover, naive adoptions of existing INR frameworks suffer from limited accuracy in capturing fine-scale structures and spatiotemporal dynamics. Tackling these issues, we first introduce HFR-Beach, a 5.4 TB public large-scale CFD dataset with 33,600 unsteady 2D and 3D vector fields for reconstructing high-fidelity flow fields. We further present PEINR, a physics-enhanced INR framework, to enrich the flow fields by concurrently enhancing numerical precision and grid resolution. Specifically, PEINR is mainly composed of physical encoding and a transformer-based spatiotemporal fuser (TransSTF). Physical encoding decouples temporal and spatial components, employing Gaussian coordinate encoding and localized encoding techniques to capture the nonlinear characteristics of spatiotemporal dynamics and the stencil discretization of spatial dimensions, respectively. TransSTF fuses spatial and temporal information via a transformer to capture long-range temporal dependencies. Qualitative and quantitative experiments demonstrate that PEINR outperforms state-of-the-art INR-based methods in reconstruction quality.
Paperid:3293
Authors:Disen Lan · Weigao Sun · Jiaxi Hu · Jusen Du · Yu Cheng
Abstract:
Transformers with linear recurrent modeling offer linear-time training and constant-memory inference. Despite their demonstrated efficiency and performance, pretraining such non-standard architectures from scratch remains costly and risky. The linearization of large language models (LLMs) transforms pretrained standard models into linear recurrent structures, enabling more efficient deployment. However, current linearization methods typically introduce additional feature map modules that require extensive fine-tuning and overlook the gating mechanisms used in state-of-the-art linear recurrent models. To address these issues, this paper presents \textbf{Liger}, short for \underline{\textbf{li}}nearizing LLMs with \underline{\textbf{g}}at\underline{\textbf{e}}d \underline{\textbf{r}}ecurrent structures. Liger is a novel approach for converting pretrained LLMs into gated linear recurrent models without adding extra parameters. It repurposes the pretrained key matrix weights to construct diverse gating mechanisms, facilitating the formation of various gated recurrent structures while avoiding the need to train additional components from scratch. Using lightweight fine-tuning with Low-Rank Adaptation (LoRA), Liger restores the performance of the linearized gated recurrent models to match that of the original LLMs. Additionally, we introduce Liger Attention, an intra-layer hybrid attention mechanism, which recovers 93\% of the Transformer-based LLM's performance using only 0.02\% of the pre-training tokens during the linearization process, achieving competitive results across multiple benchmarks, as validated on models ranging from 1B to 8B parameters. Code is available at: \url{https://anonymous.4open.science/r/Liger-anonymous-36A2}.
Paperid:3294
Authors:Alessandro Favero · Antonio Sclocchi · Francesco Cagnetta · Pascal Frossard · Matthieu Wyart
Abstract:
Natural data is often organized as a hierarchical combination of features. How many samples are needed for a generative model to learn the rules governing how these features are composed, so as to produce a combinatorial number of novel data? What signal in the data is exploited to learn? We investigate these questions both theoretically and empirically. Theoretically, we consider diffusion models trained on simple probabilistic context-free grammars - tree-like graphical models used to represent the structure of data such as language and images. We demonstrate that diffusion models learn compositional rules with the sample complexity required for clustering together features with statistically similar contexts. This clustering emerges hierarchically: higher-level, more abstract structures require more data to be identified. This mechanism leads to a sample complexity that scales polynomially, and not exponentially, with the dimension of the structure considered. We thus predict that diffusion models trained on an intermediate dataset size generate data that exhibit local coherence up to a certain scale, but lack global coherence. Finally, we design empirical methods to measure coherence systematically in data generated by diffusion models trained in different domains, including images and texts. We find remarkable agreement with our predictions: both generated text and images achieve progressively larger coherence lengths as the training time or dataset size grows.
Paperid:3295
Authors:Adam Breuer
Abstract:
In this paper, we provide the first practical algorithms with provable guarantees for the problem of inferring the topics assigned to each document in an LDA topic model. This is the primary inference problem for many applications of topic models in social science, data exploration, and causal inference settings. We obtain this result via a novel non-gradient-based, combinatorial approach to estimating topic models. This yields algorithms that converge to near-optimal posterior probability in logarithmic parallel computation time (adaptivity)---exponentially faster than any known LDA algorithm. We also show that our approach can provide interpretability guarantees such that each learned topic is formally associated with a known keyword. Finally, we show that unlike alternatives, our approach can maintain the independence assumptions necessary to use the learned topic model for downstream causal inference methods that allow researchers to study topics as treatments. In terms of practical performance, our approach consistently returns solutions of higher semantic quality than solutions from state-of-the-art LDA algorithms, neural topic models, and LLM-based topic models across a diverse range of text datasets and evaluation parameters.
Paperid:3296
Authors:Jiajun Chen · Jin Tian · Christopher J Quinn
Abstract:
Artificial intelligence plays a significant role in society and can impact people with legal or ethical consequences, such as in healthcare, law enforcement, education, and hiring. Despite the numerous fairness criteria that have been proposed in the machine learning community, there remains limited investigation into fairness as defined through sensitive attributes (e.g., gender, race, or age) in a sequential decision-making framework. In this paper, we focus on causal logistic bandit problems where the learner seeks to make fair decisions, under a notion of fairness that accounts for counterfactual reasoning. We propose and analyze an algorithm by leveraging primal-dual optimization for constrained causal logistic bandits, where the non-linear constraints are a priori unknown and must be learned over time. We obtain sub-linear regret guarantees with leading terms similar to the lower bound for the unconstrained problem, while guaranteeing sub-linear constraint violations. We show how to achieve zero cumulative constraint violations with a small increase in the regret bound.
Paperid:3297
Authors:Shivam Kumar · Yun Yang · Lizhen Lin
Abstract:
In this work, we explore the theoretical properties of conditional deep generative models under the statistical framework of distribution regression where the response variable lies in a high-dimensional ambient space but concentrates around a potentially lower-dimensional manifold. More specifically, we study the large-sample properties of a likelihood-based approach for estimating these models. Our results lead to the convergence rate of a sieve maximum likelihood estimator (MLE) for estimating the conditional distribution (and its devolved counterpart) of the response given predictors in the Hellinger (Wasserstein) metric. Our rates depend solely on the intrinsic dimension and smoothness of the true conditional distribution. These findings provide an explanation of why conditional deep generative models can circumvent the curse of dimensionality from the perspective of statistical foundations and demonstrate that they can learn a broader class of nearly singular conditional distributions. Our analysis also emphasizes the importance of introducing a small noise perturbation to the data when they are supported sufficiently close to a manifold. Finally, in our numerical studies, we demonstrate the effective implementation of the proposed approach using both synthetic and real-world datasets, which also provide complementary validation to our theoretical findings.
Paperid:3298
Authors:Seungyul Han · Sunwoo Lee · Jaebak Hwang · Yonghyeon Jo
Abstract:
Traditional robust methods in multi-agent reinforcement learning (MARL) often struggle against coordinated adversarial attacks in cooperative scenarios. To address this limitation, we propose the Wolfpack Adversarial Attack framework, inspired by wolf hunting strategies, which targets an initial agent and its assisting agents to disrupt cooperation. Additionally, we introduce the Wolfpack-Adversarial Learning for MARL (WALL) framework, which trains robust MARL policies to defend against the proposed Wolfpack attack by fostering system-wide collaboration. Experimental results underscore the devastating impact of the Wolfpack attack and the significant robustness improvements achieved by WALL.
Paperid:3299
Authors:Taesoo Kim · Jinju Kim · Dong Kim · Jong Hwan Ko · Gyeong-Moon Park
Abstract:
The rapid advancement of Zero-Shot Text-to-Speech (ZS-TTS) technology has enabled high-fidelity voice synthesis from minimal audio cues, raising significant privacy and ethical concerns. Despite the threats to voice privacy, research on selectively removing the knowledge needed to replicate unwanted individual voices from pre-trained model parameters has not been explored. In this paper, we address the new challenge of speaker identity unlearning for ZS-TTS systems. To meet this goal, we propose the first machine unlearning frameworks for ZS-TTS, especially Teacher-Guided Unlearning (TGU), designed to ensure the model forgets designated speaker identities while retaining its ability to generate accurate speech for other speakers. Our proposed methods incorporate randomness to prevent consistent replication of forget speakers' voices, assuring unlearned identities remain untraceable. Additionally, we propose a new evaluation metric, speaker-Zero Retrain Forgetting (spk-ZRF). This assesses the model's ability to disregard prompts associated with forgotten speakers, effectively neutralizing its knowledge of these voices. The experiments conducted on the state-of-the-art model demonstrate that TGU prevents the model from replicating forget speakers' voices while maintaining high quality for other speakers. The demo is available at https://speechunlearn.github.io/.
Paperid:3300
Authors:Zhongweiyang Xu · Xulin Fan · Zhong-Qiu Wang · Xilin Jiang · Romit Roy Choudhury
Abstract:
Blind Speech Separation (BSS) aims to separate multiple speech sources from audio mixtures recorded by a microphone array. The problem is challenging because it is a blind inverse problem, i.e., the microphone array geometry, the room impulse response (RIR), and the speech sources are all unknown. We propose ArrayDPS to solve the BSS problem in an unsupervised, array-agnostic, and generative manner. The core idea builds on diffusion posterior sampling (DPS), but unlike DPS where the likelihood is tractable, ArrayDPS must approximate the likelihood by formulating a separate optimization problem. The solution to the optimization approximates room acoustics and the relative transfer functions between microphones. These approximations, along with the diffusion priors, iterate through the ArrayDPS sampling process and ultimately yield separated voice sources. We only need a simple single-speaker speech diffusion model as a prior along with the mixtures recorded at the microphones; no microphone array information is necessary. Evaluation results show that ArrayDPS outperforms all baseline unsupervised methods while being comparable to supervised methods in terms of SDR. Audio demos are provided at: https://arraydps.github.io/ArrayDPSDemo/.
Paperid:3301
Authors:CHUANQI CHENG · Jian Guan · Wei Wu · Rui Yan
Abstract:
Long-form video processing fundamentally challenges vision-language models (VLMs) due to the high computational costs of handling extended temporal sequences. Existing token pruning and feature merging methods often sacrifice critical temporal dependencies or dilute semantic information. We introduce differential distillation, a principled approach that systematically preserves task-relevant information while suppressing redundancy. Based on this principle, we develop ViLaMP, a hierarchical video-language model that processes hour-long videos at ``mixed precision'' through two key mechanisms: (1) differential keyframe selection that maximizes query relevance while maintaining temporal distinctiveness at the frame level and (2) differential feature merging that preserves query-salient features in non-keyframes at the patch level. Hence, ViLaMP retains full information in keyframes while reducing non-keyframes to their most salient features, resembling mixed-precision training. Extensive experiments demonstrate ViLaMP's superior performance across four video understanding benchmarks, particularly on long-form content. Notably, ViLaMP can process ultra-long videos (up to 10K frames) on a single NVIDIA A100 GPU, achieving substantial computational efficiency while maintaining state-of-the-art performance.
Paperid:3302
Authors:Artavazd Maranjyan · El Mehdi Saad · Peter Richtarik · Francesco Orabona
Abstract:
Asynchronous methods are fundamental for parallelizing computations in distributed machine learning. They aim to accelerate training by fully utilizing all available resources. However, their greedy approach can lead to inefficiencies, using more computation than required, especially when computation times vary across devices. If the computation times were known in advance, training could be fast and resource-efficient by assigning more tasks to faster workers. The challenge lies in achieving this optimal allocation without prior knowledge of the computation time distributions. In this paper, we propose ATA (Adaptive Task Allocation), a method that adapts to heterogeneous and random distributions of worker computation times. Through rigorous theoretical analysis, we show that ATA identifies the optimal task allocation and performs comparably to methods with prior knowledge of computation times. Experimental results further demonstrate that ATA is resource-efficient, significantly reducing costs compared to the greedy approach, which can be arbitrarily expensive depending on the number of workers.
Paperid:3303
Authors:Cheol Kim · Jai Moondra · Shresth Verma · Madeleine Pollack · Lingkai Kong · Milind Tambe · Swati Gupta
Abstract:
In many real-world applications of Reinforcement Learning (RL), deployed policies have varied impacts on different stakeholders, creating challenges in reaching consensus on how to effectively aggregate their preferences. Generalized $p$-means form a widely used class of social welfare functions for this purpose, with broad applications in fair resource allocation, AI alignment, and decision-making. This class includes well-known welfare functions such as Egalitarian, Nash, and Utilitarian welfare. However, selecting the appropriate social welfare function is challenging for decision-makers, as the structure and outcomes of optimal policies can be highly sensitive to the choice of $p$. To address this challenge, we study the concept of an $\alpha$-approximate portfolio in RL, a set of policies that are approximately optimal across the family of generalized $p$-means for all $p \in [-\infty, 1]$. We propose algorithms to compute such portfolios and provide theoretical guarantees on the trade-offs among approximation factor, portfolio size, and computational efficiency. Experimental results on synthetic and real-world datasets demonstrate the effectiveness of our approach in summarizing the policy space induced by varying $p$ values, empowering decision-makers to navigate this landscape more effectively.
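The generalized $p$-means referenced in this abstract have simple closed forms at the named special cases. A standalone sketch of the welfare function itself (not the authors' portfolio algorithm):

```python
import math

def p_mean_welfare(utilities, p):
    """Generalized p-mean of positive utilities.

    p = 1      -> Utilitarian welfare (arithmetic mean)
    p -> 0     -> Nash welfare (geometric mean, taken as the limit)
    p = -inf   -> Egalitarian welfare (minimum utility)
    """
    n = len(utilities)
    if p == float("-inf"):
        return min(utilities)
    if p == 0:
        return math.exp(sum(math.log(u) for u in utilities) / n)
    return (sum(u ** p for u in utilities) / n) ** (1.0 / p)

u = [1.0, 4.0, 9.0]
print(p_mean_welfare(u, 1))             # arithmetic mean
print(p_mean_welfare(u, 0))             # geometric mean, 36**(1/3)
print(p_mean_welfare(u, float("-inf"))) # minimum, 1.0
```

The sensitivity the abstract mentions is easy to see here: the same utility profile scores 4.67 under $p=1$ but only 1.0 under $p=-\infty$, so the policy that maximizes one can be far from optimal for the other.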
Paperid:3304
Authors:Yuan Li · Zhengzhong Liu · Eric Xing
Abstract:
Optimizing data mixtures for supervised fine-tuning (SFT) of large language models (LLMs) is critical for developing general-purpose models, yet this area remains underexplored. In this paper, we frame data mixing as an optimization problem and introduce a novel method designed to minimize validation loss. Our approach parametrizes the loss by modeling effective data transferred and leveraging scaling laws for fine-tuning. By experimenting with various small-scale data mixtures, we fit these parameters and derive the optimal weights. We provide both mathematical proofs and empirical results demonstrating that our algorithm achieves excellent overall and individual performance across all domains. Through controlled experiments, we show that models trained with our optimized weights perform on par with those using optimal weights determined via grid search, with per-domain loss only 0.66\% higher than the best domain loss from grid search on average. Additionally, we show that reweighting popular SFT datasets using our method improves both validation loss and downstream performance. Finally, we discuss how our method can generalize to guide data selection for domain-specific models and provide insights into SFT.
Paperid:3305
Authors:Guan Zhong · Likang Wu · Hongke Zhao · Ming He · Jianping Fan
Abstract:
Attention mechanisms are critical to the success of large language models (LLMs), driving significant advancements in multiple fields. However, for graph-structured data, which requires emphasis on topological connections, they fall short compared to message-passing mechanisms on fixed links, such as those employed by Graph Neural Networks (GNNs). This raises a question: ``Does attention fail for graphs in natural language settings?'' Motivated by these observations, we embarked on an empirical study from the perspective of attention mechanisms to explore how LLMs process graph-structured data. The goal is to gain deeper insights into the attention behavior of LLMs over graph structures. Through a series of experiments, we uncovered unique phenomena regarding how LLMs apply attention to graph-structured data and analyzed these findings to improve the modeling of such data by LLMs. The primary findings of our research are: 1) While LLMs can recognize graph data and capture text-node interactions, they struggle to model inter-node relationships within graph structures due to inherent architectural constraints. 2) The attention distribution of LLMs across graph nodes does not align with ideal structural patterns, indicating a failure to adapt to graph topology nuances. 3) Neither fully connected attention (as in LLMs) nor fixed connectivity (as in GNNs) is optimal; each has specific limitations in its application scenarios. Instead, intermediate-state attention windows improve LLM training performance and seamlessly transition to fully connected windows during inference. Source code: \href{https://anonymous.4open.science/r/LLM_exploration-B21F}{anonymous.4open.science/LLM_exploration-B21F}
Paperid:3306
Authors:Ziyan Li · Naoki Hiratani
Abstract:
Continual learning of multiple tasks remains a major challenge for neural networks. Here, we investigate how task order influences continual learning and propose a strategy for optimizing it. Leveraging a linear teacher-student model with latent factors, we derive an analytical expression relating task similarity and ordering to learning performance. Our analysis reveals two principles that hold under a wide parameter range: (1) tasks should be arranged from the least representative to the most typical, and (2) adjacent tasks should be dissimilar. We validate these rules on both synthetic data and real-world image classification datasets (Fashion-MNIST, CIFAR-10, CIFAR-100), demonstrating consistent performance improvements in both multilayer perceptrons and convolutional neural networks. Our work thus presents a generalizable framework for task-order optimization in task-incremental continual learning.
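The two ordering principles can be turned into a simple greedy heuristic over a task-similarity matrix. This is an illustrative sketch only: the paper derives the principles analytically, and combining them greedily (least representative task first, then the task most dissimilar to the previous one) is an assumption of this toy, not the authors' procedure.

```python
import numpy as np

def order_tasks(S):
    """Greedy ordering from a (T, T) symmetric task-similarity matrix:
    start from the least representative task (lowest total similarity),
    then repeatedly append the task most dissimilar to the previous one."""
    remaining = set(range(S.shape[0]))
    first = min(remaining, key=lambda i: S[i].sum())       # principle 1
    order = [first]
    remaining.remove(first)
    while remaining:
        nxt = min(remaining, key=lambda i: S[order[-1], i])  # principle 2
        order.append(nxt)
        remaining.remove(nxt)
    return order

S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]])   # hypothetical similarities among 3 tasks
print(order_tasks(S))             # [2, 0, 1]: the outlier task comes first
```

With only three tasks the two principles can conflict (the similar pair 0 and 1 must end up adjacent somewhere); the greedy rule simply defers that conflict as long as possible.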
Paperid:3307
Authors:Sunny Sanyal · Hayden Prairie · Rudrajit Das · Ali Kavis · Sujay Sanghavi
Abstract:
Fine-tuning a pre-trained model on a downstream task often degrades its original capabilities, a phenomenon known as "catastrophic forgetting". This is especially an issue when one does not have access to the data and recipe used to develop the pre-trained model. Under this constraint, most existing methods for mitigating forgetting are inapplicable. To address this challenge, we propose a *sample weighting scheme for the fine-tuning data* solely based on the pre-trained model's losses. Specifically, we upweight the easy samples on which the pre-trained model's loss is low and vice versa to limit the drift from the pre-trained model. Our approach is orthogonal and yet complementary to existing methods; while such methods mostly operate on parameter or gradient space, we concentrate on the sample space. We theoretically analyze the impact of fine-tuning with our method in a linear setting, showing that it stalls learning in a certain subspace which inhibits overfitting to the target task. We empirically demonstrate the efficacy of our method on both language and vision tasks. As an example, when fine-tuning Gemma 2 2B on MetaMathQA, our method results in only a $0.8$% drop in accuracy on GSM8K (another math dataset) compared to standard fine-tuning, while preserving $5.4$% more accuracy on the pre-training datasets.
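The sample-weighting idea (upweight fine-tuning examples the pre-trained model already finds easy, downweight the hard ones) can be sketched as follows. The exponential-with-temperature form is an illustrative choice for this sketch, not the paper's exact scheme:

```python
import numpy as np

def loss_based_weights(pretrained_losses, temperature=1.0):
    """Weight samples by exp(-loss/T): low pre-trained-model loss ("easy"
    samples) gets high weight. Weights are normalized to mean 1 so the
    overall scale of the fine-tuning objective is unchanged."""
    z = np.exp(-np.asarray(pretrained_losses, dtype=float) / temperature)
    return z / z.sum() * len(z)

losses = np.array([0.2, 0.5, 3.0])  # hypothetical per-sample pre-trained losses
w = loss_based_weights(losses)      # largest weight goes to the 0.2-loss sample
# weighted fine-tuning objective would then be (w * per_sample_ft_loss).mean()
```

The `temperature` knob controls how aggressively the weighting suppresses high-loss samples; as it grows, the weights flatten back toward uniform (standard fine-tuning).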
Paperid:3308
Authors:Xiaoyu Wang · Wenhao Yang · Chang Yao · Mingli Song · Yuanyu Wan
Abstract:
Although the differential privacy (DP) of decentralized online learning has garnered considerable attention recently, existing algorithms are unsatisfactory due to their inability to achieve $(\epsilon, 0)$-DP over $T$ rounds, recover the optimal regret in the non-private case, and maintain the lightweight computation under complex constraints. To address these issues, we first propose a new decentralized online learning algorithm satisfying $(\epsilon, 0)$-DP over $T$ rounds, and show that it can achieve $\widetilde{O}(n(\rho^{-1/4}+\epsilon^{-1}\rho^{1/4})\sqrt{T})$ and $\widetilde{O}(n (\rho^{-1/2}+\epsilon^{-1}))$ regret bounds for convex and strongly convex functions respectively, where $n$ is the number of local learners and $\rho$ is the spectral gap of the communication matrix. As long as $\epsilon=\Omega(\sqrt{\rho})$, these bounds match existing lower bounds in the non-private case up to polylogarithmic factors, which implies that $(\epsilon, 0)$-DP of decentralized online learning may be ensured nearly for free. Our key idea for these improvements is to design a block-decoupled accelerated gossip strategy that can be incorporated with the classical tree-based private aggregation, and also enjoys a faster average consensus among local learners. Furthermore, we develop a projection-free variant of our algorithm, which keeps the efficiency even under complex constraints. As a trade-off, the above regret bounds degrade to $\widetilde{O}(n(T^{3/4}+\epsilon^{-1}T^{1/4}))$ and $\widetilde{O}(n(T^{2/3}+\epsilon^{-1}))$ respectively, which however are even better than the existing private centralized projection-free online algorithm.
Paperid:3309
Authors:Yuli Chen · Bo Cheng · Jiale Han · Yingying Zhang · Yingting Li · Shuhao Zhang
Abstract:
Pruning has recently been widely adopted to reduce the parameter scale and improve the inference efficiency of Large Language Models (LLMs). Mainstream pruning techniques often rely on uniform layerwise pruning strategies, which can lead to severe performance degradation at high sparsity levels. Recognizing the varying contributions of different layers in LLMs, recent studies have shifted their focus toward non-uniform layerwise pruning. However, these approaches often rely on pre-defined values, which can result in suboptimal performance. To overcome these limitations, we propose a novel method called Dynamic Layerwise Pruning (DLP). This approach adaptively determines the relative importance of each layer by integrating model weights with input activation information, assigning pruning rates accordingly. Experimental results show that DLP effectively preserves model performance at high sparsity levels across multiple LLMs. Specifically, at 70% sparsity, DLP reduces the perplexity of LLaMA2-7B by 7.79 and improves the average accuracy by 2.7% compared to state-of-the-art methods. Moreover, DLP is compatible with various existing LLM compression techniques and can be seamlessly integrated into Parameter-Efficient Fine-Tuning (PEFT). We release the code\footnote{The code is available at: \url{https://anonymous.4open.science/r/DLP}.} to facilitate future research.
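A minimal sketch of non-uniform layerwise magnitude pruning in this spirit; the importance score (mean weight magnitude times mean activation magnitude) and the allocation rule below are simplified stand-ins for DLP's actual criteria.

```python
import numpy as np

rng = np.random.default_rng(0)
layers = [rng.standard_normal((16, 16)) for _ in range(4)]  # toy weight matrices
acts = [rng.standard_normal(16) for _ in range(4)]          # stand-in input activations

# Hypothetical layer importance from weights and activations, normalized
scores = np.array([np.abs(W).mean() * np.abs(a).mean() for W, a in zip(layers, acts)])
imp = scores / scores.sum()

# Allocate per-layer pruning rates: less important layers are pruned harder,
# rescaled so the average rate matches the global sparsity target.
target = 0.7
raw = 1.0 - imp
rates = raw * (target * len(layers) / raw.sum())

# Magnitude pruning within each layer at its assigned rate
for W, r in zip(layers, rates):
    W[np.abs(W) < np.quantile(np.abs(W), r)] = 0.0
```

By construction the average of `rates` equals the global target, while individual layers deviate from it according to their estimated importance.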
Paperid:3310
Authors:Dongyeop Lee · Jinseok Chung · Kwanhee Lee · Namhoon Lee
Abstract:
Sparsified neural networks often suffer from seemingly inevitable performance degradation, and it remains challenging to restore the original performance despite much recent progress. Motivated by recent studies in robust optimization, we aim to tackle this problem by finding subnetworks that are both sparse and flat at the same time. Specifically, we formulate pruning as a sparsity-constrained optimization problem where flatness is encouraged as an objective. We solve it explicitly via an augmented Lagrange dual approach and extend it further by proposing a generalized projection operation, resulting in novel pruning methods called SAFE and its extension, SAFE$^+$. Extensive evaluations on standard image classification and language modeling tasks reveal that SAFE consistently yields sparse networks with improved generalization performance, which compares competitively to strong alternatives. We further show that SAFE exhibits practically desirable properties, such as robustness to noisy data and insensitivity to target sparsity, which render it suitable for efficient and general use.
Paperid:3311
Authors:Jie Hu · Yi-Ting Ma · Do-Young Eun
Abstract:
We propose a *history-driven target (HDT)* framework in Markov Chain Monte Carlo (MCMC) to improve any random walk algorithm on discrete state spaces, such as general undirected graphs, for efficient sampling from a target distribution $\boldsymbol{\mu}$. With broad applications in network science and distributed optimization, recent innovations like the self-repellent random walk (SRRW) achieve near-zero variance by prioritizing under-sampled states through transition kernel modifications based on past visit frequencies. However, SRRW's reliance on explicit computation of transition probabilities for all neighbors at each step introduces substantial computational overhead, while its strict dependence on time-reversible Markov chains excludes advanced non-reversible MCMC methods. To overcome these limitations, instead of directly modifying the transition kernel, HDT introduces a history-dependent target distribution $\boldsymbol{\pi}[\mathbf{x}]$ to replace the original target $\boldsymbol{\mu}$ in any graph sampler, where $\mathbf{x}$ represents the empirical measure of past visits. This design preserves lightweight implementation by requiring only local information between the current and proposed states, achieves compatibility with both reversible and non-reversible MCMC samplers, and retains unbiased samples from the target distribution $\boldsymbol{\mu}$ with near-zero variance performance. Extensive experiments in graph sampling demonstrate consistent performance gains, and a memory-efficient Least Recently Used (LRU) cache ensures scalability to large general graphs.
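A toy sketch of the mechanism on a four-state space: a Metropolis-Hastings sampler whose target is reweighted by the empirical visit measure, penalizing over-visited states. The polynomial reweighting and its strength `alpha` are illustrative assumptions; the paper's actual form of pi[x] may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.1, 0.2, 0.3, 0.4])  # original target distribution
alpha = 2.0                          # repellence strength (hypothetical)
counts = np.ones(4)                  # smoothed visit counts
state = 0

for _ in range(20000):
    x = counts / counts.sum()            # empirical measure of past visits
    pi = mu * (x / mu) ** (-alpha)       # history-dependent target (sketch)
    prop = rng.integers(4)               # uniform proposal over states
    # acceptance uses only the current and proposed states, as in HDT
    if rng.random() < min(1.0, pi[prop] / pi[state]):
        state = prop
    counts[state] += 1

empirical = counts / counts.sum()
```

At the fixed point x = mu the reweighting vanishes, so the chain's empirical measure is driven toward the original target mu.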
Paperid:3312
Authors:Seungwook Han · Jinyeop Song · Jeff Gore · Pulkit Agrawal
Abstract:
Autoregressive transformers exhibit adaptive learning through in-context learning (ICL), which raises the question of how. Prior works have shown that transformers represent the ICL tasks as vectors in their representations. In this paper, we leverage the encoding-decoding framework to study how transformers form task vectors during pretraining and how their task encoding quality predicts ICL task performance. On synthetic ICL tasks, we analyze the training dynamics of a small transformer and report the coupled emergence of task encoding and decoding. As the model learns to encode different latent tasks (e.g., "Finding the first noun in a sentence.") into distinct, separable representations, it concurrently builds conditional decoding algorithms and improves its ICL performance. We validate this phenomenon across pretrained models of varying scales (Gemma-2 2B/9B/27B, Llama-3.1 8B/70B) and over the course of pretraining in OLMo-7B. Further, we demonstrate that the quality of task encoding inferred from representations predicts ICL performance, and that, surprisingly, fine-tuning the earlier layers can improve the task encoding and performance more than fine-tuning the later layers. Our empirical insights shed light on the success and failure modes of large language models via their representations.
Paperid:3313
Authors:Chi Gao · Lukai Li · YANCHENG ZHOU · Shangqi Guo
Abstract:
Link prediction is a fundamental problem for network-structured data. However, the prevalent research paradigm tends to assume abundant observed links, overlooking more challenging scenarios with a scarcity of observed links, resulting in insufficient sample sizes. In real-world networks, hierarchical modularity, characterized by structured and nested connections, remains robust even with sparsely observed links. To address the challenge of limited link samples, we propose leveraging hierarchical modularity as a prior structure. We introduce complete-tree (CT) space, a discrete metric space with latent complete-tree structures, to formalize hierarchical modularity with an emphasis on its hierarchical permutation symmetry. Utilizing group theory to quantify and compare the permutation symmetries of different spaces, we prove that the CT space provides a significantly lower bound on sample complexity than the commonly used Euclidean space. We develop leaf matching, a data-efficient network embedding that maps nodes onto the CT space and conducts discrete optimization by reducing it to decentralized search. Experiments verify the data efficiency of CT space over other spaces. Moreover, leaf matching outperforms the state-of-the-art graph transformer in data-scarce scenarios while exhibiting excellent scalability.
Paperid:3314
Authors:Liangze Jiang · Damien Teney
Abstract:
Out-of-distribution (OOD) generalization is challenging because distribution shifts come in many forms. Numerous algorithms exist to address specific settings, but choosing the right training algorithm for the right dataset without trial and error is difficult. Indeed, real-world applications often involve multiple types and combinations of shifts that are difficult to analyze theoretically. Method. This work explores the possibility of learning the selection of a training algorithm for OOD generalization. We propose a proof of concept (OOD-Chameleon) that formulates the selection as a multi-label classification over candidate algorithms, trained on a dataset of datasets representing a variety of shifts. We evaluate the ability of OOD-Chameleon to rank algorithms on unseen shifts and datasets based only on dataset characteristics, i.e., without training models first, unlike traditional model selection. Findings. Extensive experiments show that the learned selector identifies high-performing algorithms across synthetic, vision, and language tasks. Further inspection shows that it learns non-trivial decision rules, which provides new insights into the applicability of existing algorithms. Overall, this new approach opens the possibility of better exploiting and understanding the plethora of existing algorithms for OOD generalization.
Paperid:3315
Authors:Jaemoo Choi · Jaewoong Choi · Dohyun Kwon
Abstract:
We address the convergence problem in learning the Optimal Transport (OT) map, where the OT map refers to a map from one distribution to another while minimizing the transport cost. Semi-dual Neural OT, a widely used approach for learning OT maps with neural networks, often generates fake solutions that fail to transfer one distribution to another accurately. We identify a sufficient condition under which the max-min solution of Semi-dual Neural OT recovers the true OT map. Moreover, to address cases when this sufficient condition is not satisfied, we propose a novel method, OTP, which learns both the OT map and the Optimal Transport Plan, representing the optimal coupling between two distributions. Under sharp assumptions on the distributions, we prove that our model eliminates the fake solution issue and correctly solves the OT problem. Our experiments show that the OTP model recovers the optimal transport map where existing methods fail and outperforms current OT-based models in image-to-image translation tasks. Notably, the OTP model can learn stochastic transport maps when deterministic OT maps do not exist, such as one-to-many tasks like colorization.
Paperid:3316
Authors:Kuo Yang · Zhengyang Zhou · Qihe Huang · Wenjie Du · Limin Li · Wu Jiang · Yang Wang
Abstract:
The out-of-distribution (OOD) generalization challenge is a longstanding problem in graph learning. Through studying the fundamental cause of data distribution shift, i.e., the changes of environments, significant progress has been achieved in addressing this issue. However, we observe that existing works still fail to effectively address complex environment shifts. Existing practices place excessive attention on extracting causal subgraphs, inevitably treating spurious subgraphs as environment variables. While spurious subgraphs are controlled by environments, the space of environment changes encompasses more than the scale of spurious subgraphs. Therefore, existing efforts have a limited inference space for environments, leading to failure under severe environment changes. To tackle this issue, we propose a negative inference graph OOD framework (NeGo) to broaden the inference space for environment factors. Inspired by the successful practice of prompt learning in capturing underlying semantics and causal associations in large language models, we design a negative prompt environment inference to extract underlying environment information. We further introduce the environment-enhanced invariant subgraph learning to effectively exploit inferred environment embedding, ensuring the robust extraction of causal subgraphs under environment shifts. Lastly, we conduct a comprehensive evaluation of NeGo on real-world datasets and synthetic datasets across domains. NeGo outperforms baselines on nearly all datasets, which verifies the effectiveness of our framework.
Paperid:3317
Authors:Yihao Yang · Wenke Huang · Guancheng Wan · Bin Yang · Mang Ye
Abstract:
Federated Parameter-Efficient Fine-Tuning aims to adapt Vision-Language Models for downstream tasks in distributed environments. However, data heterogeneity across participants hinders collaborative effectiveness, necessitating personalized adaptation to cover distinct data distributions. Current personalized methods suffer from two limitations. 1) Textual Property Loss: Existing methods facilitate the collaboration between decoupled prompts at the feature level, which potentially undermines the textual properties of the prompts. 2) Visual Feature Diversity: The diversity of visual features makes it challenging to leverage naive image features directly for image-text alignment in downstream tasks. In this work, we propose Federated Disentangled Tuning with Textual Prior Decoupling and Visual Dynamic Adaptation (FedDDA) to overcome the above limitations. Specifically, we encourage decoupling prompts in a way that maximizes the efficacy of prior knowledge, which is essential for maintaining a coherent linguistic context. Furthermore, we design a visual adaption model to reshape visual space to optimally align with the textual space. Extensive experiments on various image classification tasks show the effectiveness of our work in addressing data heterogeneity.
Paperid:3318
Authors:Jungtaek Kim
Abstract:
Bayesian optimization has attracted huge attention from diverse research areas in science and engineering, since it is capable of efficiently finding a global optimum of an expensive-to-evaluate black-box function. In general, a probabilistic regression model is widely used as a surrogate function to model an explicit distribution over function evaluations given an input to estimate and a training dataset. Beyond the probabilistic regression-based methods, density ratio estimation-based Bayesian optimization has been suggested in order to estimate a density ratio of the groups relatively close and relatively far to a global optimum. Developing this line of research further, supervised classifiers are employed to estimate a class probability for the two groups instead of a density ratio. However, the supervised classifiers used in this strategy are prone to be overconfident for known knowledge on global solution candidates. Supposing that we have access to unlabeled points, e.g., predefined fixed-size pools, we propose density ratio estimation-based Bayesian optimization with semi-supervised learning to solve this challenge. Finally, we show the empirical results of our methods and several baseline methods in two distinct scenarios with unlabeled point sampling and a fixed-size pool and analyze the validity of our methods in diverse experiments.
Paperid:3319
Authors:Like Jian · Dong Liu
Abstract:
Federated learning (FL) enables decentralized clients to train a model collaboratively without sharing local data. A key distinction between FL and centralized learning is that clients' data are not independent and identically distributed, which poses significant challenges in training a global model that generalizes well across heterogeneous local data distributions. In this paper, we analyze the convergence of overparameterized FedAvg with gradient descent (GD). We prove that the impact of data heterogeneity diminishes as the width of neural networks increases, ultimately vanishing when the width approaches infinity. In the infinite-width regime, we further prove that both the global and local models in FedAvg behave as linear models, and that FedAvg achieves the same generalization performance as centralized learning with the same number of GD iterations. Extensive experiments validate our theoretical findings across various network architectures, loss functions, and optimization methods.
Paperid:3320
Authors:Haochen Zhang · Zhong Zheng · Lingzhou Xue
Abstract:
We present the first gap-dependent analysis of regret and communication cost for on-policy federated $Q$-Learning in tabular episodic finite-horizon Markov decision processes (MDPs). Existing FRL methods focus on worst-case scenarios, leading to $\sqrt{T}$-type regret bounds and communication cost bounds with a $\log T$ term scaling with the number of agents $M$, states $S$, and actions $A$, where $T$ is the average total number of steps per agent. In contrast, our novel framework leverages the benign structures of MDPs, such as a strictly positive suboptimality gap, to achieve a $\log T$-type regret bound and a refined communication cost bound that disentangles exploration and exploitation. Our gap-dependent regret bound reveals a distinct multi-agent speedup pattern, and our gap-dependent communication cost bound removes the dependence on $MSA$ from the $\log T$ term. Notably, our gap-dependent communication cost bound also yields a better global switching cost when $M=1$, removing $SA$ from the $\log T$ term.
Paperid:3321
Authors:Hongru Hu · Shuwen Zhang · Yongin Choi · Venkat Malladi · Gerald Quon
Abstract:
Single-cell RNA sequencing (scRNA-seq) enables high-resolution exploration of cellular diversity and gene regulation, yet analyzing such data remains challenging due to technical and methodological limitations. Existing task-specific deep generative models like Variational Auto-Encoder (VAE) and its variants struggle to incorporate external biological knowledge, while transformer-based foundational large Language Models (LLMs or large LaMs) face limitations in computational cost and applicability to tabular gene expression data. Here, we introduce sciLaMA (single-cell interpretable Language Model Adapter), a novel representation learning framework that bridges these gaps by integrating static gene embeddings from multimodal LaMs with scRNA-seq tabular data through a paired-VAE architecture. Our approach generates context-aware representations for both cells and genes and outperforms state-of-the-art methods in key single-cell downstream tasks, including batch effect correction, cell clustering, and cell-state-specific gene marker and module identification, while maintaining computational efficiency. sciLaMA offers a computationally efficient, unified framework for comprehensive single-cell data analysis and biologically interpretable gene module discovery.
Paperid:3322
Authors:Omer Ronen · Ahmed Imtiaz Humayun · Richard Baraniuk · Randall Balestriero · Bin Yu
Abstract:
We develop Latent Exploration Score (LES) to mitigate over-exploration in Latent Space Optimization (LSO), a popular method for solving black-box discrete optimization problems. LSO utilizes continuous optimization within the latent space of a Variational Autoencoder (VAE) and is known to be susceptible to over-exploration, which manifests in unrealistic solutions that reduce its practicality. LES leverages the trained decoder’s approximation of the data distribution, and can be employed with any VAE decoder, including pretrained ones, without additional training, architectural changes, or access to the training data. Our evaluation across five LSO benchmark tasks and twenty-two VAE models demonstrates that LES always enhances the quality of the solutions while maintaining high objective values, leading to improvements over existing solutions in most cases. We believe that LES’ ability to identify out-of-distribution areas, together with its differentiability and computational tractability, will open new avenues for LSO.
Paperid:3323
Authors:Rahul Choudhary · Hanbaek Lyu
Abstract:
The classical static Schrödinger Bridge (SSB) problem, which seeks the most likely stochastic evolution between two marginal probability measures, has been studied extensively in the optimal transport and statistical physics communities, and more recently in machine learning communities in the surge of generative models. The standard approach to solve SSB is to first identify its Kantorovich dual and use Sinkhorn's algorithm to find the optimal potential functions. While the original SSB is only a strictly convex minimization problem, this approach is known to guarantee linear convergence under mild assumptions. In this work, we consider a generalized SSB allowing any strictly increasing divergence functional, far generalizing the entropy functional $x\log x$ in the standard SSB. This problem naturally arises in a wide range of seemingly unrelated problems in entropic optimal transport, random graphs/matrices, and combinatorics. We establish Kantorovich duality and linear convergence of Sinkhorn's algorithm for the generalized SSB problem under mild conditions. Our results provide a new rigorous foundation for understanding Sinkhorn-type iterative methods in the context of large-scale generalized Schrödinger bridges.
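For orientation, the classical Sinkhorn iteration for the standard entropic case with divergence $x\log x$ can be sketched as below; the generalized divergences studied in the paper require correspondingly modified updates.

```python
import numpy as np

def sinkhorn(mu, nu, C, eps=0.5, iters=300):
    """Alternate scaling updates for the standard entropic SSB / entropic OT."""
    K = np.exp(-C / eps)                # Gibbs kernel from the cost matrix
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)              # match column marginals
        u = mu / (K @ v)                # match row marginals
    return u[:, None] * K * v[None, :]  # coupling / transport plan

rng = np.random.default_rng(0)
mu = np.full(5, 0.2)                    # uniform marginals on 5 points
nu = np.full(5, 0.2)
C = rng.random((5, 5))                  # random cost matrix
P = sinkhorn(mu, nu, C)
```

After convergence the returned coupling has (approximately) the prescribed row and column marginals.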
Paperid:3324
Authors:Yixin Zha · Chuxin Wang · Wenfei Yang · Tianzhu Zhang · Feng Wu
Abstract:
A series of pre-trained models have demonstrated promising results in point cloud understanding tasks and are widely applied to downstream tasks through fine-tuning. However, full fine-tuning leads to the forgetting of pre-trained knowledge and substantial storage costs on edge devices. To address these issues, Parameter-Efficient Transfer Learning (PETL) methods have been proposed. According to our analysis, we find that existing 3D PETL methods cannot adequately align with semantic relationships of features required by downstream tasks, resulting in suboptimal performance. To ensure parameter efficiency while introducing rich semantic cues, we propose a novel fine-tuning paradigm for 3D pre-trained models. We utilize frozen 2D pre-trained models to provide vision semantic prompts and design a new Hybrid Attention Adapter to efficiently fuse 2D semantic cues into 3D representations with minimal trainable parameters (1.8M). Extensive experiments conducted on datasets including ScanObjectNN, ModelNet40, and ShapeNetPart demonstrate the effectiveness of our proposed paradigm. In particular, our method achieves 95.6% accuracy on ModelNet40 and attains 90.09% performance on the most challenging classification split ScanObjectNN (PB-T50-RS).
Paperid:3325
Authors:Shizhan Gong · Yankai Jiang · DOU QI · Farzan Farnia
Abstract:
Vision-language models, such as CLIP, have achieved significant success in aligning visual and textual representations, becoming essential components of many multi-modal large language models (MLLMs) like LLaVA and OpenFlamingo. However, numerous studies have identified CLIP's limited fine-grained perception as a critical drawback, leading to substantial failures in downstream MLLMs. In contrast, vision-centric foundation models like DINOv2 demonstrate remarkable capabilities in capturing fine details from images. In this work, we propose a novel kernel-based method to align CLIP's visual representation with that of DINOv2, ensuring that the resulting embeddings maintain compatibility with text embeddings while enhancing perceptual capabilities. Our alignment objective is designed for efficient stochastic optimization. Following this image-only alignment fine-tuning, the visual encoder retains compatibility with the frozen text encoder and exhibits significant improvements in zero-shot object recognition, fine-grained spatial reasoning, and localization. By integrating the aligned visual encoder, downstream MLLMs also demonstrate enhanced performance. The code and models will be made publicly available.
Paperid:3326
Authors:Zhixuan Chen · Xing Hu · Dawei Yang · Zukang Xu · JiangyongYu · XUCHEN · Zhihang Yuan · Sifan Zhou
Abstract:
Mixture-of-Experts (MoE) large language models (LLMs), which leverage dynamic routing and sparse activation to enhance efficiency and scalability, have achieved higher performance while reducing computational costs. However, these models face significant memory overheads, limiting their practical deployment and broader adoption. Post-training quantization (PTQ), a widely used method for compressing LLMs, encounters severe accuracy degradation and diminished generalization performance when applied to MoE models. This paper investigates the impact of MoE’s sparse and dynamic characteristics on quantization and identifies two primary challenges: (1) Inter-expert imbalance, referring to the uneven distribution of samples across experts, which leads to insufficient and biased calibration for less frequently utilized experts; (2) Intra-expert imbalance, arising from MoE's unique aggregation mechanism, which leads to varying degrees of correlation between different samples and their assigned experts. To address these challenges, we propose MoEQuant, a novel quantization framework tailored for MoE LLMs. MoEQuant includes two novel techniques: 1) Expert-Balanced Self-Sampling (EBSS) is an efficient sampling method that efficiently constructs a calibration set with balanced expert distributions by leveraging the cumulative probabilities of tokens and expert balance metrics as guiding factors. 2) Affinity-Guided Quantization (AGQ), which incorporates affinities between experts and samples into the quantization process, thereby accurately assessing the impact of individual samples on different experts within the MoE layer. Experiments demonstrate that MoEQuant achieves substantial performance gains (more than 10 points accuracy gain in the HumanEval for DeepSeekMoE-16B under 4-bit quantization) and boosts efficiency.
Paperid:3327
Authors:Linjing You · Jiabao Lu · Xiayuan Huang
Abstract:
Deep neural networks often experience performance drops due to distribution shifts between training and test data. Although domain adaptation offers a solution, privacy concerns restrict access to training data in many real-world scenarios. This restriction has spurred interest in Test-Time Adaptation (TTA), which adapts models using only unlabeled test data. However, current TTA methods still face practical challenges: (1) a primary focus on instance-wise alignment, overlooking CORrelation ALignment (CORAL) due to missing source correlations; (2) complex backpropagation operations for model updating, resulting in computational overhead; and (3) domain forgetting. To address these challenges, we provide a theoretical analysis to investigate the feasibility of Test-time Correlation Alignment (TCA), demonstrating that correlation alignment between high-certainty instances and test instances can enhance test performance with a theoretical guarantee. Based on this, we propose two simple yet effective algorithms: LinearTCA and LinearTCA+. LinearTCA applies a simple linear transformation to achieve both instance and correlation alignment without additional model updates, while LinearTCA+ serves as a plug-and-play module that can easily boost existing TTA methods. Extensive experiments validate our theoretical insights and show that TCA methods significantly outperform baselines across various tasks, benchmarks, and backbones. Notably, LinearTCA improves adaptation accuracy by 5.88% on the OfficeHome dataset, while using only 4% maximum GPU memory usage and 0.6% computation time compared to the best baseline TTA method.
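The correlation-alignment step can be sketched with a classical CORAL-style linear map: whiten the test features, then recolor them with the covariance of high-certainty (anchor) features. This illustrates the principle only; LinearTCA's exact transformation, anchor selection, and regularization are assumptions here.

```python
import numpy as np

def coral_align(Z_test, Z_anchor, reg=1e-5):
    """Linearly map test features so their covariance matches the anchors'."""
    def cov_sqrt(X, inv=False):
        C = np.cov(X, rowvar=False) + reg * np.eye(X.shape[1])
        vals, vecs = np.linalg.eigh(C)
        vals = 1.0 / np.sqrt(vals) if inv else np.sqrt(vals)
        return (vecs * vals) @ vecs.T
    Zc = Z_test - Z_test.mean(axis=0)
    # whiten with the test covariance, recolor with the anchor covariance
    return Zc @ cov_sqrt(Z_test, inv=True) @ cov_sqrt(Z_anchor) + Z_anchor.mean(axis=0)

rng = np.random.default_rng(0)
Z_anchor = rng.standard_normal((200, 3)) @ np.diag([1.0, 2.0, 0.5])
Z_test = rng.standard_normal((300, 3)) @ np.diag([3.0, 0.3, 1.0]) + 5.0
Z_aligned = coral_align(Z_test, Z_anchor)
```

Because the map is a closed-form linear transform, no gradients or model updates are needed, matching the motivation of avoiding backpropagation at test time.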
Paperid:3328
Authors:Yadong Sun · Xiaofeng Cao · Ivor Tsang · Heng Tao Shen
Abstract:
Static systems exhibit diverse structural properties, such as hierarchical, scale-free, and isotropic patterns, where different geometric spaces offer unique advantages. Methods combining multiple geometries have proven effective in capturing these characteristics. However, real-world systems often evolve dynamically, introducing significant challenges in modeling their temporal changes. To overcome this limitation, we propose a unified cross-geometric learning framework for dynamic systems, which synergistically integrates Euclidean and hyperbolic spaces, aligning embedding spaces with structural properties through fine-grained substructure modeling. Our framework further incorporates a temporal state aggregation mechanism and an evolution-driven optimization objective, enabling comprehensive and adaptive modeling of both nodal and relational dynamics over time. Extensive experiments on diverse real-world dynamic graph datasets highlight the superiority of our approach in capturing complex structural evolution, surpassing existing methods across multiple metrics.
Paperid:3329
Authors:Gaotang Li · Yuzhong Chen · Hanghang Tong
Abstract:
Language Models (LMs) often encounter knowledge conflicts when parametric memory contradicts contextual knowledge. Previous works attribute this conflict to the interplay between "memory heads" and "context heads", attention heads assumed to promote either memory or context exclusively. In this study, we go beyond this fundamental assumption by uncovering a critical phenomenon we term the “superposition of contextual information and parametric memory”, where highly influential attention heads could simultaneously contribute to both memory and context. Building upon this insight, we propose Just Run Twice (JUICE), a test-time attention intervention method that steers LMs toward either parametric beliefs or contextual knowledge without requiring fine-tuning. JUICE identifies a set of reliable attention heads and leverages a dual-run approach to mitigate the superposition effects. Extensive experiments across 11 datasets and 6 model architectures demonstrate that JUICE sets the new state-of-the-art performance and robust generalization, achieving significant and consistent improvement across different domains under various conflict types. Finally, we theoretically analyze knowledge conflict and the superposition of contextual information and parametric memory in attention heads, which further elucidates the effectiveness of JUICE in these settings.
Paperid:3330
Authors:Moritz Piening · Florian Beier · Robert Beinert · Gabriele Steidl
Abstract:
We propose a new approach for unsupervised alignment of heterogeneous datasets, which maps data from two different domains without any known correspondences to a common metric space. Our method is based on an unbalanced optimal transport problem with Gromov-Wasserstein marginal penalization. It can be seen as a counterpart to the recently introduced joint multidimensional scaling method. We prove that there exists a minimizer of our functional and that for penalization parameters going to infinity, the corresponding sequence of minimizers converges to a minimizer of the so-called embedded Wasserstein distance. Our model can be reformulated as a quadratic, multi-marginal, unbalanced optimal transport problem, for which a bi-convex relaxation admits a numerical solver via block-coordinate descent. We provide numerical examples for joint embeddings in Euclidean as well as non-Euclidean spaces.
Paperid:3331
Authors:Yuang Zhang · Jiaxi Gu · Li-Wen Wang · Han Wang · JunqiCheng · Yuefeng Zhu · FangYuan Zou
Abstract:
In recent years, while generative AI has advanced significantly in image generation, video generation continues to face challenges in controllability, length, and detail quality, which hinder its application. We present MimicMotion, a framework for generating high-quality human videos of arbitrary length using motion guidance. Our approach has several highlights. Firstly, we introduce confidence-aware pose guidance that ensures high frame quality and temporal smoothness. Secondly, we introduce regional loss amplification based on pose confidence, which reduces image distortion in key regions. Lastly, we propose a progressive latent fusion strategy to generate long and smooth videos. Experiments demonstrate the effectiveness of our approach in producing high-quality human motion videos. Videos and comparisons are available at mimicmotion.github.io.
Paperid:3332
Authors:Xinjie Yao · Yu Wang · Pengfei Zhu · Wanyu LIN · Ruipu Zhao · Zhoupeng Guo · Weihao Li · Qinghua Hu
Abstract:
Traditional machine societies rely on data-driven learning, overlooking interactions and limiting knowledge acquisition from model interplay. To address these issues, we revisit the development of machine societies by drawing inspiration from the evolutionary processes of human societies. Motivated by Social Learning (SL), this paper introduces a practical paradigm of Socialized Coevolution (SC). Compared to most existing methods focused on knowledge distillation and multi-task learning, our work addresses a more challenging problem: not only enhancing the capacity to solve new downstream tasks but also improving the performance of existing tasks through inter-model interactions. Inspired by cognitive science, we propose Dynamic Information Socialized Collaboration (DISC), which achieves SC through interactions between models specialized in different downstream tasks. Specifically, we introduce the dynamic hierarchical collaboration and dynamic selective collaboration modules to enable dynamic and effective interactions among models, allowing them to acquire knowledge from these interactions. Finally, we explore potential future applications of combining SL and SC, discuss open questions, and propose directions for future research, aiming to spark interest in this emerging and exciting interdisciplinary field.
Paperid:3333
Authors:Jingang QU · David Holzmüller · Gael Varoquaux · Marine Le Morvan
Abstract:
The longstanding dominance of gradient-boosted decision trees on tabular data is currently challenged by tabular foundation models using In-Context Learning (ICL): setting the training data as context for the test data and predicting in a single forward pass without parameter updates. While the very recent TabPFNv2 foundation model (2025) excels on tables with up to 10K samples, its alternating column- and row-wise attentions make handling large training sets computationally prohibitive. So, can ICL be effectively scaled and deliver a benefit for larger tables? We introduce TabICL, a tabular foundation model pretrained on datasets with up to 60K samples and handling 500K samples on affordable resources. This is enabled by a novel two-stage architecture: a column-then-row attention mechanism to build fixed-dimensional embeddings of rows, followed by a transformer for efficient ICL. On the TALENT benchmark with 200 datasets, TabICL is on par with TabPFNv2 while being systematically faster (up to 10 times), and significantly outperforms all other approaches. On the 56 datasets with over 10K samples, TabICL surpasses both TabPFNv2 and CatBoost, demonstrating the potential of ICL for large data.
Paperid:3334
Authors:Aaditya Naik · Jason Liu · Claire Wang · Amish Sethi · Saikat Dutta · Mayur Naik · Eric Wong
Abstract:
Neurosymbolic learning enables the integration of symbolic reasoning with deep learning but faces significant challenges in scaling to complex symbolic programs, large datasets, or both. We introduce DOLPHIN, a framework that tackles these challenges by supporting neurosymbolic programs in Python, executing complex symbolic reasoning on the CPU while vectorizing probabilistic computations and gradient propagation on the GPU. Across 13 benchmarks spanning tasks over text, image, and video data, with symbolic reasoning features like recursion and black-box functions, DOLPHIN converges to state-of-the-art accuracies on the more complex benchmarks while existing frameworks such as Scallop, ISED, and IndeCateR+ fail to converge within the time limit. On simpler benchmarks, DOLPHIN matches their performance, while achieving these results 1.71x to 62x faster than the baselines. Overall, DOLPHIN advances the scalability of neurosymbolic frameworks, achieving state-of-the-art efficiency and convergence on difficult benchmarks where existing frameworks struggle.
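The vectorization idea, evaluating probabilistic semantics for a whole batch with array operations rather than per-sample Python loops, can be illustrated with a toy sketch. NumPy stands in for GPU tensors here; this is our own illustration, not DOLPHIN's actual API:

```python
import numpy as np

# Per-sample probabilities that two symbolic predicates hold,
# for a batch of three inputs.
p_a = np.array([0.9, 0.2, 0.6])
p_b = np.array([0.8, 0.5, 0.1])

# Probabilistic conjunction and (inclusive) disjunction, assuming
# independence, computed for the whole batch in one vectorized step.
p_and = p_a * p_b
p_or = p_a + p_b - p_a * p_b

print(p_and)  # -> [0.72 0.1  0.06]
print(p_or)   # -> [0.98 0.6  0.64]
```

The same pattern extends to gradients: because the combination is a differentiable array expression, backpropagation through the probabilistic program comes for free.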
Paperid:3335
Authors:Hyeonah Kim · Minsu Kim · Taeyoung Yun · Sanghyeok Choi · Emmanuel Bengio · Alex Hernandez-Garcia · Jinkyoo Park
Abstract:
Designing biological sequences with desired properties is challenging due to vast search spaces and limited evaluation budgets. Although reinforcement learning methods use proxy models for rapid reward evaluation, insufficient training data can cause proxy misspecification on out-of-distribution inputs. To address this, we propose a novel off-policy search, $\delta$-Conservative Search, that enhances robustness by restricting policy exploration to reliable regions. Starting from high-score offline sequences, we inject noise by randomly masking tokens with probability $\delta$, then denoise them using our policy. We further adapt $\delta$ based on proxy uncertainty on each data point, aligning the level of conservativeness with model confidence. Experimental results show that our conservative search consistently enhances off-policy training, outperforming existing machine learning methods in discovering high-score sequences across diverse tasks, including DNA, RNA, protein, and peptide design.
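The noise-and-denoise loop can be sketched in a few lines. Everything here is a toy stand-in with our own names: `copy_denoiser` simply restores masked tokens from the offline sequence, whereas the paper's policy would generate them from a learned model.

```python
import random

def conservative_search(seq, delta, denoise, rng):
    """One round of delta-Conservative Search (sketch): mask each token
    with probability delta, then let the policy fill in the masked slots."""
    MASK = None
    noised = [MASK if rng.random() < delta else tok for tok in seq]
    return denoise(noised, original=seq)

def copy_denoiser(noised, original):
    # Toy denoiser: restore masked positions from the offline sequence.
    return [o if n is None else n for n, o in zip(noised, original)]

rng = random.Random(0)
seq = list("ACGTACGT")
out = conservative_search(seq, delta=0.5, denoise=copy_denoiser, rng=rng)
print("".join(out))  # -> ACGTACGT (the toy denoiser reconstructs exactly)
```

Adapting $\delta$ per data point, as the abstract describes, would amount to passing a proxy-uncertainty-dependent `delta` for each starting sequence.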
Paperid:3336
Authors:Han Jiang · Xiaoyuan Yi · Zhihua Wei · Ziang Xiao · Shu Wang · Xing Xie
Abstract:
Warning: Contains harmful model outputs. Despite significant advancements, the propensity of Large Language Models (LLMs) to generate harmful and unethical content poses critical challenges. Measuring value alignment of LLMs becomes crucial for their regulation and responsible deployment. Although numerous benchmarks have been constructed to assess social bias, toxicity, and ethical issues in LLMs, these static benchmarks suffer from the evaluation chronoeffect, in which, as models rapidly evolve, existing benchmarks may leak into training data or become saturated, overestimating ever-developing LLMs. To tackle this problem, we propose GETA, a novel generative evolving testing approach based on adaptive testing methods in measurement theory. Unlike traditional adaptive testing methods that rely on a static test item pool, GETA probes the underlying moral boundaries of LLMs by dynamically generating test items tailored to model capability. GETA co-evolves with LLMs by learning a joint distribution of item difficulty and model value conformity, thus effectively addressing the evaluation chronoeffect. We evaluated various popular LLMs with GETA and demonstrated that 1) GETA can dynamically create difficulty-tailored test items and 2) GETA's evaluation results are more consistent with models' performance on unseen OOD and i.i.d. items, laying the groundwork for future evaluation paradigms.
Paperid:3337
Authors:Ali Modarressi · Hanieh Deilamsalehy · Franck Dernoncourt · Trung Bui · Ryan A Rossi · Seunghyun Yoon · Hinrich Schuetze
Abstract:
Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information.
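The minimal-lexical-overlap criterion can be made concrete with a simple Jaccard score over word sets, a rough sketch of the kind of filter the benchmark's design implies, not the authors' actual pipeline; the example pair is invented:

```python
import re

def lexical_overlap(question, needle):
    """Jaccard similarity between the word sets of a question and a needle."""
    q = set(re.findall(r"\w+", question.lower()))
    n = set(re.findall(r"\w+", needle.lower()))
    return len(q & n) / len(q | n)

# A literal-match pair scores high; a latent-association pair (the model
# must know Dresden is in Saxony) shares no content words at all.
literal = lexical_overlap("Which character visited Dresden?",
                          "Yuki visited Dresden in 2013.")
latent = lexical_overlap("Which character has been to Saxony?",
                         "Yuki visited Dresden in 2013.")
print(literal > latent, latent)  # -> True 0.0
```

A needle set built this way forces retrieval to rely on latent associations rather than string matching.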
Paperid:3338
Authors:Jinze Li · Yixing Xu · Haiduo Huang · Xuanwu Yin · Dong Li · Edith Ngai · Emad Barsoum
Abstract:
Speculative decoding (SPD) aims to accelerate the autoregressive token generation process of a target Large Language Model (LLM). Some approaches employ a draft model with multiple heads to predict a sequence of future tokens, where each head handles a token in the sequence. The target LLM verifies the predicted sequence and accepts aligned tokens, enabling efficient multi-token generation. However, existing methods assume that all tokens within a sequence are equally important, employing identical head structures and relying on a single-generation paradigm, either serial or parallel. To address this, we theoretically demonstrate that initial tokens in the draft sequence are more important than later ones. Building on this insight, we propose Gumiho, a hybrid model combining serial and parallel heads. Specifically, given the critical importance of early tokens, we employ a sophisticated Transformer architecture for the early draft heads in a serial configuration to improve accuracy. For later tokens, we utilize multiple lightweight MLP heads operating in parallel to enhance efficiency. By allocating more advanced model structures and longer running times to the early heads, Gumiho achieves improved overall performance. The experimental results demonstrate that our method outperforms existing approaches, fully validating its effectiveness.
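The verification step common to speculative decoding, accepting the longest draft prefix the target model agrees with, can be sketched as follows. This is a generic illustration of SPD verification, not Gumiho's specific head architecture, and the token IDs are arbitrary:

```python
def accept_prefix(draft_tokens, target_tokens):
    """Speculative decoding verification (sketch): the target model accepts
    the longest prefix of the draft that matches its own predictions."""
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        n += 1
    return draft_tokens[:n]

print(accept_prefix([5, 7, 9, 2], [5, 7, 1, 2]))  # -> [5, 7]
```

Because acceptance stops at the first mismatch, errors in early draft tokens waste the entire remaining draft, which is consistent with the abstract's argument for spending more capacity on the early heads.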
Paperid:3339
Authors:Arnab Bhattacharyya · Davin Choo · Philips George John · Themis Gouleakis
Abstract:
We revisit the problem of distribution learning within the framework of learning-augmented algorithms. In this setting, we explore the scenario where a probability distribution is provided as potentially inaccurate advice on the true, unknown distribution. Our objective is to develop learning algorithms whose sample complexity decreases as the quality of the advice improves, thereby surpassing standard learning lower bounds when the advice is sufficiently accurate. Specifically, we demonstrate that this outcome is achievable for the problem of learning a multivariate Gaussian distribution $N(\mu, \Sigma)$ in the PAC learning setting. Classically, in the advice-free setting, $\widetilde{\Theta}(d^2/\varepsilon^2)$ samples are sufficient and worst-case necessary to learn $d$-dimensional Gaussians up to TV distance $\varepsilon$ with constant probability. When we are additionally given a parameter $\widetilde{\Sigma}$ as advice, we show that $\widetilde{\mathcal{O}}(d^{2-\beta}/\varepsilon^2)$ samples suffice whenever $|| \widetilde{\Sigma}^{-1/2} \Sigma \widetilde{\Sigma}^{-1/2} - I_d ||_1 \leq \varepsilon d^{1-\beta}$ (where $||\cdot||_1$ denotes the entrywise $\ell_1$ norm) for any $\beta > 0$, yielding a polynomial improvement over the advice-free setting.
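The advice-quality condition is a concrete matrix inequality and can be checked numerically. The helper below is our own sketch (names are ours): it tests whether a given advice matrix satisfies $||\widetilde{\Sigma}^{-1/2} \Sigma \widetilde{\Sigma}^{-1/2} - I_d||_1 \leq \varepsilon d^{1-\beta}$, the regime in which the improved sample complexity applies.

```python
import numpy as np

def advice_quality_ok(Sigma, Sigma_tilde, eps, beta):
    """Check the entrywise-l1 advice condition from the abstract."""
    d = Sigma.shape[0]
    # Symmetric inverse square root of the advice matrix.
    w, V = np.linalg.eigh(Sigma_tilde)
    S_inv_half = V @ np.diag(w ** -0.5) @ V.T
    M = S_inv_half @ Sigma @ S_inv_half - np.eye(d)
    return np.abs(M).sum() <= eps * d ** (1 - beta)

d = 4
# Exact advice (Sigma_tilde == Sigma) always satisfies the condition.
print(advice_quality_ok(np.eye(d), np.eye(d), eps=0.1, beta=0.5))   # -> True
# Badly scaled advice fails it.
print(advice_quality_ok(3 * np.eye(d), np.eye(d), eps=0.1, beta=0.5))  # -> False
```

Intuitively, the closer the whitened covariance is to the identity, the less work the learner must do, which is what the $d^{2-\beta}$ savings formalizes.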
Paperid:3340
Authors:Lanyun Zhu · Deyi Ji · Tianrun Chen · Haiyang Wu · De Wen Soh · Jun Liu
Abstract:
Referring MLLMs extend conventional multimodal large language models by allowing them to receive referring visual prompts and generate responses tailored to the indicated regions. However, these models often suffer from suboptimal performance due to incorrect responses tailored to misleading areas adjacent to or similar to the target region. This work introduces CPCF, a novel framework to address this issue and achieve superior results. CPCF contrasts outputs generated from the indicated visual prompt with those from contrastive prompts sampled from misleading regions, effectively suppressing the influence of erroneous information outside the target region on response generation. To further enhance the effectiveness and efficiency of our framework, several novel designs are proposed, including a prompt extraction network to automatically identify suitable contrastive prompts, a self-training method that leverages unlabeled data to improve training quality, and a distillation approach to reduce the additional computational overhead associated with contrastive decoding. Incorporating these novel designs, CPCF achieves state-of-the-art performance, as demonstrated by extensive experiments across multiple benchmarks.
Paperid:3341
Authors:Reyhane Askari Hemmat · Mohammad Pezeshki · Elvis Dohmatob · Florian Bordes · Pietro Astolfi · Melissa Hall · Jakob Verbeek · Michal Drozdzal · Adriana Romero-Soriano
Abstract:
Inspired by the principle of deliberate practice in human learning, we propose Deliberate Practice for Synthetic Data Generation (DP), a novel framework that improves sample efficiency through dynamic synthetic data generation. Prior work has shown that scaling synthetic data is inherently challenging, as naively adding new data leads to diminishing returns. To address this, pruning has been identified as a key mechanism for improving scaling, enabling models to focus on the most informative synthetic samples. Rather than generating a large dataset and pruning it afterward, DP efficiently approximates the direct generation of informative samples. We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP generates 3.4$\times$ fewer samples and requires six times fewer iterations, while on ImageNet-1k, it generates 8$\times$ fewer samples with a 30% reduction in iterations, all while achieving superior performance compared to prior work.
Paperid:3342
Authors:Jingwei Li · Jing Xu · Zifan Wang · Huishuai Zhang · Jingzhao Zhang
Abstract:
One explanation for the strong generalization ability of neural networks is implicit bias. Yet, the definition and mechanism of implicit bias in nonlinear contexts remain poorly understood. In this work, we propose to characterize implicit bias by the count of connected regions in the input space with the same predicted label. Compared with parameter-dependent metrics (e.g., norm or normalized margin), region count can be better adapted to nonlinear, overparameterized models, because it is determined by the function mapping and is invariant to reparametrization. Empirically, we find that small region counts align with geometrically simple decision boundaries and correlate well with good generalization performance. We also observe that good hyper-parameter choices such as larger learning rates and smaller batch sizes can induce small region counts. We further establish theoretical connections and explain how larger learning rates can induce small region counts in neural networks.
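For a two-dimensional toy model, region count is straightforward to estimate by flood-filling a grid of predictions. The sketch below (our own construction, not the paper's code) counts 4-connected components of constant predicted label:

```python
import numpy as np

def region_count(predict, lo=-2.0, hi=2.0, n=64):
    """Estimate the number of connected same-label regions of a 2-D
    classifier by flood-filling an n-by-n grid of predictions."""
    xs = np.linspace(lo, hi, n)
    labels = np.array([[predict(x, y) for y in xs] for x in xs])
    seen = np.zeros_like(labels, dtype=bool)
    count = 0
    for i in range(n):
        for j in range(n):
            if seen[i, j]:
                continue
            count += 1  # new region found; flood-fill it
            stack = [(i, j)]
            seen[i, j] = True
            while stack:
                a, b = stack.pop()
                for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    u, v = a + da, b + db
                    if 0 <= u < n and 0 <= v < n and not seen[u, v] \
                            and labels[u, v] == labels[a, b]:
                        seen[u, v] = True
                        stack.append((u, v))
    return count

# A linear decision boundary yields exactly two regions.
print(region_count(lambda x, y: int(x + y > 0)))  # -> 2
```

A more convoluted boundary (e.g. a checkerboard-like predictor) would drive the count up, matching the intuition that small counts indicate geometrically simple classifiers.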
Paperid:3343
Authors:Artur Back de Luca · George Giapitzakis · Shenghao Yang · Petar Veličković · Kimon Fountoulakis
Abstract:
There is a growing interest in the ability of neural networks to execute algorithmic tasks (e.g., arithmetic, summary statistics, and sorting). The goal of this work is to better understand the role of attention in Transformers for algorithmic execution. Its importance for algorithmic execution has been studied theoretically and empirically using parallel computational models. Notably, many parallel algorithms communicate between processors solely using positional information. Inspired by this observation, we investigate how Transformers can execute algorithms using positional attention, where attention weights depend exclusively on positional encodings. We prove that Transformers with positional attention (positional Transformers) maintain the same expressivity as parallel computational models, incurring a logarithmic depth cost relative to the input length. We analyze their in-distribution learnability and explore how parameter norms in positional attention affect sample complexity. Our results show that positional Transformers introduce a learning trade-off: while they exhibit better theoretical dependence on parameter norms, certain tasks may require more layers, which can, in turn, increase sample complexity. Finally, we empirically explore the out-of-distribution performance of positional Transformers and find that they perform well in tasks where their underlying algorithmic solution relies on positional information.
Paperid:3344
Authors:Grigory Bartosh · Dmitry Vetrov · Christian Andersson Naesseth
Abstract:
The Latent Stochastic Differential Equation (SDE) is a powerful tool for time series and sequence modeling. However, training Latent SDEs typically relies on adjoint sensitivity methods, which depend on simulation and backpropagation through approximate SDE solutions, limiting scalability. In this work, we propose SDE Matching, a new simulation-free method for training Latent SDEs. Inspired by modern Score- and Flow Matching algorithms for learning generative dynamics, we extend these ideas to the domain of stochastic dynamics for time series modeling, eliminating the need for costly numerical simulations. Our results demonstrate that SDE Matching achieves performance comparable to adjoint sensitivity methods while drastically reducing computational complexity.
Paperid:3345
Authors:Zishun Yu · Tengyu Xu · Di Jin · Karthik Abinav Sankararaman · Yun He · Wenxuan Zhou · Zhouhao Zeng · Eryk Helenowski · Chen Zhu · Sinong Wang · Hao Ma · Han Fang
Abstract:
Solving mathematics problems has been an intriguing capability of large language models, and many efforts have been made to improve reasoning by extending reasoning length, such as through self-correction and extensive long chain-of-thoughts. While promising in problem-solving, advanced long reasoning chain models exhibit an undesired single-modal behavior, where trivial questions require unnecessarily tedious long chains of thought. In this work, we propose a way to allow models to be aware of inference budgets by formulating reasoning as utility maximization with respect to an inference budget constraint, hence naming our algorithm Inference Budget-Constrained Policy Optimization (IBPO). In a nutshell, models fine-tuned through IBPO learn to ``understand'' the difficulty of queries and allocate inference budgets to harder ones. With different inference budgets, our best models are able to have a $4.14$\% and $5.74$\% absolute improvement ($8.08$\% and $11.2$\% relative improvement) on MATH500 using $2.16$x and $4.32$x inference budgets respectively, relative to LLaMA3.1 8B Instruct. These improvements are approximately $2$x those of self-consistency under the same budgets.
Paperid:3346
Authors:Eric Wang · Zhichao Chen · Haotian Wang · Yanchao Tan · Licheng Pan · Tianqiao Liu · Xu Chen · Haoxuan Li · Zhouchen Lin
Abstract:
Implicit feedback recommendation is challenged by the absence of the negative feedback essential for effective model training. Existing methods often resort to negative sampling, a technique that treats unlabeled interactions as negative samples. This assumption risks misclassifying potential positive samples within the unlabeled data, thereby undermining model performance. To address this issue, we introduce PURL, a model-agnostic framework that reframes implicit feedback recommendation as a weakly supervised learning task, eliminating the need for negative samples. However, its unbiasedness hinges on the accurate estimation of the class prior. To address this challenge, we propose Progressive Proximal Transport (PPT), which estimates the class prior by minimizing the proximal transport cost between positive and unlabeled samples. Experiments on three real-world datasets validate the efficacy of PURL in terms of improved recommendation quality. Code is available at \url{https://anonymous.4open.science/r/PURL-1AF3/}.
Paperid:3347
Authors:Weixuan Liang · Xinwang Liu · KE LIANG · Jiyuan Liu · En Zhu
Abstract:
Inspired by the well-known coreset in clustering algorithms, we introduce the definition of the core kernel for multiple kernel clustering (MKC) algorithms. The core kernel refers to running MKC algorithms on smaller-scale base kernel matrices to obtain kernel weights similar to those derived using the standard base kernel matrices as input. Specifically, the core kernel is a set of kernel matrices of size $\widetilde{\mathcal{O}} (1/\varepsilon^2)$ such that running MKC algorithms on them achieves a $(1+\varepsilon)$-approximation of the kernel weights. Subsequently, we can leverage the approximated kernel weights to obtain a theoretically guaranteed large-scale extension of MKC algorithms. In this paper, we propose a core kernel construction method based on singular value decomposition and prove that it satisfies the definition of the core kernel for three mainstream MKC algorithms. Finally, we conduct experiments on several benchmark datasets to verify the correctness of the theoretical results and the efficiency of the proposed method.
Paperid:3348
Authors:Jaeho Kim · Seulki Lee
Abstract:
Unsupervised domain adaptation (UDA) for time series data remains a critical challenge in deep learning, with traditional pseudo-labeling strategies failing to capture temporal patterns and channel-wise shifts between domains, producing sub-optimal pseudo-labels. As such, we introduce TransPL, a novel approach that addresses these limitations by modeling the joint distribution $P(X,y)$ of the source domain through code transition matrices, where the codes are derived from vector quantization (VQ) of time series patches. Our method constructs class- and channel-wise code transition matrices from the source domain and employs Bayes' rule for target domain adaptation, generating pseudo-labels based on channel-wise weighted class-conditional likelihoods. TransPL offers three key advantages: explicit modeling of temporal transitions and channel-wise shifts between different domains, versatility towards different UDA scenarios (e.g., weakly-supervised UDA), and explainable pseudo-label generation. We validate TransPL's effectiveness through extensive analysis on four time series UDA benchmarks and confirm that it consistently outperforms state-of-the-art pseudo-labeling methods by a strong margin (6.1\% accuracy improvement, 4.9\% F1 improvement), while providing interpretable insights into the domain adaptation process through its learned code transition matrices.
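The core objects, class-wise code transition matrices and class-conditional likelihoods for pseudo-labeling, can be sketched on toy VQ code sequences. This is a bare-bones, single-channel illustration with our own function names and invented data, not the authors' implementation:

```python
import numpy as np

def transition_matrix(code_seqs, n_codes):
    """Row-normalised first-order transition counts of VQ code sequences,
    with add-one smoothing so unseen transitions keep nonzero mass."""
    T = np.ones((n_codes, n_codes))
    for seq in code_seqs:
        for a, b in zip(seq[:-1], seq[1:]):
            T[a, b] += 1
    return T / T.sum(axis=1, keepdims=True)

def class_loglik(seq, T):
    """Log-likelihood of a code sequence under a class's transition matrix."""
    return sum(np.log(T[a, b]) for a, b in zip(seq[:-1], seq[1:]))

# Two toy source classes with distinct code dynamics.
T0 = transition_matrix([[0, 1, 0, 1, 0, 1]], n_codes=2)  # alternating
T1 = transition_matrix([[0, 0, 0, 1, 1, 1]], n_codes=2)  # block-like
target = [0, 1, 0, 1]  # unlabeled target-domain sequence
pseudo = int(class_loglik(target, T1) > class_loglik(target, T0))
print(pseudo)  # -> 0 (alternating codes match class 0's dynamics)
```

The full method additionally weights these likelihoods channel-wise via Bayes' rule; the principle of scoring a target sequence against each class's learned dynamics is the same.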
Paperid:3349
Authors:Naz Sepahvand · Anvith Thudi · Berivan Isik · Ashmita Bhattacharyya · Nicolas Papernot · Eleni Triantafillou · Daniel Roy · Gintare Karolina Dziugaite
Abstract:
We present a principled, per-instance approach to quantifying the difficulty of unlearning via fine-tuning. We begin by sharpening an analysis of noisy gradient descent for unlearning (Chien et al., 2024), obtaining a better utility–unlearning tradeoff by replacing worst-case privacy loss bounds with per-instance privacy losses (Thudi et al., 2024), each of which bounds the (Renyi) divergence to retraining without an individual data point. To demonstrate the practical applicability of our theory, we present empirical results showing that our theoretical predictions are borne out both for Stochastic Gradient Langevin Dynamics (SGLD) as well as for standard fine-tuning without explicit noise. We further demonstrate that per-instance privacy losses correlate well with several existing data difficulty metrics, while also identifying harder groups of data points, and introduce novel evaluation methods based on loss barriers. Altogether, our findings provide a foundation for more efficient and adaptive unlearning strategies tailored to the unique properties of individual data points.
Paperid:3350
Authors:Elizabeth Baker · Jakiw Pidstrigach · Carles Domingo i Enrich · George Deligiannidis · Nikolas Nüsken
Abstract:
In stochastic optimal control and conditional generative modelling, a central computational task is to modify a reference diffusion process to maximise a given terminal-time reward. Most existing methods require this reward to be differentiable, using gradients to steer the diffusion towards favourable outcomes. However, in many practical settings, like diffusion bridges, the reward is singular, taking an infinite value if the target is hit and zero otherwise. We introduce a novel framework, based on Malliavin calculus and path-space integration by parts, that enables the development of methods robust to such singular rewards. We demonstrate that our approach offers stable and reliable training, outperforming existing techniques.
Paperid:3351
Authors:Shixi Qin · Zhiyong Yang · Shilong Bao · Shi Wang · Qianqian Xu · Qingming Huang
Abstract:
This paper focuses on embedding multiple heterogeneous backdoor triggers in bridge-based diffusion models designed for complex and arbitrary input distributions. Existing backdoor formulations mainly address single-attack scenarios and are limited to models initialized with Gaussian noise. To bridge this gap, we propose \texttt{MixBridge}, a novel diffusion Schrödinger bridge (DSB) framework. Specifically, we first demonstrate that backdoor triggers can be injected into \texttt{MixBridge} by directly training with poisoned image pairs. This eliminates the need for the cumbersome modifications to stochastic differential equations required in previous studies. However, we theoretically show that a single DSB is insufficient for heterogeneous attacks, as the model will follow the geometric mean of the benign and multiple backdoored distributions, leading to performance conflict between different tasks. To overcome this, we propose a Divide-and-Merge strategy, where models are independently pre-trained for each specific objective (\texttt{Divide}) and then integrated into a unified model (\texttt{Merge}). On top of this, a Weight Reallocation Scheme (WRS) is further designed to enhance the stealthiness of \texttt{MixBridge}. Empirical studies across diverse generation tasks speak to the efficacy of \texttt{MixBridge}.
Paperid:3352
Authors:Muzhou Yu · Shuyun Lin · Lei Ma · Bo Lei · Kaisheng Ma
Abstract:
Advancements in generative models have promoted text- and image-based multi-context image generation. Brain signals, offering a direct representation of user intent, present new opportunities for image customization. However, it faces challenges in brain interpretation, cross-modal context fusion and retention. In this paper, we propose MindCustomer to explore the blending of visual brain signals in multi-context image generation. We introduce the Image-Brain Translator (IBT) to augment shared data for cross-subject brain embedding. Further, IBT facilitates the alignment from image latent space to brain representation, enabling natural cross-modal context fusion. By incorporating diffusion fine-tuning with brain embedding optimization, our method resolves semantic conflicts effectively, achieving a more harmonious and cohesive context integration. MindCustomer enables cross-subject generation, delivering unified, high-quality, and natural image outputs. Moreover, it exhibits strong generalization for new subjects via few-shot learning, showcasing the potential for practical application. As the first work for brain-blended image customization, MindCustomer lays a foundational exploration and inspiration for future brain-controlled generative technologies.
Paperid:3353
Authors:Yan Zhong · Chenxi Yang · Suyuan Zhao · Tingting Jiang
Abstract:
This paper presents CPL-IQA, a novel semi-supervised blind image quality assessment (BIQA) framework for authentic distortion scenarios. To address the challenge of limited labeled data in the IQA field, our approach leverages confidence-quantifiable pseudo-label learning to effectively utilize unlabeled authentically distorted images. The framework operates through a preprocessing stage and two training phases: first converting MOS labels to vector labels via entropy minimization, followed by an iterative process that alternates between model training and label optimization. The key innovations of CPL-IQA include a manifold assumption-based label optimization strategy and a confidence learning method for pseudo-labels, which enhance reliability and mitigate outlier effects. Experimental results demonstrate the framework's superior performance on real-world distorted image datasets, offering a more standardized semi-supervised learning paradigm without requiring additional supervision or network complexity.
Paperid:3354
Authors:Cameron Jakub · Mihai Nica
Abstract:
Despite remarkable performance on a variety of tasks, many properties of deep neural networks are not yet theoretically understood. One such mystery is the depth degeneracy phenomenon: the deeper you make your network, the closer your network is to a constant function on initialization. In this paper, we examine the evolution of the angle between two inputs to a ReLU neural network as a function of the number of layers. By using combinatorial expansions, we find precise formulas for how fast this angle goes to zero as depth increases. These formulas capture microscopic fluctuations that are not visible in the popular framework of infinite width limits, and lead to qualitatively different predictions. We validate our theoretical results with Monte Carlo experiments and show that our results accurately approximate finite network behaviour. We also empirically investigate how the depth degeneracy phenomenon can negatively impact training of real networks. The formulas are given in terms of the mixed moments of correlated Gaussians passed through the ReLU function. We also find a surprising combinatorial connection between these mixed moments and the Bessel numbers that allows us to explicitly evaluate these moments.
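The phenomenon is easy to reproduce with a quick Monte Carlo run: track the angle between two fixed inputs as they pass through successive random ReLU layers (He-scaled Gaussian weights). This is a generic illustration of depth degeneracy, not the paper's exact experimental setup:

```python
import numpy as np

def angle(u, v):
    """Angle in radians between two vectors."""
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

rng = np.random.default_rng(0)
width, depth = 512, 20
x = rng.standard_normal(width)
y = rng.standard_normal(width)
angles = [angle(x, y)]
for _ in range(depth):
    # He-scaled random weights keep activation magnitudes stable,
    # isolating the angle dynamics from trivial blow-up/decay.
    W = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
    x, y = np.maximum(W @ x, 0), np.maximum(W @ y, 0)
    angles.append(angle(x, y))
print(f"initial angle {angles[0]:.3f} rad, after {depth} layers {angles[-1]:.3f} rad")
```

Two nearly orthogonal inputs start at roughly π/2 and the angle shrinks layer by layer, so with enough depth all inputs map to nearly the same direction: the network looks constant at initialization.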
Paperid:3355
Authors:Chunlok Lo · Kevin Roice · Parham Mohammad Panahi · Scott Jordan · Adam White · Gabor Mihucz · Farzane Aminmansour · Martha White
Abstract:
This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated many steps. In this paper, we avoid this limitation by constraining background planning to a given set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning, and avoids learning the transition dynamics entirely. We show that our GSP algorithm can propagate value from an abstract space in a manner that helps a variety of base learners learn significantly faster in different domains.
Paperid:3356
Authors:Andrew Patterson · Samuel F Neumann · Martha White · Adam White
Abstract:
Empirical design in reinforcement learning is no small task. Running good experiments requires attention to detail and at times significant computational resources. While compute resources available per dollar have continued to grow rapidly, so has the scale of typical experiments in reinforcement learning. It is now common to benchmark agents with millions of parameters against dozens of tasks, each using the equivalent of 30 days of experience. The scale of these experiments often conflicts with the need for statistical evidence, especially when comparing algorithms. Recent studies have highlighted how popular algorithms are sensitive to hyperparameter settings and implementation details, and that common empirical practice leads to weak statistical evidence (Machado et al., 2018; Henderson et al., 2018). This manuscript represents both a call to action and a comprehensive resource for how to do good experiments in reinforcement learning. In particular, we cover: the statistical assumptions underlying common performance measures, how to properly characterize performance variation and stability, hypothesis testing, special considerations for comparing multiple agents, baseline and illustrative example construction, and how to deal with hyperparameters and experimenter bias. Throughout, we highlight common mistakes found in the literature and their statistical consequences in example experiments. The objective of this document is to provide answers on how we can use our unprecedented compute to do good science in reinforcement learning, as well as stay alert to potential pitfalls in our empirical design.
Paperid:3357
Authors:Hengrui Luo · Jeremy E. Purvis · Didong Li
Abstract:
Modern datasets often exhibit high dimensionality, yet the data reside in low-dimensional manifolds that can reveal underlying geometric structures critical for data analysis. A prime example of such a dataset is a collection of cell cycle measurements, where the inherently cyclical nature of the process can be represented as a circle or sphere. Motivated by the need to analyze these types of datasets, we propose a nonlinear dimension reduction method, Spherical Rotation Component Analysis (SRCA), that incorporates geometric information to better approximate low-dimensional manifolds. SRCA is a versatile method designed to work in both high-dimensional and small sample size settings. By employing spheres or ellipsoids, SRCA provides a low-rank spherical representation of the data with general theoretical guarantees, effectively retaining the geometric structure of the dataset during dimensionality reduction. A comprehensive simulation study, along with a successful application to human cell cycle data, further highlights the advantages of SRCA compared to state-of-the-art alternatives, demonstrating its superior performance in approximating the manifold while preserving inherent geometric structures.
Paperid:3358
Authors:Constantin Philippenko · Aymeric Dieuleveut
Abstract:
In this paper, we investigate the impact of compression on stochastic gradient algorithms for machine learning, a technique widely used in distributed and federated learning. We underline differences in terms of convergence rates between several unbiased compression operators that all satisfy the same condition on their variance, thus going beyond the classical worst-case analysis. To do so, we focus on the case of least-squares regression (LSR) and analyze a general stochastic approximation algorithm for minimizing quadratic functions relying on a random field. We consider weak assumptions on the random field, tailored to the analysis (specifically, expected Hölder regularity), and on the noise covariance, enabling the analysis of various randomizing mechanisms, including compression. We then extend our results to the case of federated learning. More formally, we highlight the impact on the convergence of the covariance $\mathfrak{C}_{\mathrm{ania}}$ of the additive noise induced by the algorithm. We demonstrate that, despite the non-regularity of the stochastic field, the limit variance term scales with $\mathrm{Tr}(\mathfrak{C}_{\mathrm{ania}} H^{-1})/K$ (where $H$ is the Hessian of the optimization problem and $K$ the number of iterations), generalizing the rate for the vanilla LSR case where it is $\sigma^2 \mathrm{Tr}(H H^{-1}) / K = \sigma^2 d / K$ (Bach and Moulines, 2013). Then, we analyze the dependency of $\mathfrak{C}_{\mathrm{ania}}$ on the compression strategy and ultimately its impact on convergence, first in the centralized case, then in two heterogeneous FL frameworks.
Paperid:3359
Authors:Chen Liu · Zhichao Huang · Mathieu Salzmann · Tong Zhang · Sabine Süsstrunk
Abstract:
Adversarial training is a popular method to robustify models against adversarial attacks. However, it exhibits much more severe overfitting than training on clean inputs. In this work, we investigate this phenomenon from the perspective of training instances, i.e., training input-target pairs. Based on a quantitative metric measuring the relative difficulty of an instance in the training set, we analyze the model's behavior on training instances of different difficulty levels. This lets us demonstrate that the decay in generalization performance of adversarial training is a result of fitting hard adversarial instances. We theoretically verify our observations for both linear and general nonlinear models, proving that models trained on hard instances have worse generalization performance than ones trained on easy instances, and that this generalization gap increases with the size of the adversarial budget. Finally, we investigate solutions to mitigate adversarial overfitting in several scenarios, including fast adversarial training and fine-tuning a pretrained model with additional data. Our results demonstrate that using training data adaptively improves the model's robustness.
Paperid:3360
Authors:Mark Rowland · Remi Munos · Mohammad Gheshlaghi Azar · Yunhao Tang · Georg Ostrovski · Anna Harutyunyan · Karl Tuyls · Marc G. Bellemare · Will Dabney
Abstract:
We analyse quantile temporal-difference learning (QTD), a distributional reinforcement learning algorithm that has proven to be a key component in several successful large-scale applications of reinforcement learning. Despite these empirical successes, a theoretical understanding of QTD has proven elusive until now. Unlike classical TD learning, which can be analysed with standard stochastic approximation tools, QTD updates do not approximate contraction mappings, are highly non-linear, and may have multiple fixed points. The core result of this paper is a proof of convergence to the fixed points of a related family of dynamic programming procedures with probability 1, putting QTD on firm theoretical footing. The proof establishes connections between QTD and non-linear differential inclusions through stochastic approximation theory and non-smooth analysis.
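The non-smooth update at the heart of QTD can be illustrated on a static distribution. The sketch below is a toy illustration only, not the paper's algorithm or analysis: it runs the plain quantile-regression stochastic update, whose indicator term is exactly what prevents the contraction-mapping arguments available for classical TD.

```python
import random

def quantile_sgd(samples, taus, alpha=0.005):
    """Stochastic quantile-regression update: each estimate theta_i moves up
    by alpha*tau_i when the sample exceeds it, and down by alpha*(1 - tau_i)
    otherwise. The indicator makes the update non-smooth."""
    thetas = [0.0] * len(taus)
    for g in samples:
        for i, tau in enumerate(taus):
            thetas[i] += alpha * (tau - (1.0 if g < thetas[i] else 0.0))
    return thetas
```

On i.i.d. Uniform(0, 1) samples, each estimate drifts toward the corresponding quantile of the sampling distribution.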
Paperid:3361
Authors:Amir R. Asadi
Abstract:
Machine learning, the predominant approach in the field of artificial intelligence, enables computers to learn from data and experience. In the supervised learning framework, accurate and efficient learning of dependencies between data instances and their corresponding labels requires auxiliary information about the data distribution and the target function. This central concept aligns with the notion of regularization in statistical learning theory. Real-world datasets are often characterized by multiscale data instance distributions and well-behaved, smooth target functions. Scale-invariant probability distributions, such as power-law distributions, provide notable examples of multiscale data instance distributions in various contexts. This paper introduces a hierarchical learning model that leverages such a multiscale data structure with a multiscale entropy-based training procedure and explores its statistical and computational advantages. The hierarchical learning model is inspired by the logical progression in human learning from easy to complex tasks and features interpretable levels. In this model, the logarithm of any data instance’s norm can be construed as the data instance's complexity, and the allocation of computational resources is tailored to this complexity, resulting in benefits such as increased inference speed. Furthermore, our multiscale analysis of the statistical risk yields stronger guarantees compared to conventional uniform convergence bounds.
Paperid:3362
Authors:
Abstract:
Paperid:3363
Authors:Victor Castillo · Fernando Soto · Jesús Cabello · Alfonso Gomez-Espinosa · Jose Antonio Cantoral-Ceballos
Abstract:
We present a novel real-time system for detecting and counting Fire Blight (FB) occurrences in apple orchards using YOLOv10 and YOLO11. Unlike previous studies focused on static images, our system operates directly on live video captured from a moving tractor. A GoPro Hero 12 Black mounted on the vehicle streams video to an onboard computer, which performs real-time processing using OpenCV and Python. With a dataset of 2,726 annotated images, YOLO11 outperformed YOLOv10 in detection accuracy. The proposed system enables efficient, on-the-go disease monitoring and supports precision agriculture through automated logging and analysis.
Paperid:3364
Authors:Hugo Massaroli · Mauricio Farez · Hernan Chaves · Emmanuel Iarussi · Viviana Siless
Abstract:
Deep learning models have shown strong performance in diagnosing Alzheimer’s disease (AD) using neuroimaging data, particularly 18F-FDG PET scans, with training datasets largely composed of North American cohorts such as those in the Alzheimer’s Disease Neuroimaging Initiative (ADNI). However, their generalization to underrepresented populations remains underexplored. In this study, we benchmark convolutional and Transformer-based models on the ADNI dataset and assess their generalization performance on a novel Latin American clinical cohort from the FLENI Institute in Buenos Aires, Argentina. We show that while all models achieve high AUCs on ADNI (up to .96 and .97), their performance drops substantially on FLENI (down to .82 and .80, respectively), revealing a significant domain shift. The tested architectures demonstrated similar performance, calling into question the supposed advantages of transformers for this specific task. Through ablation studies, we identify per-image normalization and correct sample selection as key factors for generalization. Occlusion sensitivity analysis further reveals that models trained on ADNI generally attend to canonical hypometabolic regions for the AD class, but focus becomes unclear for the other classes and for FLENI scans. These findings highlight the need for population-aware validation of diagnostic AI models and motivate future work on domain adaptation and cohort diversification.
Paperid:3365
Authors:Luiz Facury de Souza · Jose Fernandes · Pedro Dutenhefner · Turi Rezende · Gisele Pappa · Gabriela Paixão · Antonio Ribeiro · Wagner Jr.
Abstract:
Clinical trust in classification models remains a critical barrier to deploying automated ECG analysis, despite the transformative potential of deep learning. Although recent models have achieved significant progress in classifying arrhythmias, their reliance on opaque "black-box" reasoning limits practical utility: clinicians often dismiss classic explainability methods such as attention maps as unconvincing evidence for diagnoses. To address this, we present a multimodal system that generates human-understandable explanations by aligning ECG features with textual diagnostic criteria from expert annotations. For example, the Left Bundle Branch Block (LBBB) abnormality is classified not just as a label but through explicit links to descriptors such as "widened QRS complex (>120 ms)". Trained on paired ECG-text data, our approach performs zero-shot classification, matching supervised performance using a Vision Transformer (ViT) without task-specific fine-tuning. By extracting sub-features directly from clinical narratives, such as rhythm irregularities or morphological anomalies, the model grounds its predictions in domain-specific knowledge, mirroring clinician reasoning. This fusion of interpretable feature mapping and robust performance advances a paradigm where AI-driven diagnostics complement, rather than conflict with, the need for transparency in medical decision-making.
Paperid:3366
Authors:Walter Mayor · Johan Obando Ceron · Aaron Courville · Pablo Samuel Castro
Abstract:
The use of parallel actors for data collection has been an effective technique used in reinforcement learning (RL) algorithms. The manner in which data is collected in these algorithms, controlled via the number of parallel environments and the rollout length, induces a form of bias-variance trade-off; the number of training passes over the collected data, on the other hand, must strike a balance between sample efficiency and overfitting. We conduct an empirical analysis of these trade-offs on PPO, one of the most popular RL algorithms that uses parallel actors, and establish connections to network plasticity and, more generally, optimization stability. We examine its impact on network architectures, as well as the hyperparameter sensitivity when scaling data. Our analyses indicate that larger dataset sizes can increase final performance across a variety of settings, and that scaling parallel environments is more effective than increasing rollout lengths. These findings highlight the critical role of data collection strategies in improving agent performance.
Paperid:3367
Authors:Brigitte Mora Reyes · Jennifer Drewyor
Abstract:
Artificial intelligence (AI) systems often reflect biases from economically advanced regions, marginalizing contexts in economically developing regions like Latin America due to imbalanced datasets. This paper examines AI representations of diverse Latin American contexts, revealing disparities between data from economically advanced and developing regions. We highlight how the dominance of English over Spanish, Portuguese, and indigenous languages such as Quechua and Nahuatl perpetuates biases, framing Latin American perspectives through a Western lens. To address this, we introduce a culturally aware dataset rooted in Latin American history and sociopolitical contexts, challenging Eurocentric models. We evaluate six language models on questions testing cultural context awareness, using a novel Cultural Expressiveness metric, statistical tests, and linguistic analyses. Our findings show that some models better capture Latin American perspectives, while others exhibit significant sentiment misalignment (p < 0.001). Fine-tuning Mistral-7B with our dataset improves its cultural expressiveness by 42.9%, advancing equitable AI development. We advocate for equitable AI by prioritizing datasets that reflect Latin American history, indigenous knowledge, and diverse languages, while emphasizing community-centered approaches to amplify marginalized voices. Code and datasets will be available in a public repository.
Paperid:3368
Authors:Maynara de Souza · Cleber Zanchettin
Abstract:
Data scarcity and quality issues remain significant barriers to developing robust predictive models in medical research. Traditional reliance on real-world data often leads to biased models with poor generalizability across diverse patient populations. Synthetic data generation has emerged as a promising solution, yet challenges related to these samples' representativeness and effective utilization persist. This paper introduces Profile2Gen, a novel data-centric framework designed to guide the generation and refinement of synthetic data, focusing on addressing hard-to-learn samples in regression tasks. We conducted approximately 18,000 experiments to validate its effectiveness across six medical datasets, utilizing seven state-of-the-art generative models. Results demonstrate that refined synthetic samples can reduce predictive errors and enhance model reliability. Additionally, we generalize the DataIQ framework to support regression tasks, enabling its application in broader contexts. Statistical analyses confirm that our approach achieves equal or superior performance compared to models trained exclusively on real data.
Paperid:3369
Authors:Joao Dantas
Abstract:
We present an offline language-based support tool to assist tactical decision-making and training in air combat operations. The system applies retrieval-augmented generation (RAG) to answer natural language questions using content extracted from doctrinal and tactical manuals. Built entirely with open-source components, it performs text preprocessing, embedding generation, and semantic search locally, ensuring full offline functionality. We demonstrate its current capabilities through use cases involving air combat manuals and discuss potential applications in onboard mission systems, flight simulators, and training environments. The source code is modular and intended for public release to support future extensions and integration.
Paperid:3370
Authors:Alejandro Mildiner · Michael Moreno · Viviana Siless
Abstract:
Access to formal credit remains a significant barrier for microentrepreneurs in Latin America. This work introduces a novel credit scoring methodology that integrates image-based cluster features extracted via Vision Transformer (ViT) models into a tabular classification pipeline. Using images submitted by loan applicants, we generate high-dimensional embeddings with ViT (Dosovitskiy et al., 2021), CLIP (Radford et al., 2021), and I-JEPA (Assran et al., 2023), and apply unsupervised clustering to discover latent visual patterns correlated with creditworthiness. Each applicant is then encoded based on their proximity to learned clusters, yielding categorical features that represent visual similarity to known payer profiles. These features are incorporated into XGBoost models alongside financial and demographic data. Our results show that visual-cluster-based features improve predictive performance: they outperform a baseline model utilizing traditional indicators from credit bureaus and alternative data (AUC .79), reaching AUC .843 in the case of I-JEPA. This approach demonstrates how computer vision can provide interpretable, transferable insights from visual content, offering a new pathway toward fairer, more inclusive credit evaluation in underserved economies (Salcedo-Perez & Patino, 2018).
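The cluster-proximity encoding described above can be sketched in a few lines. This is an illustrative toy, not the authors' pipeline: a tiny hand-rolled k-means (with a deterministic spread-out initialization chosen for self-containment) learns cluster centers from embeddings, and each applicant is then encoded by the index of the nearest center, a categorical feature a tabular model such as XGBoost can consume.

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Tiny k-means for the sketch; initialization takes k evenly spaced rows."""
    centers = X[:: max(1, len(X) // k)][:k].astype(float)
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute the means.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def cluster_feature(embeddings, centers):
    """Categorical feature: index of the nearest learned cluster per applicant."""
    return np.argmin(((embeddings[:, None] - centers[None]) ** 2).sum(-1), axis=1)
```

In the paper's setting the embeddings would come from ViT, CLIP, or I-JEPA rather than synthetic data.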
Paperid:3371
Authors:Joaquín H. Monti · Gastón Schlotthauer · Marcelo Colominas
Abstract:
Non-invasive fetal electrocardiogram (NI-FECG) is used to monitor fetal heart health. Single-channel FECG extraction is a critical challenge for the development of long-term monitoring devices. Many approaches based on deep learning (DL) have been proposed, but they impose prohibitive computational and memory requirements that limit the development of embedded devices. In this work, we propose a two-stage framework based on denoising autoencoder (DAE) models. Our approach sequentially isolates maternal and fetal components from abdominal signals. The models were trained and tested on simulated data and evaluated on real recordings. They achieved a median F1-score of 1.00 and 0.93 in QRS detection for simulated and real signals, respectively, and a median Pearson correlation coefficient of 0.89 for simulated data. Each model requires only 80,144 parameters (32-bit precision), occupying a total of 0.62 MB of memory -- representing a 5× reduction compared to the state-of-the-art approach while maintaining comparable detection accuracy.
Paperid:3372
Authors:Hugo Massaroli · Leonardo Iara · Emmanuel Iarussi · Viviana Siless
Abstract:
Large language models (LLMs) are increasingly deployed in real-world applications, yet concerns about their fairness persist, especially in high-stakes domains like criminal justice, education, healthcare, and finance. This paper introduces a transparent evaluation protocol for benchmarking the fairness of open-source LLMs using smart contracts on the Internet Computer Protocol (ICP) blockchain. Our method ensures verifiable, immutable, and reproducible evaluations by executing on-chain HTTP requests to hosted Hugging Face endpoints and storing datasets, prompts, and metrics directly on-chain. We benchmark Llama, DeepSeek, and Mistral models on two fairness-sensitive datasets: COMPAS for recidivism prediction and PISA for academic performance forecasting. Fairness is assessed using statistical parity, equal opportunity, and structured Context Association Metrics (CAT). We further extend our analysis with a multilingual evaluation across English, Spanish, and Portuguese using the Kaleidoscope benchmark, revealing cross-linguistic disparities. All code and results are open source, enabling community audits and longitudinal fairness tracking across model versions.
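The two group-fairness metrics named above have simple definitions. The sketch below is illustrative only (stdlib-only, not tied to the paper's on-chain protocol): it computes the statistical-parity gap and the equal-opportunity gap from binary predictions, labels, and a binary group attribute.

```python
def statistical_parity_gap(y_pred, group):
    """|P(yhat = 1 | g = 0) - P(yhat = 1 | g = 1)| for a binary group attribute."""
    def rate(g):
        preds = [p for p, gr in zip(y_pred, group) if gr == g]
        return sum(preds) / len(preds)
    return abs(rate(0) - rate(1))

def equal_opportunity_gap(y_true, y_pred, group):
    """Absolute difference in true-positive rates between the two groups."""
    def tpr(g):
        preds = [p for t, p, gr in zip(y_true, y_pred, group) if gr == g and t == 1]
        return sum(preds) / len(preds)
    return abs(tpr(0) - tpr(1))
```

Statistical parity compares positive-prediction rates unconditionally, while equal opportunity conditions on the true label being positive.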
Paperid:3373
Authors:Vladimir Araujo · Marie-Francine Moens · Tinne Tuytelaars
Abstract:
Parameter-efficient fine-tuning (PEFT) techniques are increasingly applied to pre-trained language models (PLMs) for continual learning (CL). Typically, these methods train a PEFT module for each new task and use similarity-based selection to route modules during inference. However, they suffer from two key issues: interference with previously trained modules and suboptimal module composition. We introduce L2R, a strategy that isolates the training of new PEFT modules to secure task-specific specialization. L2R then learns to combine these modules by training a router network that leverages a small memory of prior task examples. We evaluate our approach in two CL setups across various benchmarks. Our findings show that L2R effectively composes PEFT modules, resulting in improved generalization and performance compared to other methods.
Paperid:3374
Authors:Paula Leandra Loeblein Pinto · Gustavo Silva · Rafael Sousa
Abstract:
This study explores the use of Machine Learning to track anxiety in pregnant women through voice analysis of data collected in a maternal health initiative, processed with MFCC, OpenSMILE, and PyAudio in 8-second windows. Ten classification algorithms were initially tested, with five thoroughly evaluated using AUC, F1-score, and accuracy. Logistic Regression using PyAudio features yielded the best results (AUC=0.705, F1-score=0.51), and demographic data improved performance, while ensemble methods may enhance stability, despite the limited dataset. Although promising, these findings require validation on a larger dataset, and the modeling will be refined for robustness and clinical applicability. These results suggest voice analysis represents a promising approach for anxiety screening in pregnancy, offering a potentially accessible and non-invasive tool for early identification and support.
Paperid:3375
Authors:Itzel Tlelo-Coyotecatl · Hugo Jair Escalante
Abstract:
The massive rise of multimodal content demands automatic methods capable of handling such complex information. Tasks oriented to analyzing video content can benefit from multimodal representation learning; however, challenges related to high-dimensional data and the capture of cross-modal relationships remain open to research. Taking inspiration from attention mechanisms and the properties of the locally linear embedding technique, we propose two cross-modal variants that reconstruct one modality using information from another, aiming to leverage local neighborhood information and preserve geometric information across modalities. We evaluate our generated cross-modal representations on a video-based sarcasm detection task, showing competitive performance considering the dimensionality reduction. Our results and analysis indicate the potential of incorporating geometry awareness and dimensionality reduction into multimodal approaches.
Paperid:3376
Authors:Rosa Encinas · Felipe Almeida
Abstract:
Environmental monitoring of contaminated urban zones is fundamental in risk management and decision-making. While the use of Machine Learning in these contexts is on the rise, its application to underground gas and vapor monitoring data remains limited. This study evaluates Supervised ML models for analyzing multivariate time series of gas emissions---including CH$_4$, CO$_2$, O$_2$, H$_2$S, and CO---from contaminated soil and subsoil environments. The dataset comprises observations from 131 gas monitoring wells across 14 buildings on the USP Leste campus in São Paulo, Brazil, collected between 2014 and 2022. Five regression models were tested: Linear Regression, k-Nearest Neighbors, Decision Tree, Random Forest, and XGBoost. Model performance was assessed using $R^2$, MAE, and RMSE. Random Forest consistently achieved superior performance in the per-well modeling scenario (Experiment 1), demonstrating lower error rates and effective pattern recognition across individual monitoring wells. In the all-well configuration (Experiment 2), the best-performing model varied depending on feature composition, with DT, RF, and XGBoost excelling in different settings.
Paperid:3377
Authors:Agatha Mattos · Gavin McArdle · Michela Bertolotto
Abstract:
In this paper, we introduce SlumMapVisionBR, a new dataset composed of freely available medium-resolution images of over one hundred municipalities in Brazil and ground-truth data with slum areas for each location. To allow comparison with future studies, we benchmark this dataset utilising two machine learning methods: multispectral data paired with a random forests classifier and deep learning convolution neural networks. We report our results using common metrics used in slum mapping: accuracy and mean intersection over union. Our experiments show that deep learning performed, on average, better than multispectral data classification when both were evaluated using the same experimental setting. The average accuracy and mean intersection over union were 0.89 and 0.47 for the deep learning method and 0.77 and 0.42 for multispectral data classification. Our dataset and the code to reproduce results have been included in this paper submission and will be made available online upon acceptance of the paper.
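Mean intersection over union, the second metric reported above, can be computed from flattened label maps as follows. This is a minimal illustrative sketch (the function name is invented; real slum-mapping evaluations operate on per-pixel segmentation masks): for each class, intersection is the count of positions where prediction and ground truth agree on that class, and union is the count of positions where either assigns it.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean intersection-over-union across classes, from flat label lists."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious)
```

For binary slum/non-slum maps, num_classes would be 2 and the lists would be the flattened pixel labels.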
Paperid:3378
Authors:Mariana Risco Cosavalente · Claudio Jung · Sharon C Quispe
Abstract:
This paper compares the performance of pretrained and fine-tuned Vision-Language Models (SigLIP2 Naflex and SigLIP Multilingual) for classifying disasters in images scraped from social media in Peru and Brazil. While zero-shot inference showed limited results, fine-tuning both models improved performance, with SigLIP2 slightly outperforming the others. The work highlights areas for further improvement in disaster detection models for Latin America.
Paperid:3379
Authors:Mateo Espinosa Zarlenga · Gabriele Dominici · Pietro Barbiero · Zohreh Shams · Mateja Jamnik
Abstract:
In this paper, we investigate how concept-based models (CMs) respond to out-of-distribution (OOD) inputs. CMs are interpretable neural architectures that first predict a set of high-level concepts (e.g., "stripes", "black") and then predict a task label from those concepts. In particular, we study the impact of concept interventions (i.e., operations where a human expert corrects a CM’s mispredicted concepts at test time) on CMs' task predictions when inputs are OOD. Our analysis reveals a weakness in current state-of-the-art CMs, which we term leakage poisoning, that prevents them from properly improving their accuracy when intervened on for OOD inputs. To address this, we introduce MixCEM, a new CM that learns to dynamically exploit leaked information missing from its concepts only when this information is in-distribution. Our results across tasks with and without complete sets of concept annotations show that MixCEMs outperform strong baselines, significantly improving their accuracy for both in-distribution and OOD samples in the presence and absence of interventions.
Paperid:3380
Authors:Angie Marquina · Fidel Cruz · Juan Ochoa
Abstract: Accurate mass flow estimation of sugarcane under mechanized harvesting conditions is critical for operational efficiency in agriculture. Current approaches rely on sensor-based methods, but they involve high repair costs, environmental sensitivity, and uncertain benefits. In this paper, we introduce MassFlow, a vision-based model that combines a 3D ResNet (R3D) backbone with optical flow-guided attention to capture spatiotemporal patterns under guided conditions. We propose a two-phase training approach: the first phase optimizes the alignment between flow and feature maps, and the second phase focuses on mass prediction and gradient regularization. Trained on real-world videos from harvesting operations, our experiments show that MassFlow achieves a mean absolute error (MAE) of 8.6% in per-window evaluation and a mean absolute percentage error (MAPE) of 8.96 in per-video evaluation, outperforming a prior 2D-CNN baseline. These results demonstrate that combining spatiotemporal features with flow-guided attention improves mass estimation and reduces dependence on sensor-based systems.